CN103136378B - A data recovery method based on a structural summary - Google Patents

A data recovery method based on a structural summary

Info

Publication number
CN103136378B
Authority
CN
China
Prior art keywords
node
data
interested
mode
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310100931.4A
Other languages
Chinese (zh)
Other versions
CN103136378A (en)
Inventor
陈琳
陈海涛
夏冬
王奎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TONGFANG KNOWLEDGE NETWORK (BEIJING) TECHNOLOGY Co Ltd
Original Assignee
TONGFANG KNOWLEDGE NETWORK (BEIJING) TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TONGFANG KNOWLEDGE NETWORK (BEIJING) TECHNOLOGY Co Ltd
Priority to CN201310100931.4A
Publication of CN103136378A
Application granted
Publication of CN103136378B
Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data recovery method based on a structural summary, comprising: analyzing the XML document tree data, computing the data information corresponding to the leaf nodes of the XML document, and storing and indexing that leaf-node data information; parsing an XPath query statement into a twig pattern tree, splitting the pattern tree into simple paths, and marking the set of pattern nodes of interest; recording the set of sequences of pattern nodes of interest and merging pattern node sequences; and, according to the recorded set of sequences, recovering the set of data nodes that match it. The invention uses the structural summary of the XML data together with a path-based index, and recovers the required data during index matching and selection. It improves on existing methods in two respects, the scope of data recovery and the timing of data recovery, so that the required data can be recovered correctly, efficiently, and precisely, avoiding the inaccuracy and redundancy of existing methods.

Description

A data recovery method based on a structural summary
Technical field
The present invention relates to the field of XML database querying, and in particular to a data recovery method based on a structural summary in an XML database.
Background
XML (Extensible Markup Language) is a semi-structured data format. With the rapid development of Internet technology, XML has become the de facto standard for data representation and exchange in Web-based applications. Its distinguishing features are that it is self-describing, semi-structured, and hierarchically nested.
XQuery and XPath are the XML query languages standardized by the W3C. XQuery is a Turing-complete programming language that can express rich processing logic over XML data. XPath can be regarded as a subset of XQuery; it extracts matching data from XML data by means of path patterns.
Twig pattern matching is a technique for matching and selecting data in semi-structured data by means of a tree pattern. It is considered the core operation of XQuery/XPath and is commonly used to implement XQuery/XPath engines. Its input is a pattern tree to be matched together with the stream of data nodes corresponding to each pattern node in that tree; its output is the set of data-node tuples that satisfy the pattern. To speed up twig pattern matching algorithms, the prior art provides support on two fronts: storage and indexing.
First, as to storage: in an XML database, node-based storage gives finer-grained control over XML data. An XML document can be modeled as a tree whose nodes are either leaf nodes or intermediate nodes. Because the actual XML data is stored in the leaf nodes, and intermediate nodes can be regarded as purely logical structure, a scheme that stores only leaf-node data improves the space efficiency of storage and reduces data I/O.
Second, as to indexing: a path-based index is usually adopted, in which a simple path serves as the index definition. A path is simple if it runs from the root to some node and none of the nodes on it involves complex logic such as predicates. A path-based index speeds up access to the data matching a given simple path, and is typically used to supply a twig pattern matching algorithm with the node data streams it needs as input.
These storage and indexing techniques improve the performance of twig pattern matching algorithms. However, queries over XML data, such as XPath queries, often need to select on or refer to intermediate nodes, whereas both storage and indexing deal only with the leaf nodes of the XML data, and a path index likewise returns only the leaf nodes at the end of a path. Query evaluation therefore requires some data recovery.
Current data recovery schemes usually encode leaf nodes with a prefix encoding at storage time and combine this with an auxiliary data structure (such as a structural summary); during twig pattern matching, all data nodes between a leaf node and the root node are recovered from this information. This approach guarantees a correct recovery scope, but it introduces redundant data, particularly when the data contains nodes with the same name.
The timing of recovery is another important problem. The usual approach at present is to recover from the leaf nodes obtained from the path index at twig matching time, and then to apply the twig matching algorithm. This timing adds considerable complexity. To determine precisely whether a node on the path needs to be recovered, especially when nodes with the same name occur on the path, each node on the path must be checked against the path structure. Because this check is needed while recovering every node, it seriously degrades the efficiency of data recovery. If, on the other hand, intermediate nodes are not recovered precisely, redundant nodes are inevitably introduced and the amount of data the algorithm must process grows.
Summary of the invention
To solve the above problems and defects, the present invention provides a data recovery method based on a structural summary. When an XML database stores only leaf-node information and queries are evaluated through a path index, the method reconstructs the intermediate-node data that is required. The technical solution is as follows:
A data recovery method based on a structural summary, comprising:
analyzing the XML document tree data, computing the data information corresponding to the leaf nodes of the XML document, and storing and indexing that leaf-node data information;
parsing an XPath query statement into a twig pattern tree, splitting the pattern tree into simple paths, and marking the set of pattern nodes of interest;
recording the set of sequences of pattern nodes of interest, and merging pattern node sequences;
according to the recorded set of sequences of pattern nodes of interest, recovering the set of data nodes that match it.
The beneficial effects of the technical solution provided by the invention are:
The structural summary of the XML data and a path-based index are used to recover the required data during index matching and selection. Existing methods are improved in two respects, the scope of data recovery and the timing of data recovery, so that the required data can be recovered correctly, efficiently, and precisely, avoiding the inaccuracy and redundancy of existing methods.
Brief description of the drawings
Fig. 1 is a flowchart of the data recovery method based on a structural summary;
Fig. 2 is an example of XML data;
Fig. 3 is an example of a tree pattern query;
Fig. 4 is an example of a structural summary;
Fig. 5 is an example of the index matching process;
Fig. 6 is an example of data recovery.
Detailed description
To make the objects, technical solutions, and advantages of the present invention clearer, embodiments of the invention are described in further detail below with reference to the accompanying drawings:
Referring to Fig. 1, the flow of the data recovery method based on a structural summary comprises the following steps:
Step 10: analyze the XML document tree data, compute the data information corresponding to the leaf nodes of the XML document, and store and index that leaf-node data information;
After the XML document tree data has been analyzed, the leaf nodes are encoded with a prefix encoding and the structural summary information corresponding to each leaf node is computed. The data information of the leaf node, its prefix code, and the ID value of its corresponding node in the structural summary are then stored and indexed with a path-based index.
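Step 10 can be illustrated with a minimal Python sketch. It is not the patented implementation: the sample document, the element names, the Dewey-style numbering, and the function name `build_store` are all assumptions made for illustration. The sketch walks the document once, assigns each node a prefix code, numbers each distinct label path as a node of a DataGuide-like structural summary, and stores and indexes only the leaf nodes.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample data modelled on the book example; all names are assumptions.
XML = """<books>
  <book><name>T1</name><author><name><firstname>A</firstname><lastname>Gao</lastname></name><email>a@x</email></author></book>
  <book><name>T2</name><author><name><firstname>B</firstname><lastname>Li</lastname></name><email>b@x</email></author></book>
</books>"""

def build_store(xml_text):
    """Return (summary, leaves, path_index).

    summary:    label path -> summary node ID (DataGuide-like structural summary)
    leaves:     list of (prefix_code, summary_id, text), leaf elements only
    path_index: label path -> positions in `leaves`
    """
    summary, leaves, path_index = {}, [], defaultdict(list)

    def visit(elem, code, path):
        path = path + (elem.tag,)
        # Each distinct label path gets one summary node, numbered in first-visit order.
        sid = summary.setdefault(path, len(summary) + 1)
        children = list(elem)
        if not children:
            # Only leaf nodes are stored and indexed.
            path_index[path].append(len(leaves))
            leaves.append((code, sid, (elem.text or "").strip()))
        for i, child in enumerate(children, start=1):
            visit(child, code + (i,), path)

    visit(ET.fromstring(xml_text), (1,), ())
    return summary, leaves, dict(path_index)

summary, leaves, path_index = build_store(XML)
```

With this sample document, the lastname leaf of the first book is stored with prefix code 1.1.2.1.2 and summary ID 7, while intermediate nodes such as book and author appear only in the summary, not in the leaf store.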
Step 20: parse the XPath query statement into a twig pattern tree, split the pattern tree into simple paths, and mark the set of pattern nodes of interest;
Each of the simple paths above corresponds to one pattern node sequence. In general, a twig pattern tree is a tree with several branches, from which a simple path from the root to each leaf can be extracted. In a query method based on a path index, only some of the pattern nodes in the pattern tree are needed for pattern matching; these are the pattern nodes of interest for data recovery. Marking the nodes of interest makes processing more flexible and avoids unnecessary data recovery.
Pattern nodes of interest fall into four classes. In general, the branch nodes and leaf nodes of a twig pattern tree are needed for pattern matching; a branch node is a pattern node with more than one child. Two further classes need to be marked: nodes for which position counting must be performed in a postorder Twig algorithm, and return nodes, that is, the one or more pattern nodes (with their subtrees) whose matching query results must be returned to the user after pattern matching.
The tree in Fig. 2 shows the structure of an example XML data set: a collection of book information containing three books. Each book has a title (name) and an author; the author information comprises a name and an e-mail address; and the author's name comprises firstname and lastname. For simplicity, irrelevant data content is omitted from the figure.
Suppose that, for the data shown in Fig. 2, we want the titles of the books written by authors whose last name is Gao. The query can be expressed as: //book[author/name/lastname="Gao"]/name. This is a simple XPath query statement and can be transformed into the tree pattern query shown in Fig. 3. A query based on the path index first splits this pattern into two simple paths: 1) //book/name; 2) //book/author/name/lastname="Gao", and marks the pattern nodes of interest.
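The splitting and marking in step 20 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patent's implementation: the `PatternNode` class and the helper names are hypothetical, the pattern tree is built by hand rather than parsed from XPath, and nodes that need position counting are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PatternNode:
    label: str
    children: List["PatternNode"] = field(default_factory=list)
    is_return: bool = False        # result node handed back to the user
    value: Optional[str] = None    # optional equality predicate on a leaf

def simple_paths(root):
    """Enumerate the root-to-leaf node sequences of the pattern tree."""
    paths = []

    def walk(node, prefix):
        prefix = prefix + [node]
        if not node.children:
            paths.append(prefix)
        for child in node.children:
            walk(child, prefix)

    walk(root, [])
    return paths

def is_of_interest(node):
    # Branch nodes, leaf nodes and return nodes; position-counting nodes
    # are not modelled in this sketch.
    return len(node.children) > 1 or not node.children or node.is_return

# Pattern tree for //book[author/name/lastname="Gao"]/name, built by hand.
title = PatternNode("name", is_return=True)
lastname = PatternNode("lastname", value="Gao")
book = PatternNode("book", [title, PatternNode("author", [PatternNode("name", [lastname])])])

paths = simple_paths(book)
labels = [[n.label for n in p] for p in paths]
marked = [[n.label for n in p if is_of_interest(n)] for p in paths]
```

Here `labels` holds the two simple paths book/name and book/author/name/lastname, and `marked` keeps only book (a branch node) plus the leaf of each path, so the intermediate author and name pattern nodes are not marked for recovery.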
Step 30: record the set of sequences of pattern nodes of interest, and merge pattern node sequences;
This step matches against the structural summary by navigational path matching to find the target nodes, and in the process records the set of matched node sequences of interest. For example, as shown in Fig. 4, the structural summary takes the form of a DataGuide, and each node is assigned an ID value (the number in parentheses in the figure). For the two paths obtained in step 20, the path-based index can be used to obtain the name and lastname data streams. Selecting the index amounts to matching the simple path navigationally against the structural summary shown in Fig. 4. For path 2), as shown in Fig. 5, matching first evaluates //book and obtains the set of node sets {{2}:s}; from each node in that set it then obtains the set of node sets {{4}:2}; from that set it obtains {{5}:4}; and from that set it obtains {{7}:5}, where node 7 is the required lastname node. In these sets, the value after the colon indicates the node relative to which the node set was evaluated: node 7, for instance, was evaluated relative to node 5, and the first node set was evaluated relative to the start position, marked s. Note that in this example each set of node sets has only one element, and each node set within it also has only one element. In some cases, however, such a set may contain several elements, so that Fig. 5 takes on a tree-like structure. The data shown in Fig. 5 is also the key data that must be recorded. If only branch nodes are of interest, only {{2}:s} is recorded. When recording these sets of node sets, note that the same target node may be reached through different matching processes, in which case the pattern node sequences from those matching processes must be merged (the merged nodes must also lie on the same path). In addition, different pattern nodes may lie on the same path in the structural summary, and such node sequences must be merged as well.
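The navigational matching of step 30 can be sketched in Python as below. The summary contents are an assumption: only the IDs 2, 4, 5, and 7 (book, author, name, lastname) come from the text, and the remaining entries and all function names are hypothetical. The sketch handles a leading descendant step followed by child steps only, and it does not implement the merging of sequences described above.

```python
# Hypothetical structural summary (DataGuide style): id -> (label, parent id).
SUMMARY = {
    1: ("books", None), 2: ("book", 1), 3: ("name", 2), 4: ("author", 2),
    5: ("name", 4), 6: ("firstname", 5), 7: ("lastname", 5), 8: ("email", 4),
}

def children(sid):
    return [i for i, (_, parent) in SUMMARY.items() if parent == sid]

def match_simple_path(labels):
    """Match //l1/l2/.../ln against the summary.

    Returns one list per step; each entry is (node_set, reference), meaning
    node_set was evaluated relative to summary node `reference` ("s" = start).
    """
    first = frozenset(i for i, (lab, _) in SUMMARY.items() if lab == labels[0])
    steps = [[(first, "s")]] if first else [[]]
    for label in labels[1:]:
        current = []
        for node_set, _ in steps[-1]:
            for ref in sorted(node_set):
                hit = frozenset(c for c in children(ref) if SUMMARY[c][0] == label)
                if hit:
                    current.append((hit, ref))
        steps.append(current)
    return steps

def id_sequences(steps):
    """Follow the recorded references back from each target node to the start."""
    seqs = []
    for node_set, ref in steps[-1]:
        for target in sorted(node_set):
            seq, cur_ref = [target], ref
            for level in range(len(steps) - 2, -1, -1):
                seq.append(cur_ref)
                # Find the entry one level up that contains the reference node.
                cur_ref = next(r for ns, r in steps[level] if cur_ref in ns)
            seqs.append(seq[::-1])
    return seqs

steps = match_simple_path(["book", "author", "name", "lastname"])
sequences = id_sequences(steps)
```

For path 2) the recorded steps are {{2}:s}, {{4}:2}, {{5}:4}, and {{7}:5}, and following the references back from node 7 yields the ID sequence 2, 4, 5, 7.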
Step 40: according to the recorded set of sequences of pattern nodes of interest, recover the set of data nodes that match it.
The process of determining the index path reflects exactly how each node was matched, and can be used to determine which nodes a given node passes through on its way to the root node. The path determined this way is unique for each node, so any required data can be recovered precisely.
If the pattern nodes of interest are all the nodes on the pattern paths, full recovery is performed. If the pattern nodes of interest are the branch nodes, only their corresponding data nodes are recovered: for the twig pattern shown in Fig. 3 and the summary information of Fig. 4, only the node with ID=2 needs to be recovered (this relationship was recorded during the matching process above). Fig. 6 shows an example of fully recovering the data on a path. Suppose the leaf nodes use a prefix encoding, that the code of the leftmost lastname node in Fig. 2 is 1.1.2.1.2, and that the ID of this node in the structural summary is known to be 7. By the properties of prefix encoding, the codes of the data nodes on the path from this node to the root node are, in order, 1.1.2.1.2, 1.1.2.1, 1.1.2, 1.1, and 1. From the matching process shown in Fig. 5, it is also easy to see how this node was matched: if the nodes of interest are all the nodes on the pattern path, the sequence of node IDs (IDs in the structural summary) in this process is 2, 4, 5, 7. The data of the intermediate nodes along the whole path can therefore be recovered from the nodes' ID information in the structural summary together with the prefix codes. Note that this example illustrates full recovery in order to demonstrate the feasibility and correctness of recovery; the recovery strategy of the invention is not limited to this, and it is also possible to recover only the data nodes that match the four classes of pattern nodes of interest.
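The recovery in step 40 can be sketched as follows. The function names are hypothetical, and the sketch assumes that every step after the first in the simple path is a child step, so that the recorded ID sequence lines up with the deepest prefixes of the leaf's code; this is a simplification, not a statement of how the patent handles other axes.

```python
def ancestors_and_self(code):
    """Prefix codes of every node on the path from `code` up to the root."""
    parts = code.split(".")
    return [".".join(parts[:k]) for k in range(len(parts), 0, -1)]

def recover(leaf_code, id_sequence, of_interest=None):
    """Pair summary IDs with the prefix codes of the matching data nodes.

    id_sequence lists the summary IDs matched from the first pattern step
    down to the leaf, e.g. [2, 4, 5, 7]. Only IDs in `of_interest` are
    returned; None means full recovery.
    """
    # Take the deepest len(id_sequence) codes and order them top-down.
    codes = ancestors_and_self(leaf_code)[:len(id_sequence)][::-1]
    return {sid: c for sid, c in zip(id_sequence, codes)
            if of_interest is None or sid in of_interest}

full = recover("1.1.2.1.2", [2, 4, 5, 7])
branch_only = recover("1.1.2.1.2", [2, 4, 5, 7], of_interest={2})
```

In this sketch `full` maps IDs 2, 4, 5, and 7 to the codes 1.1, 1.1.2, 1.1.2.1, and 1.1.2.1.2, while `branch_only` recovers just the book node (ID 2, code 1.1), which corresponds to the case where only branch nodes are of interest.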
The foregoing is only a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (4)

1. A data recovery method based on a structural summary, characterized in that the method comprises:
analyzing the XML document tree data, computing the data information corresponding to the leaf nodes of the XML document, and storing and indexing that leaf-node data information;
parsing an XPath query statement into a twig pattern tree, splitting the pattern tree into simple paths, and marking the set of pattern nodes of interest;
recording the set of sequences of pattern nodes of interest and merging pattern node sequences; wherein the structural summary is matched by navigational path matching to find the target nodes, and in the process the set of matched node sequences of interest is recorded;
according to the recorded set of sequences of pattern nodes of interest, recovering the set of data nodes that match it; wherein the process of determining the index path reflects exactly how each node was matched and can be used to determine which nodes a given node passes through on its way to the root node, the path so determined being unique for each node, so that any required data can be recovered precisely; if the pattern nodes of interest are all the nodes on the pattern paths, full recovery is performed; if the pattern nodes of interest are the branch nodes, only their corresponding data nodes are recovered;
wherein each simple path corresponds to one pattern node sequence; and the pattern nodes of interest comprise the branch nodes and leaf nodes of the twig pattern tree, the nodes for which position counting is performed in the Twig algorithm, and the return nodes.
2. The data recovery method based on a structural summary according to claim 1, characterized in that the leaf-node data information comprises prefix code information and the ID value of the corresponding node in the structural summary.
3. The data recovery method based on a structural summary according to claim 1, characterized in that each simple path corresponds to one pattern node sequence.
4. The data recovery method based on a structural summary according to claim 1, characterized in that the pattern nodes of interest comprise the branch nodes and leaf nodes of the twig pattern tree, the nodes for which position counting is performed in the Twig algorithm, and the return nodes.
CN201310100931.4A 2013-03-27 2013-03-27 A data recovery method based on a structural summary Active CN103136378B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310100931.4A CN103136378B (en) 2013-03-27 2013-03-27 A data recovery method based on a structural summary

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310100931.4A CN103136378B (en) 2013-03-27 2013-03-27 A data recovery method based on a structural summary

Publications (2)

Publication Number Publication Date
CN103136378A CN103136378A (en) 2013-06-05
CN103136378B true CN103136378B (en) 2016-04-20

Family

ID=48496203

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310100931.4A Active CN103136378B (en) 2013-03-27 2013-03-27 A data recovery method based on a structural summary

Country Status (1)

Country Link
CN (1) CN103136378B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107016071B (en) * 2017-03-23 2019-06-18 中国科学院计算技术研究所 A kind of method and system using simple path characteristic optimization tree data

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102339252A (en) * 2011-07-25 2012-02-01 大连理工大学 Static state detecting system based on XML (Extensive Makeup Language) middle model and defect mode matching

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
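Twig pattern filtering algorithm for XML data stream systems (XML数据流系统的小枝模式过滤算法); 李永锋; Computer Engineering (《计算机工程》); 20100531; full text *
Query processing on XML data with low-quality tags (标签劣质的XML 数据上的查询处理); 姜国华 et al.; Journal of Frontiers of Computer Science and Technology; 20111231; full text *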

Also Published As

Publication number Publication date
CN103136378A (en) 2013-06-05

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant