CN101777045A

CN101777045A - Method for analyzing XML file by indexing

Info

Publication number: CN101777045A
Application number: CN200810150767A
Authority: CN
Inventors: 杨刚; 周兴社; 张海辉; 詹涛
Original assignee: Northwestern Polytechnical University
Current assignee: Northwestern Polytechnical University
Priority date: 2008-09-01
Filing date: 2008-09-01
Publication date: 2010-07-14

Abstract

The invention relates to a method for parsing an XML file by indexing. The method comprises the following steps: traversing the DTD file that describes the structure of the XML file and extracting the subtree tag names under the root node; creating a hash table, traversing the XML file to be parsed according to the extracted subtree tag names, finding and recording the start relative position of each subtree in the XML file, constructing a new entry from these data items, and adding it to the hash table to form a subtree index table; creating a key element index table; and then parsing with either the non-validating or the validating IXP parsing mode. The benefit of the method is that, for large XML files, parsing with the IXP method is far faster than with the DOM and SAX methods. By providing a general interface, the method can be applied widely to the parsing of various XML files and offers a new approach to XML text parsing.

Description

A method for parsing XML files by indexing

Technical field

The present invention relates to a method for parsing XML files by indexing and belongs to the field of XML information processing.

Background art

XML (eXtensible Markup Language), proposed by the W3C, has become the information organization and description format generally adopted by Web 2.0 applications and by information processing systems of all kinds because of its flexibility and self-describing nature, and its use continues to grow. XML parsers in wide use at present include IBM's XML4J, Microsoft's MSXML, Oracle's XML Parser, Sun's Java(TM), and JDOM. The parsing methods these parsers use fall into two broad classes: DOM and SAX.

DOM is a tree-based parsing method proposed by the W3C. When parsing an XML file, DOM treats the elements, attributes, comments, and processing instructions in the document as nodes of a tree and organizes the document content into a tree-shaped information structure. Because DOM must build a tree data structure corresponding to the XML file being parsed, the memory it occupies is proportional to the document size (generally 2 to 5 times the size), so it is not suitable for large XML documents. Researchers have made some improvements to DOM parsing performance, chiefly deferred DOM parsing (DDOM) and compressed DOM parsing (SEDOM). DDOM improves on DOM by not constructing the complete parse tree; it builds parts of the DOM tree on demand as the document is accessed. DDOM is mainly suited to sparse access to an XML document; when most of the document must be accessed, DDOM is slower than ordinary DOM. SEDOM introduces compression and markedly reduces the storage consumed by ordinary DOM parsing, but the compression work inevitably costs parsing performance.
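As a rough illustration of the DOM style described above, the following Python sketch builds the whole tree in memory and then accesses nodes at random. The sample document, its tag names other than Book and ISBN, and the function name are assumptions made for illustration; they are not taken from the patent.

```python
from xml.dom import minidom

# Hypothetical sample data; only the Book and ISBN names and the ISBN value
# "7-302-04517" appear in the text, everything else is assumed.
SAMPLE_XML = (
    "<BookSet>"
    "<Book><ISBN>7-302-04517</ISBN><Title>XML Basics</Title><Author>Li Ming</Author></Book>"
    "<Book><ISBN>7-111-12345</ISBN><Title>Indexing</Title><Author>Wang Fang</Author></Book>"
    "</BookSet>"
)


def author_of(xml_text, isbn):
    """Look up an author by ISBN the DOM way: parse everything, then navigate."""
    document = minidom.parseString(xml_text)  # the complete tree is built here
    for book in document.getElementsByTagName("Book"):
        isbn_node = book.getElementsByTagName("ISBN")[0]
        if isbn_node.firstChild.data == isbn:
            return book.getElementsByTagName("Author")[0].firstChild.data
    return None
```

The memory cost comes from the `parseString` call, which materializes every node before the first lookup can run.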

SAX is another widely used XML parsing method, proposed by members of the XML-DEV mailing list. The core of SAX is a linear scan of the XML document: the parser finds the tags the user cares about and fires the corresponding events, and the user's access to and parsing of the document is done in the event handlers. SAX can parse files of any size, is simple to implement, and consumes few resources. Its parsing efficiency is comparatively high when the document content does not need to be changed and access is sequential.
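The event-driven scan described above can be sketched with Python's standard `xml.sax` module. The sample document and the handler class are illustrative assumptions, not part of the patent.

```python
import xml.sax

# Hypothetical sample data, assumed for illustration only.
SAMPLE_XML = (
    "<BookSet>"
    "<Book><ISBN>7-302-04517</ISBN><Author>Li Ming</Author></Book>"
    "<Book><ISBN>7-111-12345</ISBN><Author>Wang Fang</Author></Book>"
    "</BookSet>"
)


class AuthorHandler(xml.sax.ContentHandler):
    """Collects the text of every Author element seen during the linear scan."""

    def __init__(self):
        super().__init__()
        self.in_author = False
        self.authors = []

    def startElement(self, name, attrs):
        if name == "Author":
            self.in_author = True
            self.authors.append("")

    def characters(self, content):
        if self.in_author:
            self.authors[-1] += content  # text may arrive in several chunks

    def endElement(self, name):
        if name == "Author":
            self.in_author = False


def collect_authors(xml_text):
    handler = AuthorHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.authors
```

No tree is kept, so memory use stays small, but every query has to rescan the document from the start.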

Matthias points out that, in information systems that use XML as the general format for describing and organizing data, the XML parsing method has a direct and significant influence on system performance. DOM can build the complete structure of an XML document and supports random access, but it consumes considerable computing resources and is not suited to fast parsing of large XML documents. SAX consumes fewer resources and can parse large XML documents efficiently, but it offers neither random access to the document nor the ability to modify it online.

Summary of the invention

Technical problem to be solved

To overcome the deficiencies of the prior art, the present invention proposes a method for parsing XML files by indexing that improves parsing speed and retrieval efficiency for large XML documents, reduces resource consumption, and provides random access to the XML document.

Technical solution

A method for parsing XML files by indexing introduces an index mechanism into the parsing process and uses the index to speed up random access to each element in the XML document. The method is characterized in that its steps are divided into an initialization phase and a parsing phase:

Initialization phase:

Step 1: traverse the DTD document that describes the structure of the XML document and extract the subtree tag names under the root node in the DTD document;

Step 2: first create an empty hash table; then, according to the extracted subtree tag names, traverse the XML document to be parsed, find and record the start relative position of each subtree in the XML document, construct a new entry according to the entry structure, and add it to the hash table to form the subtree index table. The entry structure for a read-only XML document is: index number, start relative position. The entry structure for a non-read-only XML document is: index number, start relative position, modified flag;

Step 3: create the key element index table: first create an empty hash table; traverse the XML document by the tag name of the key element and extract the value of every occurrence of that tag in the document; look up the subtree index table to obtain the index number of the subtree containing the key element; and insert an entry with the structure (subtree index number, value of the key element tag) into the hash table to form the index table for this key element;
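Steps 2 and 3 can be sketched as follows in Python, using dictionaries as the hash tables. This is a minimal sketch under stated assumptions: the document is held in memory as a string, offsets are character positions, the regular-expression scan stands in for whatever traversal the actual parser uses, and the function names and sample data are hypothetical.

```python
import bisect
import re

# Hypothetical sample data, assumed for illustration only.
SAMPLE_XML = (
    "<BookSet>"
    "<Book><ISBN>7-302-04517</ISBN><Author>Li Ming</Author></Book>"
    "<Book><ISBN>7-111-12345</ISBN><Author>Wang Fang</Author></Book>"
    "</BookSet>"
)


def build_subtree_index(xml_text, subtree_tag, read_only=True):
    """Step 2: map index number -> entry with the subtree's start position."""
    index = {}
    opener = re.compile(r"<%s[\s>]" % re.escape(subtree_tag))
    for number, match in enumerate(opener.finditer(xml_text)):
        entry = {"start": match.start()}
        if not read_only:
            entry["modified"] = False  # extra field for non-read-only documents
        index[number] = entry
    return index


def build_key_index(xml_text, subtree_index, key_tag):
    """Step 3: map each key element value -> index number of its subtree."""
    numbers = sorted(subtree_index, key=lambda n: subtree_index[n]["start"])
    starts = [subtree_index[n]["start"] for n in numbers]
    key = re.compile(r"<{0}>(.*?)</{0}>".format(re.escape(key_tag)), re.S)
    key_index = {}
    for match in key.finditer(xml_text):
        # The owning subtree is the last one that starts before this key tag.
        pos = bisect.bisect_right(starts, match.start()) - 1
        if pos >= 0:
            key_index[match.group(1).strip()] = numbers[pos]
    return key_index
```

For the sample above, `build_key_index(SAMPLE_XML, build_subtree_index(SAMPLE_XML, "Book"), "ISBN")` maps each ISBN value to the index number of the Book subtree that contains it.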

The parsing phase has two modes: the non-validating IXP parsing mode and the validating IXP parsing mode;

A read-only XML document uses the non-validating IXP mode, as follows:

Step 4: extract the keyword name from the query condition and look up the key element index table with the same name as the keyword; match the value in the query condition against each entry in the key element index table, find the matching entry, and extract its subtree index number. If no index table exists for a key element with the keyword's name, return to step 3;

Step 5: look up the subtree index table by the subtree index number; after finding the matching entry, extract its start relative position, locate the start of the subtree in the XML document from that position, load the whole subtree into memory, traverse within the subtree to find the element value that satisfies the query condition, and return it; the query then ends;
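A query in the non-validating mode can be sketched as below. The sketch assumes an in-memory string, finds the end of a subtree by searching for the next closing tag (which presumes subtrees of the same name are not nested), and uses `xml.etree.ElementTree` only to traverse the one loaded subtree. All names and sample data are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample data, assumed for illustration only.
SAMPLE_XML = (
    "<BookSet>"
    "<Book><ISBN>7-302-04517</ISBN><Author>Li Ming</Author></Book>"
    "<Book><ISBN>7-111-12345</ISBN><Author>Wang Fang</Author></Book>"
    "</BookSet>"
)


def build_indexes(xml_text, subtree_tag, key_tag):
    """Initialization: subtree index (number -> start) and key index (value -> number)."""
    subtree_index, key_index = {}, {}
    closer = "</%s>" % subtree_tag
    opener = re.compile(r"<%s[\s>]" % re.escape(subtree_tag))
    key = re.compile(r"<{0}>(.*?)</{0}>".format(re.escape(key_tag)), re.S)
    for number, match in enumerate(opener.finditer(xml_text)):
        start = match.start()
        end = xml_text.index(closer, start) + len(closer)
        subtree_index[number] = start
        found = key.search(xml_text, start, end)
        if found:
            key_index[found.group(1).strip()] = number
    return subtree_index, key_index


def query(xml_text, subtree_index, key_index, subtree_tag, key_value, wanted_tag):
    """Steps 4 and 5: hash lookups, then traverse only the matching subtree."""
    number = key_index.get(key_value)          # step 4: key value -> index number
    if number is None:
        return None
    start = subtree_index[number]              # step 5: index number -> start position
    closer = "</%s>" % subtree_tag
    end = xml_text.index(closer, start) + len(closer)
    subtree = ET.fromstring(xml_text[start:end])  # only this slice is parsed
    node = next(subtree.iter(wanted_tag), None)
    return None if node is None else node.text
```

Only the slice between the subtree's start and its closing tag is parsed, so the cost of a lookup depends on the subtree size rather than on the document size.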

A non-read-only XML document uses the validating IXP mode, as follows:

Step 4: extract the keyword name from the query condition and look up the key element index table with the same name as the keyword; match the value in the query condition against each entry in the key element index table, find the matching entry, and extract its subtree index number;

Step 5: look up the subtree index table by the subtree index number; after finding the matching entry, extract its "start relative position", locate the start of the corresponding subtree in the XML document from that position, load the whole subtree into memory starting there, traverse within the loaded subtree to find the value of the element that satisfies the query condition, and return it; the query then ends. If the element value also needs to be changed after the element is found, continue with the following step;

Step 6: update the value of the element found in step 5 and set the "modified" flag to "true" in the subtree index table entry for this subtree. When the XML document is closed, first write to the disk file, in order, the in-memory content of every subtree whose "modified" flag in the subtree index table is "true", and then close the file.
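The modified flag and the write-back on close can be sketched as follows. This is an assumption-laden illustration: the "disk file" is replaced by an in-memory string, modified subtrees are spliced back starting from the last one so that earlier start positions stay valid, and the class name and sample data are hypothetical. The patent does not specify how the write-back handles subtrees whose length changes.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample data, assumed for illustration only.
SAMPLE_XML = (
    "<BookSet>"
    "<Book><ISBN>7-302-04517</ISBN><Author>Li Ming</Author></Book>"
    "<Book><ISBN>7-111-12345</ISBN><Author>Wang Fang</Author></Book>"
    "</BookSet>"
)


class IndexedDocument:
    """Hypothetical in-memory stand-in for an XML file opened in validating mode."""

    def __init__(self, text, subtree_tag, key_tag):
        self.text = text
        self.closer = "</%s>" % subtree_tag
        self.subtrees = {}  # index number -> start, modified flag, loaded subtree
        self.keys = {}      # key element value -> index number
        key = re.compile(r"<{0}>(.*?)</{0}>".format(re.escape(key_tag)), re.S)
        opener = re.compile(r"<%s[\s>]" % re.escape(subtree_tag))
        for number, match in enumerate(opener.finditer(text)):
            start = match.start()
            self.subtrees[number] = {"start": start, "modified": False, "tree": None}
            found = key.search(text, start, self._end(start))
            if found:
                self.keys[found.group(1).strip()] = number

    def _end(self, start):
        return self.text.index(self.closer, start) + len(self.closer)

    def _load(self, key_value):
        entry = self.subtrees[self.keys[key_value]]  # steps 4 and 5: two lookups
        if entry["tree"] is None:
            start = entry["start"]
            entry["tree"] = ET.fromstring(self.text[start:self._end(start)])
        return entry

    def get(self, key_value, tag):
        return self._load(key_value)["tree"].find(tag).text

    def update(self, key_value, tag, new_text):
        entry = self._load(key_value)
        entry["tree"].find(tag).text = new_text
        entry["modified"] = True                     # step 6: mark subtree as modified

    def close(self):
        """Write back every modified subtree, last one first, then return the text."""
        ordered = sorted(self.subtrees.values(), key=lambda e: e["start"], reverse=True)
        for entry in ordered:
            if entry["modified"]:
                start = entry["start"]
                end = self._end(start)
                fresh = ET.tostring(entry["tree"], encoding="unicode")
                self.text = self.text[:start] + fresh + self.text[end:]
        return self.text
```

Subtrees that were never changed are not serialized again; only entries whose flag is set are written back.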

Beneficial effects

The method for parsing XML files by indexing proposed by the present invention has the following characteristics:

1. Before any operation on the XML document there is an index-building process, during which the key element index table and the subtree index table are created by specifying the subtree nodes and key nodes. IXP then lets applications run queries based on these tables.

2. During parsing of the XML document, node information is recorded in index tables; on lookup, the subtree index table quickly locates the subtree position in the XML file, which speeds up retrieval.

3. The content of the designated region is read according to the subtree index table, and the target information is obtained by traversing the subtree. Because only part of the XML document is loaded into memory, this avoids loading the whole XML file and saves memory.

4. The IXP-nv method is suitable wherever SAX is used, but is more efficient than SAX: once initialization has formed the element subtrees, a lookup can quickly locate the subtree containing a specific element and traverse only that small range, whereas SAX usually has to traverse the document from the beginning of the file. IXP-nv uses very little memory, and its memory use does not grow as the file grows.

5. Parsing consists of two parts: looking up the index tables and matching elements within the subtree of a specific element. The IXP implementation uses a hash function to optimize the index tables, so the index lookup time is almost constant and the time to search an element subtree depends only on the size of the subtree. The IXP method therefore also performs well on large XML documents.

Parser performance usually depends on file characteristics such as the ratio of tags to data, the degree of attribute use, the number of element subtrees, and the average element subtree size. Based on the IXP parsing method, an IXP parser was implemented in C++; the embodiment compares the performance of IXP with that of MSXML in terms of initialization time and mean access time, which shows the effect of the invention.

Description of drawings

Fig. 1: Index building and query process in IXP

Fig. 2: Comparison of initialization performance of the DOM, SAX, and IXP methods

Fig. 3: Comparison of element access time of the DOM, SAX, and IXP methods

Fig. 4: Index table structure formed after initialization of the IXP method

Embodiment

The present invention is further described below with reference to the embodiment and the drawings:

The implementation of the IXP parsing method is illustrated using the BookSet.xml document as an example. The structure of the BookSet.xml document conforms to the DTD definition in the following table:

IXP traverses the XML data and splits the entire document into many element subtrees, each rooted at a designated element. From the DTD it can be seen that the subtree element is "Book"; the positions of all start tags (<Book>) and end tags (</Book>) are marked to form the element subtree index table. If the subtree containing the target element can be located quickly, retrieval is greatly accelerated. Any tag may be used to create an index; an element with unique values is recommended. In this example ISBN is used as the key tag, and the text between "<ISBN>" and "</ISBN>" is recorded as the index value. After initialization, IXP has created the element subtree index table and the key tag index table shown in Fig. 4.
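How the subtree element "Book" might be read from a DTD (step 1) can be sketched as below. The DTD table itself is not reproduced in this text, so the `SAMPLE_DTD` declarations are purely hypothetical, as is the assumption that the first declared element is the root. The regular expressions handle only simple `<!ELEMENT ...>` declarations.

```python
import re

# Hypothetical DTD; the actual DTD of BookSet.xml is not reproduced in the text.
SAMPLE_DTD = """
<!ELEMENT BookSet (Book*)>
<!ELEMENT Book (ISBN, Title, Author)>
<!ELEMENT ISBN (#PCDATA)>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Author (#PCDATA)>
"""


def subtree_tags(dtd_text, root=None):
    """Return the child element names declared directly under the root element."""
    declarations = re.findall(r"<!ELEMENT\s+([\w.:-]+)\s+([^>]*)>", dtd_text)
    if not declarations:
        return []
    models = dict(declarations)
    root = root or declarations[0][0]  # assumption: first declaration is the root
    # Names in the content model, skipping keywords such as #PCDATA.
    names = re.findall(r"(?<!#)\b[A-Za-z_][\w.:-]*", models[root])
    unique = dict.fromkeys(name for name in names if name not in ("EMPTY", "ANY"))
    return list(unique)
```

With the hypothetical DTD above, `subtree_tags(SAMPLE_DTD)` yields the single subtree tag name `Book`, which is the name passed to the index-building step.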

To query the author's name of the book whose ISBN is "7-302-04517", only positions 67 to 140 need to be loaded and searched, a span of just 73 characters.
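Loading only that span can be sketched with a file seek. The sketch assumes the recorded positions are byte offsets into the file, which holds for single-byte encodings; the function name is hypothetical.

```python
def read_range(path, start, end):
    """Return only the bytes in [start, end) of the file, without reading the rest."""
    with open(path, "rb") as handle:
        handle.seek(start)
        return handle.read(end - start)


SUBTREE_START, SUBTREE_END = 67, 140  # positions quoted in the text
SPAN = SUBTREE_END - SUBTREE_START    # 140 - 67 = 73 characters
```

A call such as `read_range(path, SUBTREE_START, SUBTREE_END)` returns just the bytes of the one subtree, regardless of how large the rest of the file is.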

Test case:

Taking BookSet.xml as the example, 7 XML documents were created with sizes of 13 KB, 119 KB, 238 KB, 471 KB, 934 KB, 1871 KB, and 4652 KB, containing different numbers of elements. The time overhead of initialization and parsing with MSXML 4.0 and IXP was then measured and compared.

Initialization performance comparison:

Each XML document was initialized 10 times with DOM, SAX, and IXP respectively; the average initialization times are shown in Fig. 2.

During initialization, DOM must traverse and parse the XML file and build the document tree in memory, so its initialization time is proportional to the size of the XML document. SAX only creates a handle to the open file without actually reading data, so its overhead is the smallest. IXP must parse the document and build the index tables, which takes some time. Because it does not need to build a complete tree structure, its overhead is much smaller than DOM's, and this advantage becomes more pronounced as the document grows.

XML document parsing performance comparison:

For the 7 XML documents, 1000 random lookups were performed without syntax caching; the mean access time broadly reflects the parsing performance of each method. The results are shown in Fig. 3.

Fig. 3 shows that the DOM parser performs well on small documents, but its overhead on large documents is far beyond expectation. The SAX parser performs better than DOM but is still inefficient. Further study found that, with the DTD used in the example, DOM and SAX have similar performance when the XML document is about 720 KB. During IXP initialization the element subtree table and the key tag index table are created, and parsing consists of two parts: looking up the index tables and matching elements within the specific element subtree. IXP uses a hash function to optimize the index tables, so index lookup time is almost constant and the time to search an element subtree depends only on the subtree size. The IXP parser therefore also performs well on large XML documents.

In the worst case, IXP must reinitialize the index table with a new index tag. Comparing the DOM and SAX parsing times with the sum of the IXP initialization and parsing times, IXP still greatly outperforms DOM and SAX, as shown in Table 1:

Table 1: Comparison of DOM and SAX parsing time with the sum of IXP initialization time and parsing time

Document size (KB)	DOM (ms)	SAX (ms)	IXP (ms)
13	8.9	11.1	7.0
119	11.1	44.4	12.8
238	17.8	66.6	22.1
471	55.5	122.1	39
934	1000.1	233.1	72.6
1871	5758.7	577.2	144.7
4652	35381.3	1144.4	426.5

1. Document size is in kilobytes (KB); time is in milliseconds (ms).

2. The values in the DOM and SAX columns are the times the DOM and SAX methods took to parse an element.

3. The values in the IXP column are the sum of the IXP method's initialization time and parsing time.
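The relative cost of DOM and SAX against IXP can be read off Table 1 with a few lines of arithmetic. The sketch below copies the table values as given; the variable and function names are hypothetical.

```python
# Table 1 values: document size (KB) -> (DOM ms, SAX ms, IXP ms).
TABLE_1 = {
    13: (8.9, 11.1, 7.0),
    119: (11.1, 44.4, 12.8),
    238: (17.8, 66.6, 22.1),
    471: (55.5, 122.1, 39.0),
    934: (1000.1, 233.1, 72.6),
    1871: (5758.7, 577.2, 144.7),
    4652: (35381.3, 1144.4, 426.5),
}


def ratios_to_ixp(table):
    """For each document size, return (DOM time / IXP time, SAX time / IXP time)."""
    return {size: (dom / ixp, sax / ixp) for size, (dom, sax, ixp) in table.items()}
```

For the 4652 KB document the ratios come to roughly 83 for DOM and roughly 2.7 for SAX. For the 119 KB and 238 KB documents the DOM ratio is slightly below 1, so with these figures the advantage over DOM appears from 471 KB upward.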

The comparison experiments show that, for large XML documents, the IXP method parses far faster than the DOM and SAX methods. By providing a general interface, the method can be applied widely to the parsing of various XML documents and offers a new approach to XML text parsing.

Claims

1. A method for parsing XML files by indexing, characterized in that its steps are divided into an initialization phase and a parsing phase:

Initialization phase:

Step 1: traverse the DTD document that describes the structure of the XML document and extract the subtree tag names under the root node in the DTD document;

A read-only XML document uses the non-validating IXP mode, as follows:

A non-read-only XML document uses the validating IXP mode, as follows:

Step 4': extract the keyword name from the query condition and look up the key element index table with the same name as the keyword; match the value in the query condition against each entry in the key element index table, find the matching entry, and extract its subtree index number;

Step 5': look up the subtree index table by the subtree index number; after finding the matching entry, extract its "start relative position", locate the start of the corresponding subtree in the XML document from that position, load the whole subtree into memory starting there, traverse within the loaded subtree to find the value of the element that satisfies the query condition, and return it; the query then ends. If the element value also needs to be changed after the element is found, continue with the following step;

Step 6': update the value of the element found in step 5' and set the "modified" flag to "true" in the subtree index table entry for this subtree. When the XML document is closed, first write to the disk file, in order, the in-memory content of every subtree whose "modified" flag in the subtree index table is "true", and then close the file.