CN101777045A - Method for analyzing XML file by indexing - Google Patents

Method for analyzing XML file by indexing

Info

Publication number
CN101777045A
CN101777045A CN200810150767A CN200810150767A CN101777045A CN 101777045 A CN101777045 A CN 101777045A CN 200810150767 A CN200810150767 A CN 200810150767A CN 200810150767 A CN200810150767 A CN 200810150767A CN 101777045 A CN101777045 A CN 101777045A
Authority
CN
China
Prior art keywords
subtree
index table
xml document
ixp
xml
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200810150767A
Other languages
Chinese (zh)
Inventor
杨刚
周兴社
张海辉
詹涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN200810150767A
Publication of CN101777045A

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention relates to a method for parsing an XML file by indexing. Its technical features are as follows. The method traverses the DTD file that describes the structure of the XML file and extracts the subtree tag names under the root node of the DTD. It then creates a hash table, traverses the XML file to be parsed according to the extracted subtree tag names, finds and records the starting relative position of every subtree tag in the XML file, constructs a new entry according to the entry structure, and adds the entry to the hash table to form a subtree index table. It then creates a key element index table and parses using either the non-validating IXP parsing mode or the validating IXP parsing mode. The benefits are as follows: for large XML files, the IXP method parses far faster than the DOM and SAX methods. By providing a general-purpose interface, the method can be widely applied to parsing all kinds of XML files, and it offers a new approach to XML text parsing.

Description

A method for parsing XML files by indexing
Technical field
The present invention relates to a method for parsing XML files by indexing, and belongs to the field of XML information processing.
Background art
XML (eXtensible Markup Language), proposed by the W3C, has become the information organization and description format generally adopted by Web 2.0 applications and by information processing systems of all kinds, thanks to its flexibility and self-describing nature, and it is being applied ever more widely. XML parsers in fairly wide use today include IBM's XML4J, Microsoft's MSXML, Oracle's XML Parser, Sun's Java™, JDOM, and others. The parsing methods these parsers use fall into two broad classes: DOM and SAX.
DOM is a tree-based parsing method proposed by the W3C. When parsing an XML file, DOM treats the elements, attributes, comments, and processing instructions in the document as nodes of a tree, organizing the content of the XML document into a tree-shaped information structure. Because the DOM method must build a tree data structure corresponding to the XML file being parsed, the memory it occupies is proportional to the document size (typically 2 to 5 times the size), so it is unsuitable for large XML documents. Researchers have made some improvements to raise the parsing performance of DOM; the main results are the deferred DOM parsing method (DDOM) and the compressed DOM parsing method (SEDOM). DDOM improves on DOM in that it does not need to build the complete parse tree; it builds only the required parts of the DOM tree as the document is accessed. DDOM is mainly suited to sparse access to XML; when most of the document has to be accessed, DDOM performs worse than ordinary DOM. SEDOM markedly reduces the storage consumed by ordinary DOM parsing by introducing compression, but the compression operations inevitably affect parsing performance.
SAX is another widely used XML parsing method, proposed by members of the XML-DEV mailing list. The core of SAX is a linear scan of the XML document: it looks for the tags the user cares about and fires the corresponding events, and the user's access to and parsing of the document are carried out in the event handlers. SAX can parse files of any size, is simple to implement, and consumes few resources. Its parsing efficiency is fairly high when the document content does not need to be changed and access is sequential.
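As a minimal illustration of the event-driven model described above, the following Python sketch uses the standard-library `xml.sax` module. The sample document and the `TitleCollector` handler are illustrative assumptions and do not come from the patent.

```python
import xml.sax

# Hypothetical sample document; the patent text does not reproduce BookSet.xml.
SAMPLE = (b"<BookSet>"
          b"<Book><ISBN>1</ISBN><Title>A</Title></Book>"
          b"<Book><ISBN>2</ISBN><Title>B</Title></Book>"
          b"</BookSet>")


class TitleCollector(xml.sax.ContentHandler):
    """Collects the text of every <Title> element during one linear scan."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buf = []

    def startElement(self, name, attrs):
        # Event fired for each start tag; only <Title> is of interest here.
        if name == "Title":
            self._in_title = True
            self._buf = []

    def characters(self, content):
        # Text may arrive in several pieces, so it is buffered.
        if self._in_title:
            self._buf.append(content)

    def endElement(self, name):
        if name == "Title":
            self.titles.append("".join(self._buf))
            self._in_title = False


handler = TitleCollector()
xml.sax.parseString(SAMPLE, handler)
print(handler.titles)
```

The handler sees every tag in document order, which is why reaching an element near the end of a large file requires scanning everything before it.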
Matthias points out that in information systems that use XML as their general data description and organization format, the XML parsing method has a direct and significant effect on system performance. DOM can build the complete structure of an XML document and supports random access, but it consumes considerable computing resources and is unsuitable for fast parsing of large XML documents. SAX consumes few resources and can parse large XML documents fairly efficiently, but it offers neither random access to the XML document nor the ability to modify it online.
Summary of the invention
Technical problem to be solved
To overcome the shortcomings of the prior art, the present invention proposes a method for parsing XML files by indexing that improves parsing speed and retrieval efficiency for large XML documents, reduces resource consumption, and provides random access to the XML document.
Technical solution
A method for parsing XML files by indexing introduces an index mechanism into parsing and uses the index to speed up random access to each element in the XML document. Its technical feature is that the method steps are divided into an initialization phase and a parsing phase:
Initialization phase:
Step 1: traverse the DTD document that describes the structure of the XML document and extract the subtree tag names under the root node in the DTD document.
Step 2: first create an empty hash table. Using the extracted subtree tag names, traverse the XML document to be parsed, find and record the starting relative position of each subtree tag in the XML document, construct a new entry according to the entry structure, and add it to the hash table to form the subtree index table. The entry structure is as follows: for a read-only XML document, (index number, starting relative position); for a non-read-only XML document, (index number, starting relative position, modified flag).
Step 3: create the key element index table. First create an empty hash table. Traverse the XML document by the tag name of the key element, extract the value of every occurrence of that tag in the document, and query the subtree index table to obtain the index number of the subtree containing each occurrence. Insert entries with the structure (subtree index number, value of the key element tag) into the hash table to form the index table for this key element.
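The three initialization steps can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the patent's C++ implementation: the DTD and XML samples are hypothetical, the function names `subtree_tags`, `build_subtree_index`, and `build_key_index` are invented for the example, positions are character offsets in an in-memory string, and regular expressions stand in for a real scanner.

```python
import bisect
import re

# Hypothetical DTD and document; the patent's BookSet.xml is not reproduced in the text.
DTD = """<!ELEMENT BookSet (Book*)>
<!ELEMENT Book (ISBN, Title, Author)>
<!ELEMENT ISBN (#PCDATA)>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Author (#PCDATA)>"""

XML = ("<BookSet>"
       "<Book><ISBN>111</ISBN><Title>A</Title><Author>Ann</Author></Book>"
       "<Book><ISBN>222</ISBN><Title>B</Title><Author>Bob</Author></Book>"
       "</BookSet>")


def subtree_tags(dtd):
    """Step 1: child element names of the root (assumed to be the first <!ELEMENT>)."""
    model = re.search(r"<!ELEMENT\s+\S+\s+\(([^)]*)\)", dtd).group(1)
    return re.findall(r"(?<![#\w])[A-Za-z_][\w.-]*", model)


def build_subtree_index(xml, tags, read_only=True):
    """Step 2: hash table of index number -> entry with the subtree's start offset."""
    pattern = re.compile("|".join(r"<%s[\s>]" % re.escape(t) for t in tags))
    index = {}
    for number, match in enumerate(pattern.finditer(xml)):
        entry = {"start": match.start()}
        if not read_only:
            entry["modified"] = False  # extra field for non-read-only documents
        index[number] = entry
    return index


def build_key_index(xml, subtree_index, key_tag):
    """Step 3: hash table of key element value -> index number of its subtree."""
    starts = [subtree_index[n]["start"] for n in sorted(subtree_index)]
    tag = re.escape(key_tag)
    key_index = {}
    for match in re.finditer(r"<%s>(.*?)</%s>" % (tag, tag), xml, re.S):
        # The enclosing subtree is the last one that starts before this element.
        key_index[match.group(1)] = bisect.bisect_right(starts, match.start()) - 1
    return key_index


tags = subtree_tags(DTD)
subtree_index = build_subtree_index(XML, tags)
isbn_index = build_key_index(XML, subtree_index, "ISBN")
print(tags, subtree_index, isbn_index)
```

Python dictionaries are hash tables, so both index tables here have the near-constant lookup cost that the method relies on.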
The parsing phase has two modes: the non-validating IXP parsing mode and the validating IXP parsing mode.
A read-only XML document uses the non-validating IXP mode, as follows:
Step 4: extract the keyword name from the query condition and look up the index table of the key element with the same name as the keyword. Match the value in the query condition against each entry of that key element index table, find the matching entry, and extract its subtree index number. If no index table exists for a key element with the same name as the keyword, return to step 3.
Step 5: query the subtree index table by the subtree index number. After finding the matching entry in the subtree index table, extract its starting relative position, locate the start of the subtree in the XML document from that position, and load the entire subtree into memory. Traverse within the subtree, find the element value that satisfies the query condition, and return it; the query ends.
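A query in the non-validating mode might look roughly like the sketch below. It is only an illustration under assumptions: the sample document, the `query` function, and the inline construction of the two index tables are hypothetical, positions are byte offsets in a temporary file, and the subtree's end is found by scanning for its closing tag.

```python
import os
import re
import tempfile

# Hypothetical read-only document.
XML = (b"<BookSet>"
       b"<Book><ISBN>111</ISBN><Author>Ann</Author></Book>"
       b"<Book><ISBN>222</ISBN><Author>Bob</Author></Book>"
       b"</BookSet>")

# Index tables as they might look after initialization (built inline so the sketch runs alone).
subtree_index = {n: m.start() for n, m in enumerate(re.finditer(rb"<Book>", XML))}
key_indexes = {"ISBN": {}}
for n, start in subtree_index.items():
    isbn = re.search(rb"<ISBN>(.*?)</ISBN>", XML[start:]).group(1).decode()
    key_indexes["ISBN"][isbn] = n


def query(path, key_tag, key_value, wanted_tag, subtree_tag="Book"):
    if key_tag not in key_indexes:
        # The method returns to step 3 here to build the missing index table.
        raise LookupError("no index table for key element %r" % key_tag)
    # Step 4: key element index table -> subtree index number.
    number = key_indexes[key_tag][key_value]
    # Step 5: subtree index table -> start offset, then load only that subtree.
    end_tag = b"</" + subtree_tag.encode() + b">"
    chunk = b""
    with open(path, "rb") as f:
        f.seek(subtree_index[number])
        while end_tag not in chunk:
            block = f.read(4096)
            if not block:
                return None  # malformed document: end tag never found
            chunk += block
    subtree = chunk[: chunk.index(end_tag) + len(end_tag)]
    tag = re.escape(wanted_tag.encode())
    match = re.search(b"<" + tag + b">(.*?)</" + tag + b">", subtree, re.S)
    return match.group(1).decode() if match else None


with tempfile.NamedTemporaryFile(delete=False, suffix=".xml") as tmp:
    tmp.write(XML)
result = query(tmp.name, "ISBN", "222", "Author")
os.remove(tmp.name)
print(result)
```

Only the bytes from the subtree's start offset up to its closing tag are examined, regardless of how large the rest of the file is.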
A non-read-only XML document uses the validating IXP mode, as follows:
Step 4: extract the keyword name from the query condition and look up the key element index table with the same name as the keyword. Match the value in the query condition against each entry of that key element index table, find the matching entry, and extract its subtree index number.
Step 5: query the subtree index table by the subtree index number. After finding the matching entry, extract its "starting relative position", locate the start of the corresponding subtree in the XML document from that position, and load the entire subtree into memory starting there. Then traverse within the subtree loaded in memory, find the value of the element that satisfies the query condition, and return it; the query ends. If the element value also needs to be changed after the element is found, continue with the following step.
Step 6: update the value of the element found in step 5 and set the "modified" flag to "true" in the subtree index table entry for this subtree. When the XML document is closed, first write to the disk file, in order, the in-memory content of every subtree whose "modified" flag in the subtree index table is "true", then close the file.
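The modified-flag bookkeeping of step 6 can be sketched as below. This is a hedged illustration, not the patent's implementation: the `IndexedDoc` class and its methods are hypothetical, and because an updated subtree may change length, the sketch rewrites the file on close by splicing in the modified subtrees, which is one possible reading of "write back in order".

```python
import os
import re
import tempfile


class IndexedDoc:
    """Hypothetical holder for a non-read-only document and its subtree index table."""

    def __init__(self, path, subtree_tag="Book"):
        self.path = path
        self.end_tag = b"</" + subtree_tag.encode() + b">"
        start_tag = b"<" + subtree_tag.encode() + b">"
        with open(path, "rb") as f:
            data = f.read()  # one-off scan standing in for the initialization phase
        # Subtree index table: index number -> [start offset, modified flag]
        self.index = {n: [m.start(), False]
                      for n, m in enumerate(re.finditer(re.escape(start_tag), data))}
        self.loaded = {}    # index number -> subtree bytes held in memory
        self.orig_len = {}  # index number -> length of the subtree on disk

    def load(self, number):
        """Step 5: load one subtree into memory, starting from its indexed offset."""
        if number not in self.loaded:
            with open(self.path, "rb") as f:
                f.seek(self.index[number][0])
                rest = f.read()  # simplification: a real parser would stop at the end tag
            subtree = rest[: rest.index(self.end_tag) + len(self.end_tag)]
            self.loaded[number] = subtree
            self.orig_len[number] = len(subtree)
        return self.loaded[number]

    def update(self, number, tag, new_value):
        """Step 6: change an element value in memory and set the modified flag."""
        t = re.escape(tag.encode())
        current = self.load(number)
        self.loaded[number] = re.sub(
            b"(<" + t + b">).*?(</" + t + b">)",
            lambda m: m.group(1) + new_value.encode() + m.group(2),
            current, count=1, flags=re.S)
        self.index[number][1] = True

    def close(self):
        """Write back every subtree whose modified flag is true, in index order."""
        with open(self.path, "rb") as f:
            data = f.read()
        out, pos = [], 0
        for number in sorted(self.index):
            start, modified = self.index[number]
            if modified:
                out += [data[pos:start], self.loaded[number]]
                pos = start + self.orig_len[number]
        out.append(data[pos:])
        with open(self.path, "wb") as f:
            f.write(b"".join(out))


XML = (b"<BookSet><Book><ISBN>111</ISBN><Author>Ann</Author></Book>"
       b"<Book><ISBN>222</ISBN><Author>Bob</Author></Book></BookSet>")
with tempfile.NamedTemporaryFile(delete=False, suffix=".xml") as tmp:
    tmp.write(XML)
doc = IndexedDoc(tmp.name)
doc.update(1, "Author", "Robert")
doc.close()
with open(tmp.name, "rb") as f:
    final = f.read()
os.remove(tmp.name)
print(final.decode())
```

Subtrees whose flag is still false are never written, so an untouched region of the file is copied through unchanged.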
Beneficial effects
The method for parsing XML files by indexing proposed by the present invention has the following characteristics:
1. Before any operation on the XML document there is an index-building pass, during which the key element index table and the subtree index table are created from the specified subtree nodes and key nodes. IXP then lets the application run queries against these tables.
2. While parsing the XML document, node information is recorded in index tables. During a lookup, the subtree index table locates the subtree's position in the XML file quickly, which speeds up retrieval.
3. The content of the designated region is read according to the subtree index table, and the target information is obtained by traversing the subtree. Because only part of the XML document is loaded into memory, this avoids loading the whole XML file and so saves memory.
4. The IXP-nv method can handle every situation in which SAX is used, but more efficiently: once initialization has formed the element subtrees, a lookup can quickly locate the subtree containing a given element and traverse only that small range, whereas SAX usually has to traverse the document from the start of the file. IXP-nv uses very little memory, and its memory use does not grow as the file gets larger.
5. Parsing consists of two parts: looking up the index tables and matching elements within the subtree of a given element. The IXP implementation uses a hash function to optimize the index tables, so index lookup time is nearly constant, and the time to search an element subtree depends only on the size of that subtree. The IXP method therefore performs well even on large XML documents.
Parser performance usually depends on file characteristics such as the ratio of markup to data, the degree of attribute usage, the number of element subtrees, and the average element subtree size. An IXP parser based on the IXP parsing method was implemented in C++. In the embodiment, the effect of the invention can be seen by comparing the performance of IXP with that of MSXML in terms of initialization time and mean access time.
Description of drawings
Fig. 1: Index building and query process in IXP
Fig. 2: Comparison of initialization performance of the DOM, SAX, and IXP methods
Fig. 3: Comparison of element access time of the DOM, SAX, and IXP methods
Fig. 4: Index table structure formed after initialization of the IXP method
Embodiment
The present invention is now further described with reference to an embodiment and the drawings:
Taking the BookSet.xml document as an example, the implementation of the IXP parsing method is illustrated below. The DTD of the BookSet.xml document conforms to the definition in the following table:
[Figure G2008101507677D00051: DTD definition of BookSet.xml; image not reproduced]
IXP parsing traverses the XML data and, taking the designated element as the root, splits the whole document into many element subtrees. In this DTD it can be seen that the subtree element is "Book"; the positions of all start tags (<Book>) and end tags (</Book>) are marked to form the element subtree index table. If the subtree containing the target element can be located quickly, retrieval is greatly accelerated. Any tag that is needed can be used to create an index; an element with unique values is recommended. In this example ISBN is used as the key tag, and the text between "<ISBN>" and "</ISBN>" is recorded as the index value. After the initialization process, IXP has created the element subtree index table and the key tag index table shown in Fig. 4.
To query the author name of the book whose ISBN is "7-302-04517", only positions 67 to 140 need to be loaded and searched, a span of just 73 characters.
Test case:
Taking BookSet.xml as the template, 7 XML documents were created with sizes of 13 KB, 119 KB, 238 KB, 471 KB, 934 KB, 1871 KB, and 4652 KB, containing different numbers of elements. The time overhead of initialization and parsing with MSXML 4.0 and with IXP was then measured and compared.
Initialization performance comparison:
Each XML document was initialized 10 times with DOM, SAX, and IXP respectively; the average initialization times are shown in Fig. 2.
During initialization, DOM has to traverse and parse the XML file and build the document tree in memory, so its initialization time is proportional to the size of the XML document. SAX only creates a handle for the opened file and does not actually read any data, so its overhead is the smallest. IXP has to parse the document and build the index tables, which takes a certain amount of time. Because it does not need to build a complete tree structure, its overhead is much smaller than DOM's, and this advantage becomes more pronounced as the document grows.
XML document parsing performance comparison:
Without any parse caching, 1000 random lookups were performed on each of the 7 XML documents. The mean access time broadly reflects the parsing performance of each method; the results are shown in Fig. 3.
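A measurement of this kind could be set up along the following lines. The harness is a hypothetical sketch, not the patent's test program: `mean_access_time_ms` and the stand-in lookup table are assumptions, and the timings it prints depend entirely on the machine it runs on.

```python
import random
import time


def mean_access_time_ms(lookup, keys, trials=1000, seed=0):
    """Average wall-clock time of `trials` lookups on randomly chosen keys, in ms."""
    rng = random.Random(seed)
    chosen = [rng.choice(keys) for _ in range(trials)]  # pick keys before timing starts
    start = time.perf_counter()
    for key in chosen:
        lookup(key)
    return (time.perf_counter() - start) * 1000 / trials


# Stand-in for a key element index table; any callable lookup can be timed the same way.
table = {str(i): i for i in range(10_000)}
mean_ms = mean_access_time_ms(table.__getitem__, list(table))
print(f"{mean_ms:.6f} ms per lookup")
```

Choosing the random keys before the timer starts keeps the cost of random number generation out of the measured access time.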
Fig. 3 shows that the DOM parser performs well on small documents, but its overhead on large documents far exceeds expectations. The SAX parser performs better than DOM but is still inefficient. Further investigation shows that, under the DTD used in the example, DOM and SAX perform about the same when the XML document is 720 KB. The IXP initialization process creates the element subtree table and the key tag index table, and parsing consists of two parts: looking up the index tables and matching elements within the given element subtree. IXP uses a hash function to optimize the index tables, so index lookup time is nearly constant, and the time to search an element subtree depends only on the size of the subtree. The IXP parser therefore performs well even on large XML documents.
In the worst case, IXP has to reinitialize the index table with a new index tag. Comparing the DOM and SAX parsing times with the sum of IXP's initialization and parsing times, IXP still performs much better than DOM and SAX, most clearly on the larger documents, as shown in Table 1:
Table 1: DOM and SAX parsing times compared with the sum of IXP initialization and parsing times
Document size (KB)    DOM (ms)    SAX (ms)    IXP (ms)
13                    8.9         11.1        7.0
119                   11.1        44.4        12.8
238                   17.8        66.6        22.1
471                   55.5        122.1       39
934                   1000.1      233.1       72.6
1871                  5758.7      577.2       144.7
4652                  35381.3     1144.4      426.5
1. Document size is in kilobytes (KB); times are in milliseconds (ms).
2. The DOM and SAX columns give the time the DOM method and the SAX method, respectively, spent parsing an element.
3. The IXP column gives the sum of the IXP method's initialization time and parsing time.
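To make the comparison in Table 1 easier to read, the ratios of the DOM and SAX times to the IXP time can be computed directly from the table's figures. The short Python sketch below only performs that arithmetic; the name `TABLE_1` and the layout of the dictionary are choices made for this example.

```python
# Rows of Table 1: document size (KB) -> (DOM, SAX, IXP) times in ms.
TABLE_1 = {
    13: (8.9, 11.1, 7.0),
    119: (11.1, 44.4, 12.8),
    238: (17.8, 66.6, 22.1),
    471: (55.5, 122.1, 39.0),
    934: (1000.1, 233.1, 72.6),
    1871: (5758.7, 577.2, 144.7),
    4652: (35381.3, 1144.4, 426.5),
}

# Ratio above 1 means IXP (initialization plus parsing) was faster than the other method.
speedups = {size: (dom / ixp, sax / ixp) for size, (dom, sax, ixp) in TABLE_1.items()}
for size, (vs_dom, vs_sax) in speedups.items():
    print(f"{size:>5} KB  DOM/IXP = {vs_dom:7.2f}  SAX/IXP = {vs_sax:5.2f}")
```

A ratio below 1, as for DOM at 119 KB and 238 KB, means DOM's parsing time alone was shorter than IXP's combined initialization and parsing time at that size; the gap in IXP's favor widens quickly from 471 KB upward.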
Comparative experiments show that for large XML documents the IXP method parses far faster than the DOM and SAX methods. By providing a general-purpose interface, the method can be widely applied to parsing all kinds of XML documents, and it offers a new approach to XML text parsing.

Claims (1)

1. A method for parsing XML files by indexing, characterized in that the steps are divided into an initialization phase and a parsing phase:
Initialization phase:
Step 1: traverse the DTD document that describes the structure of the XML document and extract the subtree tag names under the root node in the DTD document;
Step 2: first create an empty hash table; using the extracted subtree tag names, traverse the XML document to be parsed, find and record the starting relative position of each subtree tag in the XML document, construct a new entry according to the entry structure, and add it to the hash table to form the subtree index table; the entry structure is: for a read-only XML document, (index number, starting relative position); for a non-read-only XML document, (index number, starting relative position, modified flag);
Step 3: create the key element index table: first create an empty hash table, traverse the XML document by the tag name of the key element, extract the value of every occurrence of that tag in the document, and query the subtree index table to obtain the index number of the subtree containing each occurrence; insert entries with the structure (subtree index number, value of the key element tag) into the hash table to form the index table for this key element;
The parsing phase has two modes: the non-validating IXP parsing mode and the validating IXP parsing mode;
A read-only XML document uses the non-validating IXP mode, as follows:
Step 4: extract the keyword name from the query condition and look up the index table of the key element with the same name as the keyword; match the value in the query condition against each entry of that key element index table, find the matching entry, and extract its subtree index number; if no index table exists for a key element with the same name as the keyword, return to step 3;
Step 5: query the subtree index table by the subtree index number; after finding the matching entry in the subtree index table, extract its starting relative position, locate the start of the subtree in the XML document from that position, and load the entire subtree into memory; traverse within the subtree, find the element value that satisfies the query condition, and return it; the query ends;
A non-read-only XML document uses the validating IXP mode, as follows:
Step 4': extract the keyword name from the query condition and look up the key element index table with the same name as the keyword; match the value in the query condition against each entry of that key element index table, find the matching entry, and extract its subtree index number;
Step 5': query the subtree index table by the subtree index number; after finding the matching entry, extract its "starting relative position", locate the start of the corresponding subtree in the XML document from that position, and load the entire subtree into memory starting there; then traverse within the subtree loaded in memory, find the value of the element that satisfies the query condition, and return it; the query ends; if the element value also needs to be changed after the element is found, continue with the following step;
Step 6': update the value of the element found in step 5' and set the "modified" flag to "true" in the subtree index table entry for this subtree; when the XML document is closed, first write to the disk file, in order, the in-memory content of every subtree whose "modified" flag in the subtree index table is "true", then close the file.
CN200810150767A 2008-09-01 2008-09-01 Method for analyzing XML file by indexing Pending CN101777045A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200810150767A CN101777045A (en) 2008-09-01 2008-09-01 Method for analyzing XML file by indexing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200810150767A CN101777045A (en) 2008-09-01 2008-09-01 Method for analyzing XML file by indexing

Publications (1)

Publication Number Publication Date
CN101777045A true CN101777045A (en) 2010-07-14

Family

ID=42513511

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200810150767A Pending CN101777045A (en) 2008-09-01 2008-09-01 Method for analyzing XML file by indexing

Country Status (1)

Country Link
CN (1) CN101777045A (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102385604A (en) * 2010-09-06 2012-03-21 上海可鲁系统软件有限公司 Rapid analyzing method and system for SVG (Scalable Vector Graphics) file
CN102385604B (en) * 2010-09-06 2013-08-14 上海可鲁系统软件有限公司 Rapid analyzing method and system for SVG (Scalable Vector Graphics) file
CN101996252A (en) * 2010-11-17 2011-03-30 浙江省电力试验研究院 Expression method of indexing information for node element in XML (Extensive Makeup Language) file
CN101996252B (en) * 2010-11-17 2013-01-16 浙江省电力公司电力科学研究院 Processing method of node element in XML (Extensive Makeup Language) file resolution
CN102222083A (en) * 2011-05-06 2011-10-19 中国科学院研究生院 Creation-object-based extensible business reporting language (XBRL) taxonomy rapid-resolution method
CN102521602A (en) * 2011-11-17 2012-06-27 西安电子科技大学 Hyper-spectral image classification method based on conditional random field and minimum distance method
CN103914437A (en) * 2012-12-29 2014-07-09 上海可鲁系统软件有限公司 XML (X Exrensible Markup Language) text positioning method based on DOM (Document Object Model) model
CN103678510A (en) * 2013-11-25 2014-03-26 北京奇虎科技有限公司 Method and device for providing visualized label for webpage
CN105224531A (en) * 2014-05-28 2016-01-06 腾讯科技(深圳)有限公司 The method and apparatus of localization of XML node
CN106293862A (en) * 2015-06-25 2017-01-04 中国移动通信集团山东有限公司 A kind of analysis method and device of expandable mark language XML data
CN106293862B (en) * 2015-06-25 2019-05-24 中国移动通信集团山东有限公司 A kind of analysis method and device of expandable mark language XML data
CN106469137A (en) * 2015-08-19 2017-03-01 互联网域名系统北京市工程研究中心有限公司 XML document analysis method and device
CN107220283A (en) * 2017-04-21 2017-09-29 东软集团股份有限公司 Data processing method, device, storage medium and electronic equipment
CN107220283B (en) * 2017-04-21 2019-11-08 东软集团股份有限公司 Data processing method, device, storage medium and electronic equipment
CN110019970A (en) * 2018-06-15 2019-07-16 中国平安人寿保险股份有限公司 Inventory downloads template creation method, device, terminal and readable storage medium storing program for executing
CN112417085A (en) * 2020-11-27 2021-02-26 平安普惠企业管理有限公司 Message comparison method and device, computer equipment and storage medium
CN115529271A (en) * 2022-10-17 2022-12-27 中国农业银行股份有限公司 Service request distribution method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN101777045A (en) Method for analyzing XML file by indexing
US7720789B2 (en) System and method of member unique names
AU2002334706B2 (en) Mechanism for mapping XML schemas to object-relational database systems
Balmin et al. Incremental validation of XML documents
CN102804168B (en) The data compression of storage demand is reduced in Database Systems
US7844642B2 (en) Method and structure for storing data of an XML-document in a relational database
KR100396462B1 (en) Message transformation selection tool and method
CN100541493C (en) The apparatus and method that are used for structured document management
US20050262440A1 (en) Localization of XML via transformations
CN1584884B (en) Apparatus for searching data of structured document
CN102222083A (en) Creation-object-based extensible business reporting language (XBRL) taxonomy rapid-resolution method
CN102456053A (en) Method for mapping XML document to database
CN103246857A (en) Method for resolving heterogeneous code to acquire object information by using formalized decoding rule
US20030121005A1 (en) Archiving and retrieving data objects
US8131728B2 (en) Processing large sized relationship-specifying markup language documents
US20070282804A1 (en) Apparatus and method for extracting database information from a report
CN110019306A (en) A kind of SQL statement lookup method and system based on XML format file
CN114003231B (en) SQL syntax parse tree optimization method and system
Hsu et al. UCIS-X: an updatable compact indexing scheme for efficient extensible markup language document updating and query evaluation
US20020099792A1 (en) Method of performing a search of a numerical document object model
Thao et al. Using versioned trees, change detection and node identity for three-way XML merging
WO2008085359A1 (en) Accelerating queries using delayed value projection of enumerated storage
CN113988003A (en) Method for custom directional analysis of multiple sheet contents of Excel file according to specified configuration
Li et al. Extraction and integration information in HTML tables
CN114756554B (en) Data query processing method based on MyBatis framework

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20100714