CN101777045A - Method for analyzing XML file by indexing - Google Patents
Method for analyzing XML file by indexing
- Publication number
- CN101777045A (application number CN200810150767A)
- Authority
- CN
- China
- Prior art keywords
- subtree
- index table
- xml document
- ixp
- xml
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Document Processing Apparatus (AREA)
Abstract
The invention relates to a method for parsing an XML file by means of an index. Its technical features are as follows. The method first traverses the DTD file that describes the structure of the XML file and extracts the subtree tag names under the root node of the DTD. It then creates a hash table, traverses the XML file to be parsed according to the extracted subtree tag names, finds and records the starting relative position of every subtree tag in the XML file, constructs a new entry from these data items, and adds the entry to the hash table to form a subtree index table. Next it creates a key element index table, and then parses using either the non-validating IXP parsing mode or the validating IXP parsing mode. The benefits are as follows: for large XML files, the IXP method parses far faster than the DOM and SAX methods. By providing a general interface, the method can be applied widely to the parsing of various XML files and offers a new way to parse XML text.
Description
Technical field
The present invention relates to a method for parsing XML files by means of an index, and belongs to the field of XML information processing.
Background art
XML (eXtensible Markup Language), proposed by the W3C, has become the information organization and description format generally adopted by Web 2.0 applications and by information processing systems of all kinds, owing to its flexibility and self-describing nature, and its use keeps growing. XML parsers in fairly wide use today include IBM's XML4J, Microsoft's MSXML, Oracle's XML Parser, Sun's Java™ parser, and JDOM. The parsing methods these parsers use fall into two main classes: DOM and SAX.
DOM is a tree-based parsing method proposed by the W3C. When parsing an XML file, DOM treats the elements, attributes, comments, and processing instructions in the document as nodes of a tree and organizes the content of the XML document into a tree-shaped information structure. Because the DOM method must build a tree data structure corresponding to the XML file being parsed, the memory it occupies is proportional to the document size (generally 2 to 5 times the size), so it is not suitable for large XML documents. To improve the parsing performance of DOM, researchers have made some improvements to it. The main results are the deferred DOM method (DDOM) and the compressed DOM method (SEDOM). DDOM improves on DOM in that it does not build the complete parse tree; it builds parts of the DOM tree on demand as the document is accessed. DDOM is mainly suited to sparse access to XML; if most of the document needs to be accessed, DDOM is slower than ordinary DOM. SEDOM introduces compression and so markedly reduces the storage that ordinary DOM parsing consumes, but the compression work inevitably affects parsing performance.
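As a small illustration of the tree-based approach, the following sketch uses Python's standard `xml.dom.minidom` module. Python, the sample document, and the `Author` tag name are assumptions made for illustration; the patent's own parser is written in C++. The point is that the whole document is turned into an in-memory tree before any element can be read.

```python
from xml.dom.minidom import parseString

# Hypothetical sample document; author names and the <Author> tag are illustrative.
XML = (
    "<BookSet>"
    "<Book><ISBN>7-302-04517</ISBN><Author>Zhang</Author></Book>"
    "<Book><ISBN>7-111-00001</ISBN><Author>Li</Author></Book>"
    "</BookSet>"
)

def authors_by_isbn_dom(xml_text):
    """Build the full DOM tree, then walk it to map ISBN -> author."""
    doc = parseString(xml_text)  # the entire document becomes a tree in memory
    result = {}
    for book in doc.getElementsByTagName("Book"):
        isbn = book.getElementsByTagName("ISBN")[0].firstChild.data
        author = book.getElementsByTagName("Author")[0].firstChild.data
        result[isbn] = author
    return result

print(authors_by_isbn_dom(XML)["7-302-04517"])
```

Once the tree exists, any element can be reached in any order, which is the random access ability that the text attributes to DOM.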
SAX is another widely used XML parsing method, proposed by members of the XML-DEV mailing list. The core of the SAX method is a linear scan of the XML document: it looks for the tags the user cares about and fires the corresponding events, and the user accesses and parses the document inside the event handlers. SAX can parse files of any size, is simple to implement, and consumes few resources. Its parsing efficiency is relatively high when the document content does not need to be changed and access is sequential.
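The event-driven scan described above can be sketched with Python's standard `xml.sax` module. The sample document and the handler class name `IsbnHandler` are assumptions for illustration only; the handler reacts to the one tag it cares about while the parser reads the document from start to end.

```python
import xml.sax

# Hypothetical sample document used only for this illustration.
XML = (
    "<BookSet>"
    "<Book><ISBN>7-302-04517</ISBN><Author>Zhang</Author></Book>"
    "<Book><ISBN>7-111-00001</ISBN><Author>Li</Author></Book>"
    "</BookSet>"
)

class IsbnHandler(xml.sax.ContentHandler):
    """Collects the text of every <ISBN> element during one linear scan."""

    def __init__(self):
        super().__init__()
        self.in_isbn = False
        self.buffer = []
        self.isbns = []

    def startElement(self, name, attrs):
        if name == "ISBN":
            self.in_isbn = True
            self.buffer = []

    def characters(self, content):
        if self.in_isbn:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "ISBN":
            self.isbns.append("".join(self.buffer))
            self.in_isbn = False

def scan_isbns(xml_text):
    """Run one SAX pass over the document and return all ISBN values in order."""
    handler = IsbnHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.isbns

print(scan_isbns(XML))
```

Nothing is kept in memory except what the handler chooses to store, but reaching the last book still requires scanning everything before it.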
Matthias points out that, in information systems that use XML as their general data description and organization format, the XML parsing method has a direct and significant effect on system performance. Although the DOM method builds the complete structure of an XML document and supports random access, it consumes considerable computing resources and is not suited to fast parsing of large XML documents. Although the SAX method consumes few resources and can parse large XML documents efficiently, it offers neither random access to the XML document nor online modification.
Summary of the invention
Technical problem to be solved
To overcome the shortcomings of the prior art, the present invention proposes a method for parsing XML files by means of an index, which improves parsing speed and retrieval efficiency for large XML documents, reduces resource consumption, and provides random access to the XML document.
Technical solution
A method for parsing XML files by means of an index introduces an indexing mechanism into parsing and uses the index to speed up random access to each element in the XML document. It is characterized in that the method steps are divided into an initialization phase and a parsing phase:
Initialization phase:
Step 1: Traverse the DTD document that describes the structure of the XML document, and extract the subtree tag names under the root node in the DTD document;
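Step 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the DTD text is hypothetical (the source's DTD for BookSet.xml is not reproduced), the root element is assumed to be the first `<!ELEMENT>` declaration, and a regular expression stands in for a real DTD parser.

```python
import re

# Hypothetical DTD; element names other than BookSet, Book and ISBN are illustrative.
DTD = """
<!ELEMENT BookSet (Book*)>
<!ELEMENT Book (ISBN, Title, Author)>
<!ELEMENT ISBN (#PCDATA)>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Author (#PCDATA)>
"""

# Matches "<!ELEMENT name content-model>" and captures the name and the model.
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([\w.\-:]+)\s+(.*?)>", re.S)

def subtree_tag_names(dtd_text):
    """Return the child element names declared for the root (first) element."""
    decls = ELEMENT_DECL.findall(dtd_text)
    if not decls:
        return []
    _root_name, content_model = decls[0]
    # Keep element names only; drop DTD keywords and occurrence/grouping symbols.
    names = re.findall(r"[A-Za-z_][\w.\-:]*", content_model)
    return [n for n in names if n not in ("PCDATA", "EMPTY", "ANY")]

print(subtree_tag_names(DTD))
```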
The parsing phase has two modes: the non-validating IXP parsing mode and the validating IXP parsing mode;
Read-only XML documents use the non-validating IXP mode, as follows:
Step 4: Extract the keyword name from the query condition and look up the index table of the key element that has the same name as the keyword. Match the value in the query condition against each entry in that key element index table, find the matching entry, and extract the subtree index number from it. If no index table exists for a key element with the same name as the keyword, return to step 3;
Non-read-only XML documents use the validating IXP mode, as follows:
Step 4: Extract the keyword name from the query condition and look up the key element index table that has the same name as the keyword. Match the value in the query condition against each entry in that key element index table, find the matching entry, and extract the subtree index number from it;
Step 6: Update the value of the element found in step 5 above, and set the "modified" flag to "true" in the subtree index table entry that corresponds to this subtree. When the XML document is closed, first write to the disk file, in order, the in-memory content of every subtree whose "modified" flag in the subtree index entry is "true", and then close the file.
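The update and write-back behaviour of the validating mode can be sketched as below. Everything here is an illustrative assumption: the class name `IndexedDocument`, the use of a Python string in place of the disk file, the storage of an end offset next to the start offset, and the choice to write subtrees back from last to first so that earlier offsets stay valid when a subtree changes length. The input is assumed to be well formed, with subtree tags that carry no attributes.

```python
class IndexedDocument:
    """Toy stand-in for an XML file plus its subtree index table (hypothetical API)."""

    def __init__(self, text, subtree_tag):
        self.text = text      # stands in for the disk file
        self.index = {}       # subtree number -> {"start", "end", "modified"}
        self.loaded = {}      # subtree number -> content currently held in memory
        open_tag, close_tag = f"<{subtree_tag}>", f"</{subtree_tag}>"
        pos, number = text.find(open_tag), 0
        while pos != -1:
            end = text.find(close_tag, pos) + len(close_tag)
            self.index[number] = {"start": pos, "end": end, "modified": False}
            number += 1
            pos = text.find(open_tag, end)

    def load(self, number):
        """Load one subtree into memory the first time it is needed."""
        entry = self.index[number]
        if number not in self.loaded:
            self.loaded[number] = self.text[entry["start"]:entry["end"]]
        return self.loaded[number]

    def update(self, number, tag, new_value):
        """Replace the text of <tag> inside one subtree and mark it modified."""
        content = self.load(number)
        start = content.find(f"<{tag}>") + len(tag) + 2
        end = content.find(f"</{tag}>")
        self.loaded[number] = content[:start] + new_value + content[end:]
        self.index[number]["modified"] = True

    def close(self):
        """Write back only the subtrees whose "modified" flag is true."""
        # Last subtree first, so earlier offsets stay valid if lengths change.
        for number in sorted(self.index, reverse=True):
            entry = self.index[number]
            if entry["modified"]:
                self.text = (self.text[:entry["start"]] + self.loaded[number]
                             + self.text[entry["end"]:])
                entry["modified"] = False
        return self.text
```

A short usage example: create `IndexedDocument(xml_text, "Book")`, call `update(1, "Author", "Changed")`, then call `close()`; only the second `Book` subtree is rewritten, and untouched subtrees are never loaded.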
Beneficial effect
The method for parsing XML files by means of an index proposed by the present invention has the following characteristics:
1. Before any operation on the XML document there is an index-building process, during which a key element index table and a subtree index table are created by specifying the subtree nodes and key nodes. IXP then lets the application run queries against these tables.
2. While the XML document is parsed, node information is recorded in index tables. During a lookup, the subtree index table is used to jump quickly to the subtree's position in the XML file, which speeds up retrieval.
3. The content of the designated region is read according to the subtree index table, and the target information is obtained by traversing that subtree. Because only part of the XML document is loaded into memory, the whole XML file never has to be loaded, which saves memory.
4. The IXP-nv method can handle every situation in which SAX is used, but it is more efficient than SAX: once initialization has formed the element subtrees, a lookup can quickly locate the subtree that contains a given element and traverse only that small range, whereas SAX usually has to traverse the document from the start of the file. IXP-nv uses very little memory, and its memory use does not grow as the file grows.
5. Parsing consists of two parts: looking up the index tables and matching elements within the subtree of the given element. The IXP implementation uses a hash function to optimize the index tables, so the index lookup time is almost constant, and the time to search an element subtree depends only on the size of that subtree. The IXP method therefore also performs well on large XML documents.
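Points 3 and 5 above can be illustrated with a short sketch: a hash table (here a Python `dict`) gives the byte range of a subtree, and a file seek reads only that range from disk. The function names, the byte-offset layout of the index, and the sample data are assumptions for illustration.

```python
import os
import tempfile

def build_offset_index(data, open_tag=b"<Book>", close_tag=b"</Book>"):
    """Map subtree number -> (start, end) byte offsets of each subtree."""
    index, pos = {}, data.find(open_tag)
    while pos != -1:
        end = data.find(close_tag, pos) + len(close_tag)
        index[len(index)] = (pos, end)
        pos = data.find(open_tag, end)
    return index

def read_subtree(path, index, number):
    """Read one subtree from disk without loading the rest of the file."""
    start, end = index[number]  # hash-table lookup, expected constant time
    with open(path, "rb") as f:
        f.seek(start)           # jump straight to the subtree
        return f.read(end - start)

def demo():
    """Write a small hypothetical document to a temp file and read subtree 1."""
    data = (b"<BookSet>"
            b"<Book><ISBN>7-302-04517</ISBN></Book>"
            b"<Book><ISBN>7-111-00001</ISBN></Book>"
            b"</BookSet>")
    fd, path = tempfile.mkstemp(suffix=".xml")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        return read_subtree(path, build_offset_index(data), 1)
    finally:
        os.remove(path)

print(demo())
```

The amount of data read depends on the size of the chosen subtree, not on the size of the file, which is the property the list above relies on.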
Parser performance usually depends on file characteristics, such as the ratio of tags to data, the extent of attribute use, the number of element subtrees, and the average size of an element subtree. Based on the IXP parsing method, the invention implements an IXP parser in C++. The embodiment compares the initialization time and mean access time of IXP with those of MSXML, which shows the effect of the invention.
Description of drawings
Fig. 1: Diagram of index building and the query process in IXP
Fig. 2: Comparison of the initialization performance of the DOM, SAX, and IXP methods
Fig. 3: Comparison of element access times for the DOM, SAX, and IXP methods
Fig. 4: Index table structures formed after initialization by the IXP method
Embodiment
The invention is now further described with reference to the embodiment and the drawings:
The implementation of the IXP parsing method is illustrated using the BookSet.xml document as an example. The DTD of the BookSet.xml document conforms to the definition in the following table:
IXP parsing traverses the XML data and splits the whole document into many element subtrees, each rooted at a designated element. From the DTD it can be seen that the subtree element here is "Book"; the positions of all start tags (<Book>) and end tags (</Book>) are recorded to form the element subtree index table. If the subtree containing the target element can be located quickly, retrieval is greatly accelerated. Any tag that is needed can be used to create an index, but an element with unique values is recommended. In this example ISBN is used as the key tag, and the text between "<ISBN>" and "</ISBN>" is recorded as the index value. After initialization, IXP has created the element subtree index table and the key tag index table shown in Fig. 4.
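The two tables described above can be sketched as follows. This is a minimal illustration, not the patent's C++ implementation: the sample document, the author names, and the function names are hypothetical, plain string search replaces real XML tokenizing, and the key index is built by walking each indexed subtree rather than by scanning the whole document for the key tag.

```python
# Hypothetical stand-in for BookSet.xml; only the ISBN value comes from the text.
XML = (
    "<BookSet>"
    "<Book><ISBN>7-302-04517</ISBN><Author>Zhang</Author></Book>"
    "<Book><ISBN>7-111-00001</ISBN><Author>Li</Author></Book>"
    "</BookSet>"
)

def build_subtree_index(text, subtree_tag):
    """Subtree index table: subtree number -> (start offset, end offset)."""
    open_tag, close_tag = f"<{subtree_tag}>", f"</{subtree_tag}>"
    table, pos = {}, text.find(open_tag)
    while pos != -1:
        end = text.find(close_tag, pos) + len(close_tag)
        table[len(table)] = (pos, end)
        pos = text.find(open_tag, end)
    return table

def build_key_index(text, subtree_index, key_tag):
    """Key tag index table: key value -> number of the subtree containing it."""
    open_tag, close_tag = f"<{key_tag}>", f"</{key_tag}>"
    table = {}
    for number, (start, end) in subtree_index.items():
        subtree = text[start:end]
        v_start = subtree.find(open_tag)
        if v_start == -1:
            continue  # this subtree has no key tag
        v_start += len(open_tag)
        table[subtree[v_start:subtree.find(close_tag, v_start)]] = number
    return table

subtree_index = build_subtree_index(XML, "Book")
isbn_index = build_key_index(XML, subtree_index, "ISBN")
print(subtree_index, isbn_index)
```

Both tables are ordinary hash tables, so looking up an ISBN and then its subtree offsets takes two dictionary accesses.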
To look up the author of the book whose ISBN is "7-302-04517", only positions 67 to 140 of the document need to be loaded and searched, a span of just 73 characters.
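A query of this kind can be sketched as below. The sample document, the `Author` tag, and the function name `query` are assumptions for illustration, and the offsets are computed from the sample string rather than taken from the figures in the text. In a real parser the slice `text[start:end]` would be a seek and read on the file.

```python
# Hypothetical stand-in for BookSet.xml.
XML = (
    "<BookSet>"
    "<Book><ISBN>7-302-04517</ISBN><Author>Zhang</Author></Book>"
    "<Book><ISBN>7-111-00001</ISBN><Author>Li</Author></Book>"
    "</BookSet>"
)

# Index tables as they might look after initialization (offsets derived from XML).
_first = XML.index("<Book>")
_second = XML.index("<Book>", _first + 1)
SUBTREE_INDEX = {0: (_first, _second), 1: (_second, XML.index("</BookSet>"))}
KEY_INDEXES = {"ISBN": {"7-302-04517": 0, "7-111-00001": 1}}

def query(text, key_indexes, subtree_index, key_name, key_value, target_tag):
    """Return the text of <target_tag> in the subtree whose key matches."""
    key_table = key_indexes.get(key_name)
    if key_table is None:
        # The method would go back and build an index for this key element.
        raise KeyError(f"no index table for key element {key_name!r}")
    number = key_table.get(key_value)
    if number is None:
        return None                 # no subtree has this key value
    start, end = subtree_index[number]
    subtree = text[start:end]       # only this span is loaded and searched
    open_tag, close_tag = f"<{target_tag}>", f"</{target_tag}>"
    v_start = subtree.find(open_tag)
    if v_start == -1:
        return None
    v_start += len(open_tag)
    return subtree[v_start:subtree.find(close_tag, v_start)]

print(query(XML, KEY_INDEXES, SUBTREE_INDEX, "ISBN", "7-302-04517", "Author"))
```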
Test cases:
Taking BookSet.xml as the model, 7 XML documents were created, with sizes of 13 KB, 119 KB, 238 KB, 471 KB, 934 KB, 1871 KB, and 4652 KB, each containing a different number of elements. The time overhead of initialization and parsing with MSXML 4.0 and with IXP was then measured and compared.
Initialization performance comparison:
Each XML document was initialized 10 times with DOM, SAX, and IXP; the average initialization times are shown in Fig. 2.
During initialization, DOM must traverse and parse the XML file and build the document tree in memory, so its initialization time is proportional to the size of the XML document. SAX only creates a handle to the open file and does not actually read any data, so its overhead is the smallest. IXP must parse the document and build the index tables, which takes some time. Because it does not need to build a complete tree structure, its overhead is much smaller than that of DOM, and this advantage becomes more pronounced as the document grows.
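A measurement of this kind could be set up as in the sketch below. It is only a timing harness under stated assumptions: Python's `xml.dom.minidom` stands in for a DOM parser, `build_index` is a simplified stand-in for IXP initialization, and `make_document` generates a hypothetical test document. It does not reproduce the patent's MSXML or C++ measurements, and no particular timings are implied.

```python
import time
from xml.dom.minidom import parseString

def mean_time_ms(func, arg, repeats=10):
    """Average wall-clock time of func(arg) over `repeats` runs, in milliseconds."""
    total = 0.0
    for _ in range(repeats):
        t0 = time.perf_counter()
        func(arg)
        total += time.perf_counter() - t0
    return total / repeats * 1000.0

def make_document(n_books):
    """Generate a hypothetical BookSet document with n_books subtrees."""
    books = "".join(f"<Book><ISBN>{i:09d}</ISBN></Book>" for i in range(n_books))
    return f"<BookSet>{books}</BookSet>"

def build_index(text):
    """Index-style initialization: record where each <Book> subtree starts."""
    index, pos = {}, text.find("<Book>")
    while pos != -1:
        index[len(index)] = pos
        pos = text.find("<Book>", pos + 1)
    return index

doc = make_document(1000)
print("DOM init (ms):  ", mean_time_ms(parseString, doc))
print("index init (ms):", mean_time_ms(build_index, doc))
```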
XML document parsing performance comparison:
With no syntax caching, each of the 7 XML documents was searched at random 1000 times. The mean access time broadly reflects the parsing performance of each method; the results are shown in Fig. 3.
Fig. 3 shows that the DOM parser performs well on small documents, but its overhead on large documents far exceeds expectations. The SAX parser performs better than DOM but is still inefficient. Further investigation found that, with the DTD used in the example, DOM and SAX perform about the same when the XML document is 720 KB. IXP initialization creates the element subtree table and the key tag index table, and parsing consists of two parts: looking up the index tables and matching elements within the given element subtree. IXP uses a hash function to optimize the index tables, so the index lookup time is almost constant, and the time to search an element subtree depends only on the size of that subtree. The IXP parser therefore also performs well on large XML documents.
In the worst case, IXP has to reinitialize the index table with a new index tag. Even when the DOM and SAX parsing times are compared with the sum of the IXP initialization and parsing times, IXP still performs considerably better than DOM and SAX, as shown in Table 1:
Table 1: DOM and SAX parsing times compared with the sum of IXP initialization and parsing times
Document size (KB) | DOM (ms) | SAX (ms) | IXP (ms) |
---|---|---|---|
13 | 8.9 | 11.1 | 7.0 |
119 | 11.1 | 44.4 | 12.8 |
238 | 17.8 | 66.6 | 22.1 |
471 | 55.5 | 122.1 | 39 |
934 | 1000.1 | 233.1 | 72.6 |
1871 | 5758.7 | 577.2 | 144.7 |
4652 | 35381.3 | 1144.4 | 426.5 |
1. Document size is given in kilobytes (KB) and time in milliseconds (ms).
2. The DOM and SAX columns give the time the DOM method and the SAX method, respectively, took to parse an element.
3. The IXP column gives the sum of the IXP method's initialization time and parsing time.
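The ratios implied by Table 1 can be worked out directly. The short sketch below copies the table values and divides the DOM and SAX times by the IXP time for each document size; the variable and function names are illustrative.

```python
# (document size in KB, DOM ms, SAX ms, IXP ms), copied from Table 1
TABLE_1 = [
    (13, 8.9, 11.1, 7.0),
    (119, 11.1, 44.4, 12.8),
    (238, 17.8, 66.6, 22.1),
    (471, 55.5, 122.1, 39.0),
    (934, 1000.1, 233.1, 72.6),
    (1871, 5758.7, 577.2, 144.7),
    (4652, 35381.3, 1144.4, 426.5),
]

def speedups(rows):
    """Return (size, DOM/IXP, SAX/IXP) per row; values above 1 favour IXP."""
    return [(size, dom / ixp, sax / ixp) for size, dom, sax, ixp in rows]

for size, dom_ratio, sax_ratio in speedups(TABLE_1):
    print(f"{size:>5} KB  DOM/IXP = {dom_ratio:6.2f}  SAX/IXP = {sax_ratio:5.2f}")
```

For the 4652 KB document the DOM time is about 83 times the IXP total and the SAX time about 2.7 times. For the 119 KB and 238 KB documents the DOM/IXP ratio is slightly below 1, so on these figures the advantage over DOM appears from 471 KB upward.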
The comparison experiments show that, for large XML documents, the IXP method parses far faster than the DOM and SAX methods. By providing a general interface, the method can be applied widely to the parsing of various XML documents and offers a new way to parse XML text.
Claims (1)
1. A method for parsing XML files by means of an index, characterized in that its steps are divided into an initialization phase and a parsing phase:
Initialization phase:
Step 1: Traverse the DTD document that describes the structure of the XML document, and extract the subtree tag names under the root node in the DTD document;
Step 2: First create an empty hash table. According to the extracted subtree tag names, traverse the XML document to be parsed, find and record the starting relative position of each subtree tag in the XML document, construct a new entry according to the entry structure, and add it to the hash table to form the subtree index table. The entry structure is as follows: for a read-only XML document, index number and starting relative position; for a non-read-only XML document, index number, starting relative position, and "modified" flag;
Step 3: Create the key element index table. First create an empty hash table, traverse the XML document by the tag name of the key element, extract the value of every occurrence of that tag name in the document, and query the subtree index table to obtain the index number of the subtree that contains the key element. Insert entries with the structure (subtree index number, value of the key element's tag) into the hash table to form the index table of this key element;
The parsing phase has two modes: the non-validating IXP parsing mode and the validating IXP parsing mode;
Read-only XML documents use the non-validating IXP mode, as follows:
Step 4: Extract the keyword name from the query condition and look up the index table of the key element that has the same name as the keyword. Match the value in the query condition against each entry in that key element index table, find the matching entry, and extract the subtree index number from it. If no index table exists for a key element with the same name as the keyword, return to step 3;
Step 5: Query the subtree index table by the subtree index number. After the matching entry is found in the subtree index table, extract its starting relative position, locate the start of the subtree in the XML document from that position, load the content of the whole subtree into memory, traverse within the subtree to find the element value that satisfies the query condition, and return it; the query then ends;
Non-read-only XML documents use the validating IXP mode, as follows:
Step 4': Extract the keyword name from the query condition and look up the key element index table that has the same name as the keyword. Match the value in the query condition against each entry in that key element index table, find the matching entry, and extract the subtree index number from it;
Step 5': Query the subtree index table by the subtree index number. After the matching entry is found, extract its "starting relative position", locate the start of the corresponding subtree in the XML document from that position, and load the content of the whole subtree into memory starting there. Then traverse within the subtree loaded in memory, find the value of the element that satisfies the query condition, and return it; the query then ends. If the element value also needs to be changed after the element has been found, continue with the following step;
Step 6': Update the value of the element found in step 5' above, and set the "modified" flag to "true" in the subtree index table entry that corresponds to this subtree. When the XML document is closed, first write to the disk file, in order, the in-memory content of every subtree whose "modified" flag in the subtree index entry is "true", and then close the file.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200810150767A CN101777045A (en) | 2008-09-01 | 2008-09-01 | Method for analyzing XML file by indexing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200810150767A CN101777045A (en) | 2008-09-01 | 2008-09-01 | Method for analyzing XML file by indexing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101777045A (en) | 2010-07-14 |
Family
ID=42513511
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN200810150767A Pending CN101777045A (en) | 2008-09-01 | 2008-09-01 | Method for analyzing XML file by indexing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101777045A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Open date: 20100714 |