CN101364235A - XML document compressing method based on file difference - Google Patents

Info

Publication number
CN101364235A
Authority
CN
China
Prior art keywords
document
xml
xml document
file
xdrill
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2008102006933A
Other languages
Chinese (zh)
Inventor
周傲英
耿志华
王晓玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University
Priority to CNA2008102006933A
Publication of CN101364235A
Legal status: Pending

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention belongs to the field of database technology and provides a novel XML file compression algorithm, XDrill, comprising the following steps: a. dividing an XML file into 64K document segments; b. calculating the differences between document segments; and c. compressing the differences between document segments. Decompression performs the same steps in reverse order. XDrill is an efficient XML file compression algorithm based on file difference calculation. By dividing the XML document tree, XDrill exploits redundant information both inside a document and between documents, and thereby achieves a good compression effect. Compared with traditional XML compression algorithms, XDrill offers a finer granularity of document operations and more flexible use.

Description

An XML document compression method based on file difference
Technical field
The invention belongs to the field of database technology and specifically relates to a novel and efficient XML document compression algorithm, the XDrill compression algorithm. By dividing XML documents, XDrill exploits redundant information inside a document and between documents and achieves a good compression effect.
Background art
XML is a self-describing markup language that is widely used in many fields, for example the exchange of electronic documents and electronic medical records in hospitals. XML is used extensively to describe semi-structured data. To support its self-describing nature, an XML document contains a large number of tags that delimit the structure of the document content. Storing structure and content together in this way makes document querying and machine interaction convenient, but it also introduces a great deal of redundant information. In some resource-constrained systems, the redundancy of XML files wastes a large amount of bandwidth when data is transmitted. To address this problem, many research efforts have proposed compression techniques for XML documents, including XMill, XGrind and XPress.
XMill was the first to propose separating the structural information during XML document compression. Using the structural information already present in the XML document, XMill reorganizes the file structure for compression so that the structural information is exploited as efficiently as possible to improve the compression ratio. The compression strategy of XMill achieves excellent results in most cases. However, when the XML document is very small or contains very many kinds of tags, the additional structure introduced by steps such as dictionary-encoding the tag names can reduce the compression effect. Experiments show that XMill's compression effect is not significant when the file is smaller than 20K. These characteristics make XMill poorly suited to compressing collections of many small, simply structured XML documents.
Summary of the invention
The object of the invention is to propose XDrill, an XML compression algorithm based on file difference that achieves a good compression effect both for a single XML document and for a collection of documents with similar structure and content.
The steps of the method are:
A. divide the XML file into XML document segments of 64K;
B. calculate the differences between the XML document segments;
C. compress the differences between the document segments.
The decompression steps are the reverse of this process.
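The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the input is cut by byte count rather than along the document tree, and Python's `zlib` with a preset dictionary stands in for a real delta compressor (the patent itself names zdelta later on). Each segment is encoded relative to the segment before it, and decompression replays the same chain.

```python
import zlib

SEGMENT_SIZE = 64 * 1024  # 64K segments, as in step A


def split_bytes(data: bytes, size: int = SEGMENT_SIZE) -> list[bytes]:
    """Step A (simplified): cut the input into fixed-size pieces."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def compress_segments(segments: list[bytes]) -> list[bytes]:
    """Steps B and C: encode each segment relative to its predecessor."""
    blobs = []
    reference = b""
    for seg in segments:
        # A preset dictionary lets zlib reuse byte runs from the reference.
        comp = zlib.compressobj(zdict=reference) if reference else zlib.compressobj()
        blobs.append(comp.compress(seg) + comp.flush())
        reference = seg
    return blobs


def decompress_segments(blobs: list[bytes]) -> list[bytes]:
    """Reverse process: rebuild each segment from the previous one."""
    segments = []
    reference = b""
    for blob in blobs:
        dec = zlib.decompressobj(zdict=reference) if reference else zlib.decompressobj()
        seg = dec.decompress(blob) + dec.flush()
        segments.append(seg)
        reference = seg
    return segments
```

Note that deflate only looks back 32K into a preset dictionary, so this stand-in exploits less of a 64K reference than a dedicated delta compressor would.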
The advantage of the invention is that, by dividing the XML document tree structure, the algorithm exploits redundant information inside a document and between documents and achieves a good compression effect. With extension, the proposed method can support incremental storage of compressed XML documents and reduce the system overhead of update operations.
About the XML document tree:
In the invention, an XML document tree is $T = \{V, E, Root\}$, where $V = \{V_T, V_C\}$; $V_T$ denotes the structure nodes, $V_C$ denotes the content nodes, $E$ denotes the edges of the XML document tree, and $Root$ denotes the document root node.
About the division of the XML document tree and XML document segments:
Let the set of children of $Root$ in $T$ be $W = \{d_1, d_2, \ldots, d_i, \ldots, d_n\}$, and let the subtree rooted at $d_i$ be $T_{d_i} = \{V', E', d_i\}$ $(i \in [1, n])$. The division of the XML document tree $T$ is defined as $S = \{Root, W, \{T_{d_i} \mid i \in [1, n]\}\}$. The XML document corresponding to $T_{d_i}$ is defined as the XML document segment $D_{d_i}$.
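As a minimal sketch of this division, the snippet below serializes each child subtree of the root as its own document segment. The function name `split_document` is hypothetical, and the use of `xml.etree.ElementTree` is an assumption for illustration; the patent does not prescribe a parser for this step. The sketch also ignores the 64K size target and simply produces one segment per child of the root.

```python
import xml.etree.ElementTree as ET


def split_document(xml_text: str) -> tuple[str, list[str]]:
    """Return the root tag and one serialized segment per child of the root."""
    root = ET.fromstring(xml_text)
    # Each child d_i of Root is the root of a subtree T_{d_i};
    # serializing it gives the corresponding document segment D_{d_i}.
    segments = [ET.tostring(child, encoding="unicode") for child in root]
    return root.tag, segments
```

For a document `<lib><book>...</book><book>...</book></lib>` this yields the tag `lib` and two `book` segments.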
About the content information segment and structural information segment of an XML document segment $D_{d_i}$:
The content information segment is the character string obtained by traversing $T_{d_i} = \{V'_s, V'_t, E', d_i\}$ in preorder and joining the node contents, in the order in which the content nodes are encountered, with the connector "^" (the symbol "^" is assumed not to occur in the XML document).
The structural information segment is the character string obtained by concatenating the structural content in the order of a preorder traversal of $T_{d_i} = \{V', E', d_i\}$. Each content node encountered during the traversal is replaced by the symbol "^".
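A small sketch may make the two strings concrete. It assumes `xml.etree.ElementTree` as the tree representation, ignores attributes, and skips whitespace-only text; the function name `structure_and_content` is hypothetical.

```python
import xml.etree.ElementTree as ET

SEP = "^"  # connector; assumed absent from the document, as the text states


def structure_and_content(elem: ET.Element) -> tuple[str, str]:
    """Preorder traversal producing (structure string, content string)."""
    structure: list[str] = []
    contents: list[str] = []

    def visit(node: ET.Element) -> None:
        structure.append(f"<{node.tag}>")
        if node.text and node.text.strip():
            structure.append(SEP)              # placeholder for a content node
            contents.append(node.text.strip())
        for child in node:
            visit(child)
            if child.tail and child.tail.strip():
                structure.append(SEP)          # text following a child element
                contents.append(child.tail.strip())
        structure.append(f"</{node.tag}>")

    visit(elem)
    return "".join(structure), SEP.join(contents)
```

For the segment `<book><title>XML</title><year>2008</year></book>` the structure string is `<book><title>^</title><year>^</year></book>` and the content string is `XML^2008`.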
About reference files and target files:
Here both reference files and target files are XML document segments. A reference file serves as the reference for compressing other files. A target file is a file that is compressed using a reference file.
About the similarity between XML document segments:
To eliminate the effect of storing structure and content mixed together, the traditional tree edit distance $D_T$ and the text edit distance $D_C$ are used to define the similarity of the structural information and of the content information respectively. The similarity between two XML document segments is defined as the minimum number of basic operations needed to transform document segment A into document segment B: $Diff_{A \to B} = D_T + D_C$.
Theorem 1. $Diff_{A \to B} = D_T + D_C$ and $Diff_{B \to A} = D_T + D_C$ are equal. By the definitions of $D_T$ and $D_C$, the basic operations of both edit distances are reversible; for example, an insertion when converting A to B corresponds to a deletion when converting B to A. The values of both edit distances are therefore the same in either direction.
Corollary. Because $Diff_{A \to B}$ and $Diff_{B \to A}$ have the same value, a set of segments $\pi = \{D_{d_1}, D_{d_2}, \ldots, D_{d_i}, \ldots, D_{d_n}\}$ can be represented by an undirected weighted graph $G = \{V, E, W\}$, where $W_{ij} = Diff_{D_{d_i} \to D_{d_j}}$.
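The symmetry claimed in Theorem 1 can be checked numerically with a small sketch. Note the substitution made here: a true tree edit distance is considerably more involved, so this example uses the plain Levenshtein distance on the structure string as a stand-in for $D_T$, and the Levenshtein distance on the content string for $D_C$. Both function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def diff(seg_a: tuple[str, str], seg_b: tuple[str, str]) -> int:
    """Diff = D_T + D_C for segments given as (structure, content) strings."""
    (struct_a, cont_a), (struct_b, cont_b) = seg_a, seg_b
    d_t = edit_distance(struct_a, struct_b)  # stand-in for the tree edit distance
    d_c = edit_distance(cont_a, cont_b)
    return d_t + d_c
```

Because every unit operation has an inverse of the same cost, `diff(a, b)` and `diff(b, a)` agree, which is what allows the weights $W_{ij}$ to label an undirected graph.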
About the reference file relation graph of the XDrill compression algorithm:
The reference file relation graph of the XDrill compression algorithm, $R = \{V, E, W\}$, is a directed weighted acyclic graph. An edge $e_{ij}$ indicates that the i-th document segment is used as the reference file to compress the j-th document segment; j is then said to depend on i. First, the reference file relation graph must guarantee that each file is compressed only once. Second, if $R$ contains a cycle, the compressed documents cannot be decompressed from the compressed information alone. For example, if $D_i$, $D_j$ and $D_k$ depend on one another, none of the three files can be decompressed without relying on external information, as shown in Fig. 1.
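The acyclicity requirement can be illustrated with a short check. The sketch assumes the single-reference case, so the graph can be stored as a hypothetical mapping `reference_of` in which `reference_of[j] == i` encodes the edge $e_{ij}$ and `None` marks a segment compressed without a reference.

```python
from typing import Optional


def can_decompress_all(reference_of: dict[str, Optional[str]]) -> bool:
    """True if every segment's chain of references ends at a standalone segment."""
    for start in reference_of:
        seen = set()
        node: Optional[str] = start
        while node is not None:
            if node in seen:
                return False  # circular dependency: no segment can be rebuilt first
            seen.add(node)
            node = reference_of.get(node)
    return True
```

A chain such as `{"Di": None, "Dj": "Di", "Dk": "Dj"}` passes the check, whereas the mutual dependency `{"Di": "Dk", "Dj": "Di", "Dk": "Dj"}` does not.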
Theorem 2. When each file is restricted to a single reference file (the in-degree of any node in the XDrill reference file relation graph is at most 1), the problem of optimizing the compression effect of XDrill is equivalent to the minimum spanning tree problem on the undirected weighted graph.
Theorem 3. When each file uses two or more reference files, the problem of optimizing the compression effect of XDrill can be abstracted as a branching problem on a hypergraph, which has been proved to be NP-complete.
Because this computation requires the relation between every pair of documents to be calculated in advance, it consumes a great deal of time and resources and is therefore not suitable for practical use.
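For the single-reference case of Theorem 2, the optimal assignment of reference files can be sketched with Prim's minimum spanning tree algorithm. The function name `reference_tree`, the matrix layout and the choice of segment 0 as the standalone root are all assumptions made for illustration. The sketch also shows why the approach is costly: it needs the full matrix of pairwise `Diff` values up front.

```python
import heapq


def reference_tree(weights: list[list[int]], root: int = 0) -> dict[int, int]:
    """Return {segment: reference} for a minimum spanning tree.

    weights[i][j] is the symmetric Diff value between segments i and j.
    The root segment is compressed without a reference.
    """
    n = len(weights)
    reference: dict[int, int] = {}
    in_tree = {root}
    heap = [(weights[root][j], root, j) for j in range(n) if j != root]
    heapq.heapify(heap)
    while heap and len(in_tree) < n:
        w, i, j = heapq.heappop(heap)
        if j in in_tree:
            continue                      # stale entry, j already attached
        in_tree.add(j)
        reference[j] = i                  # compress segment j against segment i
        for k in range(n):
            if k not in in_tree:
                heapq.heappush(heap, (weights[j][k], j, k))
    return reference
```

Orienting every tree edge away from the root gives a directed acyclic reference graph in which each node has in-degree at most 1.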
Description of drawings
Fig. 1: Circular dependency graph.
Fig. 2: XDrill system framework.
Embodiment
About the XDrill compression system framework:
Fig. 2 shows the framework of the XDrill compression system. The system consists mainly of two parts. The first is the SAX parser, which reads the source document and generates the corresponding reference files and target files by calling the segmentation module. The second is the compressor module (Compressor), which calls the underlying zdelta compressor to compress the source document.
Here an XML segment is the XML segment produced by the segmentation module.
About the XDrill compression algorithm and its flow:
The flow of the XDrill compression algorithm is shown in Table 1. The XDrill compression system maintains six system buffers. ref1_structure and ref2_structure store the structure segment information of the two reference files. ref1_contents and ref2_contents store the content segment information of the two reference files. tar_structure and tar_contents hold the information of the current XML file segment. The SAX parser continuously reads the XML document content and writes data into tar_structure and tar_contents. When the current SAX event reaches a breakpoint that satisfies the segmentation rule, the zdelta compression tool is called to compress the target file using the corresponding structure and content segments (lines 7 and 8 of the code). Lines 9 to 14 of the code exchange data between the corresponding buffers. The data in ref2_structure and ref2_contents is used to update ref1_structure and ref1_contents. Likewise, the data in tar_structure and tar_contents replaces the data in ref2_structure and ref2_contents. Finally tar_structure and tar_contents are cleared, and the SAX parser continues reading the following data.
Decompression is the reverse process and is not described further.
To reduce the operation granularity of XML documents, the XDrill system splits the XML document tree. Each generated XML document segment can still be compressed with an existing XML compression algorithm such as XMill. In general, however, an existing XML compression algorithm can exploit only the information redundancy inside a single XML document segment, whereas XDrill exploits the segment's own redundancy and additionally predicts part of the redundant information from the reference files.
About updates and incremental storage of XML documents:
Because XDrill compresses an XML document after dividing it, the operation granularity of the file is reduced accordingly. In the XDrill system, updating an XML document first requires locating the XML segment to be updated; zdelta is then used to compress the difference between the new document segment and the old information, and that difference is transmitted.
When the user needs incremental storage, the new XML information only has to be compressed using the corresponding original XML document segment as its reference file.
Table 1: XDrill algorithm
1: if (SAX-Event is a start-tag event)
2:     write the start-tag characters to tar_structure;
3: else if (SAX-Event is an end-tag event)
4:     if (the segmentation rule is not satisfied)
5:         write the end-tag characters to tar_structure;
6:     else
7:         zdelta(ref1_structure, ref2_structure, tar_structure)
8:         zdelta(ref1_contents, ref2_contents, tar_contents)
9:         move the data in ref2_structure to ref1_structure;
10:        move the data in ref2_contents to ref1_contents;
11:        move the data in tar_structure to ref2_structure;
12:        clear tar_structure;
13:        move the data in tar_contents to ref2_contents;
14:        clear tar_contents;
15: else if (SAX-Event is a text event)
16:     write the symbol "^" to tar_structure;
17:     write the text content to tar_contents;
18:     write the symbol "^" to tar_contents;
19: End.
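The flow in Table 1 can be approximated in runnable form as below. Several things in this sketch are assumptions rather than the patent's method: the class name `XDrillSketch` is hypothetical, the segmentation rule is a simple size threshold, `zlib` with a preset dictionary built from the two reference buffers stands in for the two-reference zdelta call, and the end tag is always written before the rule is tested so that the segment stays well formed.

```python
import xml.sax
import zlib


class XDrillSketch(xml.sax.ContentHandler):
    """Follows Table 1; zlib with a preset dictionary stands in for zdelta."""

    def __init__(self, cut_size: int = 64 * 1024) -> None:
        super().__init__()
        self.cut_size = cut_size  # hypothetical segmentation rule: size threshold
        self.ref1 = {"structure": b"", "contents": b""}
        self.ref2 = {"structure": b"", "contents": b""}
        self.tar = {"structure": b"", "contents": b""}
        self.output: list[tuple[bytes, bytes]] = []

    def _delta(self, kind: str) -> bytes:
        # Both reference buffers together form the preset dictionary.
        dictionary = self.ref1[kind] + self.ref2[kind]
        comp = zlib.compressobj(zdict=dictionary) if dictionary else zlib.compressobj()
        return comp.compress(self.tar[kind]) + comp.flush()

    def _cut(self) -> None:
        # Lines 7-8: compress the target buffers against the references.
        self.output.append((self._delta("structure"), self._delta("contents")))
        # Lines 9-11 and 13: ref2 becomes ref1, the target becomes ref2.
        self.ref1, self.ref2 = self.ref2, self.tar
        # Lines 12 and 14: start with empty target buffers.
        self.tar = {"structure": b"", "contents": b""}

    def startElement(self, name, attrs):
        self.tar["structure"] += f"<{name}>".encode()      # lines 1-2

    def endElement(self, name):
        self.tar["structure"] += f"</{name}>".encode()     # line 5
        size = len(self.tar["structure"]) + len(self.tar["contents"])
        if size >= self.cut_size:                            # lines 4 and 6
            self._cut()

    def characters(self, content):
        # SAX may deliver one text node in several chunks; each non-blank
        # chunk is treated as a separate content node in this sketch.
        if content.strip():
            self.tar["structure"] += b"^"                  # line 16
            self.tar["contents"] += content.encode() + b"^"  # lines 17-18

    def endDocument(self):
        if self.tar["structure"]:
            self._cut()  # flush the final, possibly short, segment
```

A handler instance can be passed to `xml.sax.parseString(data, handler)`, after which `handler.output` holds one pair of compressed structure and content blobs per segment.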

Claims (1)

1. An XML document compression method based on file difference, the steps of the method being:
A. dividing the XML file into XML document segments of 64K;
B. calculating the differences between the XML document segments;
C. compressing the differences between the document segments;
the decompression steps being the reverse of this process.
CNA2008102006933A 2008-09-27 2008-09-27 XML document compressing method based on file difference Pending CN101364235A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2008102006933A CN101364235A (en) 2008-09-27 2008-09-27 XML document compressing method based on file difference

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2008102006933A CN101364235A (en) 2008-09-27 2008-09-27 XML document compressing method based on file difference

Publications (1)

Publication Number Publication Date
CN101364235A true CN101364235A (en) 2009-02-11

Family

ID=40390601

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2008102006933A Pending CN101364235A (en) 2008-09-27 2008-09-27 XML document compressing method based on file difference

Country Status (1)

Country Link
CN (1) CN101364235A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20090211