CN101364235A - XML document compressing method based on file difference - Google Patents
XML document compressing method based on file difference
- Publication number
- CN101364235A (application CNA2008102006933A, CN200810200693A)
- Authority
- CN
- China
- Prior art keywords
- document
- xml
- xml document
- file
- xdrill
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Document Processing Apparatus (AREA)
Abstract
The invention belongs to the field of database technology and provides a new XML file compression algorithm, XDrill, comprising the following steps: (a) dividing an XML file into 64 KB document fragments; (b) calculating the differences among the document fragments; and (c) compressing those differences. Decompression performs the same steps in reverse order. XDrill is an efficient XML compression algorithm based on file difference calculation: by partitioning the XML document tree, it exploits redundant information both within a document and among documents, thereby achieving a good compression ratio. Compared with traditional XML compression algorithms, XDrill operates on documents at a finer granularity and is more flexible to use.
Description
Technical field
The invention belongs to the field of database technology and specifically relates to a novel and efficient XML document compression algorithm, the XDrill compression algorithm. By partitioning an XML document, XDrill exploits redundant information within a document and between documents and thereby achieves a good compression ratio.
Background art
As a self-describing markup language, XML has been widely used in many fields, for example electronic document exchange and electronic medical records in hospitals. XML is widely used to describe semi-structured data. To support its self-describing nature, an XML document contains a large number of tags that delimit the structure of the document content. Storing structure and content together in this way makes documents convenient to query and to process by machine, but it also introduces a large amount of redundant information. In resource-constrained systems, this redundancy wastes considerable bandwidth when data is transmitted. To address the problem, many studies have proposed compression techniques for XML documents, including XMill, XGrind and XPress.
XMill was the first to propose separating structural information during XML compression. Using the structural information already present in the XML document, XMill reorganizes the file layout of the compressed output so that it can exploit that structure as fully as possible to improve the compression ratio. XMill's strategy achieves excellent compression in most cases, but when the XML document is very small or contains very many kinds of tags, the extra structures introduced by steps such as dictionary encoding of tag names degrade the compression ratio. Experiments show that XMill's compression is not effective when the file is smaller than 20 KB. As a result, XMill is poorly suited to compressing collections of many small, structurally simple XML documents.
Summary of the invention
The objective of the invention is to propose XDrill, an XML compression algorithm based on file differences that achieves a good compression ratio both for a single XML document and for a collection of documents with similar structure and content.
The steps of the method are:
A. divide the XML file into 64 KB XML document segments;
B. calculate the differences between the XML document segments;
C. compress the differences between the document segments.
Decompression performs these steps in reverse.
The advantage of the invention is that, by partitioning the XML document tree, the algorithm exploits redundant information within a document and between documents and achieves a good compression ratio. With extension, the proposed method can also support incremental storage of compressed XML documents and reduce the system overhead of update operations.
About the XML document tree:
In the present invention, an XML document tree is $T = \{V, E, Root\}$, where $V = \{V_T, V_C\}$, $V_T$ denotes the structure nodes, $V_C$ denotes the content nodes, $E$ denotes the edges of the XML document tree, and $Root$ denotes the document root node.
About the partition of the XML document tree and XML document segments:
Let the set of children of $Root$ in $T$ be $W = \{d_1, d_2, \ldots, d_i, \ldots, d_n\}$, and consider the subtree rooted at each $d_i$. A partition of the XML document tree $T$ is defined over these subtrees [formula not reproduced in this text], and the XML document corresponding to each part of the partition is called an XML document segment.
The content information segment is the string obtained by visiting the segment in preorder and joining the contents of the content nodes, in the order in which they are encountered, with the connector "^" (assuming that the "^" symbol does not occur in the XML document).
The structure information segment is the string obtained by concatenating the structural content of the segment in preorder traversal order. Each content node encountered during this traversal is replaced by the symbol "^".
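The two definitions above can be illustrated with a minimal Python sketch. It assumes that SAX event order is an acceptable stand-in for preorder traversal, that attributes and whitespace-only text can be ignored, and that "^" never occurs in the input. The names `SegmentSplitter` and `split_segment` are hypothetical and do not come from the patent.

```python
import xml.sax


class SegmentSplitter(xml.sax.ContentHandler):
    """Builds the structure string and the content string of one XML segment."""

    def __init__(self):
        super().__init__()
        self.structure = []   # tags, with "^" standing in for each content node
        self.contents = []    # text of each content node, each followed by "^"
        self._text = []       # text chunks of the content node being read

    def _flush_text(self):
        # SAX may deliver one text node in several chunks, so join them first.
        text = "".join(self._text).strip()
        self._text.clear()
        if text:
            self.structure.append("^")
            self.contents.append(text + "^")

    def startElement(self, name, attrs):
        self._flush_text()
        self.structure.append(f"<{name}>")   # attributes ignored (assumption)

    def endElement(self, name):
        self._flush_text()
        self.structure.append(f"</{name}>")

    def characters(self, content):
        self._text.append(content)


def split_segment(xml_text):
    """Return (structure information segment, content information segment)."""
    handler = SegmentSplitter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return "".join(handler.structure), "".join(handler.contents)
```

For the input `<a><b>x</b><c>y</c></a>`, the sketch yields the structure string `<a><b>^</b><c>^</c></a>` and the content string `x^y^`.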
About reference files and target files:
Here, both reference files and target files are XML document segments. A reference file serves as the reference against which other files are compressed. A target file is a file that is compressed using a reference file.
About the similarity between XML document segments:
To eliminate the effect of storing structure and content mixed together, the conventional tree edit distance $D_T$ and the text edit distance $D_C$ are used to define the similarity of the structure information and of the content information, respectively. The similarity between two XML document segments is defined as the minimum number of basic operations needed to transform document segment A into document segment B: $Diff_{A \to B} = D_T + D_C$.
Theorem 1. $Diff_{A \to B} = D_T + D_C$ and $Diff_{B \to A} = D_T + D_C$ are equal. By the definitions of $D_T$ and $D_C$, the basic operations of both edit distances are reversible: for example, an insertion when converting A to B corresponds to a deletion when converting B to A. The values of both edit distances are therefore the same in either direction.
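The symmetry argument can be checked on the text component $D_C$ with a short sketch. It assumes that $D_C$ is the standard Levenshtein distance with unit-cost insertion, deletion and substitution; the patent does not spell out the operation set. The tree edit distance $D_T$ is not implemented here, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (assumed form of D_C)."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                # delete ca
                cur[j - 1] + 1,             # insert cb
                prev[j - 1] + (ca != cb),   # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]
```

Because every insertion in one direction is a deletion in the other, `edit_distance(a, b)` and `edit_distance(b, a)` return the same value for any pair of content strings.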
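Corollary: because $Diff_{A \to B} = D_T + D_C$ and $Diff_{B \to A} = D_T + D_C$ have the same value, a set of segments [set notation not reproduced in this text] can be represented by an undirected weighted graph $G = \{V, E, W\}$ [definitions of $V$, $E$ and $W$ not reproduced in this text].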
About the reference file relation graph of the XDrill compression algorithm:
The reference file relation graph $R = \{V, E, W\}$ of the XDrill compression algorithm is a directed, weighted, acyclic graph. An edge $e_{ij}$ indicates that the $i$-th document segment is used as the reference file to compress the $j$-th document segment; $j$ is then said to depend on $i$. First, the reference file relation graph must guarantee that each file is compressed only once. Second, if $R$ contained a cycle, the compressed documents could not be decompressed from the compressed information alone. For example, if $D_i$, $D_j$ and $D_k$ depend on one another cyclically, the three files cannot be decompressed without external information, as shown in Fig. 1.
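The acyclicity requirement can be expressed as a small check. The sketch below assumes that each segment has at most one reference file, so the graph can be stored as a dictionary from a segment to its reference (or `None` for a segment stored without a reference). The function name `is_decompressible` and the segment labels are hypothetical.

```python
def is_decompressible(reference_of):
    """Return True if following reference links never runs into a cycle.

    reference_of maps each segment name to the name of its reference
    segment, or to None when the segment is compressed on its own.
    """
    for start in reference_of:
        seen = set()
        node = start
        while node is not None:
            if node in seen:        # came back to a visited segment: a cycle
                return False
            seen.add(node)
            node = reference_of.get(node)
    return True
```

With the cyclic dependency of the example, `{"Di": "Dk", "Dj": "Di", "Dk": "Dj"}`, the check fails, whereas a chain such as `{"D1": None, "D2": "D1", "D3": "D2"}` passes because decompression can start from `D1`.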
Theorem 2. When each file uses exactly one reference file (that is, every node in the XDrill reference file relation graph has in-degree at most 1), the problem of making XDrill's compression optimal is equivalent to the minimum spanning tree problem on the undirected weighted graph.
Theorem 3. When each file uses two or more reference files, the problem of making XDrill's compression optimal can be formulated as a branching problem on a hypergraph, which has been proved to be NP-complete.
Because this approach requires computing the relation between every pair of documents in advance, it consumes a great deal of time and resources and is therefore impractical in actual use.
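The single-reference case of Theorem 2 can be sketched with Prim's minimum spanning tree algorithm. The sketch assumes a precomputed symmetric matrix `diff` of pairwise $Diff$ values, which is exactly the all-pairs computation the text calls too expensive in practice. Segment 0 is arbitrarily chosen as the segment stored without a reference, and the name `choose_references` is hypothetical.

```python
import heapq


def choose_references(diff):
    """Pick one reference per segment by building a minimum spanning tree.

    diff[i][j] is the (symmetric) difference between segments i and j.
    Returns a dict mapping each segment index to its reference index,
    with the root segment 0 mapped to None.
    """
    n = len(diff)
    reference_of = {0: None}
    heap = [(diff[0][j], 0, j) for j in range(1, n)]
    heapq.heapify(heap)
    while heap and len(reference_of) < n:
        weight, i, j = heapq.heappop(heap)
        if j in reference_of:
            continue                     # j already has a cheaper reference
        reference_of[j] = i              # compress segment j against segment i
        for k in range(n):
            if k not in reference_of:
                heapq.heappush(heap, (diff[j][k], j, k))
    return reference_of
```

For the illustrative matrix `[[0, 1, 4], [1, 0, 2], [4, 2, 0]]`, segment 1 is compressed against segment 0 and segment 2 against segment 1, for a total weight of 3 instead of 5 when both use segment 0.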
Description of drawings
Fig. 1: Cyclic dependency graph.
Fig. 2: XDrill system framework.
Embodiment
About the XDrill compression system framework:
Fig. 2 shows the framework of the XDrill compression system. The system consists of two main parts. The first is the SAX parser part, which reads the source document and generates the corresponding reference files and target files by calling the segmentation module. The second is the compressor module (Compressor), which calls the underlying zdelta compressor to compress the source document.
Here, "XML segment" refers to an XML segment generated by the segmentation module.
About the XDrill compression algorithm and its flow:
The flow of the XDrill compression algorithm is shown in Table 1. The XDrill compression system maintains six system buffers. ref1_structure and ref2_structure store the structure segment information of the two reference files; ref1_contents and ref2_contents store the content segment information of the two reference files; tar_structure and tar_contents hold the information of the current XML file segment. The SAX parser continuously reads the XML document and writes data into tar_structure and tar_contents. When the current SAX event satisfies a cut point of the division rule, the zdelta compression tool is called to compress the target file using the corresponding structure and content segments (lines 7 and 8 of the code). Lines 9 to 14 of the code exchange data between the buffers: the data in ref2_structure and ref2_contents replace the data in ref1_structure and ref1_contents; likewise, the data in tar_structure and tar_contents replace the data in ref2_structure and ref2_contents; finally, tar_structure and tar_contents are cleared. The SAX parser then continues reading the following data.
Decompression is the reverse of this process and is not described further.
To reduce the granularity at which XML documents are operated on, the XDrill system splits the XML document tree. Each generated XML document segment can still be compressed with an existing XML compression algorithm such as XMill. In general, however, existing XML compression algorithms can only exploit the redundancy inside a single XML document segment, whereas XDrill exploits the segment's own redundancy and additionally predicts part of the redundant information from the reference files.
About updating and incremental storage of XML documents:
Because XDrill compresses an XML document after partitioning it, the granularity of file operations is reduced accordingly. In the XDrill system, updating an XML document first requires locating the XML segment to be updated; zdelta is then used to compress the file difference between the new document segment and the old one, and that difference is transmitted.
When the user needs incremental storage, the new XML information only needs to be compressed using the corresponding original XML document segment as the reference file.
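The update path can be sketched in Python. zdelta itself is not used here: as a stand-in, the sketch uses DEFLATE with a preset dictionary from the standard `zlib` module, which likewise lets the compressor refer back to data in a reference segment. The function names `compress_against` and `decompress_against` and the sample data are hypothetical.

```python
import zlib

WINDOW = 32 * 1024  # DEFLATE can only look back 32 KB into the dictionary


def compress_against(reference: bytes, target: bytes) -> bytes:
    """Compress target using a non-empty reference as preset dictionary."""
    comp = zlib.compressobj(zdict=reference[-WINDOW:])
    return comp.compress(target) + comp.flush()


def decompress_against(reference: bytes, blob: bytes) -> bytes:
    """Inverse of compress_against; needs the same reference segment."""
    decomp = zlib.decompressobj(zdict=reference[-WINDOW:])
    return decomp.decompress(blob) + decomp.flush()


# Illustrative update: the new segment differs from the old one in one tag.
old_segment = b"<item><id>^</id><name>^</name><price>^</price></item>" * 20
new_segment = old_segment.replace(b"price>", b"cost>")
update = compress_against(old_segment, new_segment)
assert decompress_against(old_segment, update) == new_segment
```

Only `update` has to be stored or transmitted. The receiver must already hold `old_segment`, which mirrors the requirement that a target file can only be decompressed together with its reference file.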
Table 1. XDrill algorithm
1: if (SAX-Event is a start-tag event)
2:     write the start-tag characters to tar_structure;
3: else if (SAX-Event is an end-tag event)
4:     if (the division rule is not satisfied)
5:         write the end-tag characters to tar_structure;
6:     else
7:         zdelta(ref1_structure, ref2_structure, tar_structure)
8:         zdelta(ref1_contents, ref2_contents, tar_contents)
9:         move the data in ref2_structure to ref1_structure;
10:        move the data in ref2_contents to ref1_contents;
11:        move the data in tar_structure to ref2_structure;
12:        clear the contents of tar_structure;
13:        move the data in tar_contents to ref2_contents;
14:        clear the contents of tar_contents;
15: else if (SAX-Event is a text event)
16:     write the symbol "^" to tar_structure;
17:     write the text content to tar_contents;
18:     write the symbol "^" to tar_contents;
19: End.
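The flow of Table 1 can be sketched as a SAX handler in Python. Several points are assumptions rather than part of the patent: `zlib` with a preset dictionary stands in for zdelta, the division rule is taken to be a simple size threshold, the end tag is written before the cut so that no tag is lost, and the last partial segment is flushed at the end of the document. The names `delta` and `XDrillSketch` are hypothetical.

```python
import xml.sax
import zlib


def delta(ref1, ref2, target):
    # Stand-in for zdelta (assumption): DEFLATE primed with both references.
    zdict = (ref1 + ref2)[-32768:]
    comp = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return comp.compress(target) + comp.flush()


class XDrillSketch(xml.sax.ContentHandler):
    def __init__(self, limit=64 * 1024):
        super().__init__()
        self.limit = limit                       # assumed division rule
        self.ref1_s = self.ref2_s = b""          # reference structure buffers
        self.ref1_c = self.ref2_c = b""          # reference contents buffers
        self.tar_s = bytearray()                 # tar_structure
        self.tar_c = bytearray()                 # tar_contents
        self.out = []                            # (structure, contents) deltas

    def startElement(self, name, attrs):         # Table 1, lines 1-2
        self.tar_s += f"<{name}>".encode()

    def endElement(self, name):                  # Table 1, lines 3-14
        self.tar_s += f"</{name}>".encode()
        if len(self.tar_s) + len(self.tar_c) >= self.limit:
            self.cut()

    def characters(self, content):               # Table 1, lines 15-18
        # Simplification: a text node split into chunks yields one "^" each.
        if content.strip():
            self.tar_s += b"^"
            self.tar_c += content.encode() + b"^"

    def cut(self):
        s, c = bytes(self.tar_s), bytes(self.tar_c)
        self.out.append((delta(self.ref1_s, self.ref2_s, s),
                         delta(self.ref1_c, self.ref2_c, c)))
        self.ref1_s, self.ref2_s = self.ref2_s, s    # rotate structure buffers
        self.ref1_c, self.ref2_c = self.ref2_c, c    # rotate contents buffers
        self.tar_s.clear()
        self.tar_c.clear()

    def endDocument(self):
        if self.tar_s or self.tar_c:             # flush the final segment
            self.cut()
```

A handler created with `XDrillSketch(limit=100)` and passed to `xml.sax.parseString` collects one pair of compressed blobs per segment in `out`. Each segment after the first is compressed against the two segments that precede it, matching the two-reference buffer scheme described for the XDrill system.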
Claims (1)
1. An XML document compression method based on file differences, wherein the steps of the method are:
A. dividing the XML file into 64 KB XML document segments;
B. calculating the differences between the XML document segments;
C. compressing the differences between the document segments;
and wherein decompression performs these steps in reverse.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2008102006933A CN101364235A (en) | 2008-09-27 | 2008-09-27 | XML document compressing method based on file difference |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2008102006933A CN101364235A (en) | 2008-09-27 | 2008-09-27 | XML document compressing method based on file difference |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101364235A true CN101364235A (en) | 2009-02-11 |
Family
ID=40390601
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNA2008102006933A Pending CN101364235A (en) | 2008-09-27 | 2008-09-27 | XML document compressing method based on file difference |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101364235A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101963944A (en) * | 2010-09-30 | 2011-02-02 | 用友软件股份有限公司 | Object storage method and system |
CN102379087A (en) * | 2009-03-31 | 2012-03-14 | 西门子公司 | Compression method, decompression method, compression unit, decompression unit and compressed document |
CN102473175A (en) * | 2009-07-31 | 2012-05-23 | 惠普开发有限公司 | Compression of XML data |
CN102571966A (en) * | 2012-01-16 | 2012-07-11 | 上海方正数字出版技术有限公司 | Network transmission method for large extensible markup language (XML) document |
CN102073663B (en) * | 2009-11-24 | 2013-01-30 | 北大方正集团有限公司 | Method and device for rapidly processing XML (Extensible Markup Language) compressed data |
CN106844479A (en) * | 2016-12-23 | 2017-06-13 | 光锐恒宇(北京)科技有限公司 | The compression of file, decompressing method and device |
CN109474594A (en) * | 2018-11-09 | 2019-03-15 | 北京海兰信数据科技股份有限公司 | Ship end data lightweight device, bank end data reduction apparatus, ship-shore cooperation data lightweight Transmission system and transmission method |
CN111352925A (en) * | 2012-09-28 | 2020-06-30 | 甲骨文国际公司 | Policy driven data placement and information lifecycle management |
US11636064B2 (en) | 2021-07-13 | 2023-04-25 | Microsoft Technology Licensing, Llc | Compression of localized files |
- 2008-09-27 CN CNA2008102006933A patent/CN101364235A/en active Pending
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102379087A (en) * | 2009-03-31 | 2012-03-14 | 西门子公司 | Compression method, decompression method, compression unit, decompression unit and compressed document |
CN102379087B (en) * | 2009-03-31 | 2015-07-08 | 西门子公司 | Compression method, decompression method, compression unit, decompression unit and compressed document |
CN102473175A (en) * | 2009-07-31 | 2012-05-23 | 惠普开发有限公司 | Compression of XML data |
CN102473175B (en) * | 2009-07-31 | 2015-02-18 | 惠普开发有限公司 | Compression of XML data |
CN102073663B (en) * | 2009-11-24 | 2013-01-30 | 北大方正集团有限公司 | Method and device for rapidly processing XML (Extensible Markup Language) compressed data |
CN101963944B (en) * | 2010-09-30 | 2015-04-15 | 用友软件股份有限公司 | Object storage method and system |
CN101963944A (en) * | 2010-09-30 | 2011-02-02 | 用友软件股份有限公司 | Object storage method and system |
CN102571966A (en) * | 2012-01-16 | 2012-07-11 | 上海方正数字出版技术有限公司 | Network transmission method for large extensible markup language (XML) document |
CN102571966B (en) * | 2012-01-16 | 2014-10-29 | 北大方正集团有限公司 | Network transmission method for large extensible markup language (XML) document |
CN111352925A (en) * | 2012-09-28 | 2020-06-30 | 甲骨文国际公司 | Policy driven data placement and information lifecycle management |
CN111352925B (en) * | 2012-09-28 | 2023-08-22 | 甲骨文国际公司 | Policy driven data placement and information lifecycle management |
CN106844479A (en) * | 2016-12-23 | 2017-06-13 | 光锐恒宇(北京)科技有限公司 | The compression of file, decompressing method and device |
CN106844479B (en) * | 2016-12-23 | 2020-07-07 | 光锐恒宇(北京)科技有限公司 | Method and device for compressing and decompressing file |
CN109474594A (en) * | 2018-11-09 | 2019-03-15 | 北京海兰信数据科技股份有限公司 | Ship end data lightweight device, bank end data reduction apparatus, ship-shore cooperation data lightweight Transmission system and transmission method |
US11636064B2 (en) | 2021-07-13 | 2023-04-25 | Microsoft Technology Licensing, Llc | Compression of localized files |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication | Open date: 20090211 |