CN101364235A

CN101364235A - XML document compressing method based on file difference

Info

Publication number: CN101364235A
Application number: CNA2008102006933A
Authority: CN
Inventors: 周傲英; 耿志华; 王晓玲
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2008-09-27
Filing date: 2008-09-27
Publication date: 2009-02-11

Abstract

The invention belongs to the field of database technology and provides a novel XML file compression algorithm, XDrill, comprising the following steps: a. dividing an XML file into 64K document segments; b. calculating the differences among the document segments; and c. compressing the differences among the document segments. The decompression algorithm performs these steps in reverse order. XDrill is an efficient XML file compression algorithm based on file difference calculation. By dividing the XML document tree, XDrill exploits redundant information within a document and across documents, thereby achieving good compression. Compared with traditional XML compression algorithms, XDrill operates on documents at a finer granularity and is more flexible to use.

Description

An XML document compression method based on file difference

Technical field

The invention belongs to the field of database technology and specifically relates to a novel and efficient XML document compression algorithm, the XDrill compression algorithm. By dividing an XML document, XDrill exploits redundant information within a document and across documents and thereby achieves good compression.

Background art

XML, a self-describing markup language, is widely used in many fields, for example electronic document exchange and electronic medical records in hospitals. XML is used extensively to describe semi-structured data. To support its self-describing nature, an XML document contains a large number of tags that delimit the structure of the document content. Storing structure and content side by side makes documents easy to query and to process by machine, but it also introduces considerable information redundancy. In resource-constrained systems, this redundancy wastes a great deal of bandwidth when data is transmitted. To address the problem, much research has proposed compression techniques for XML documents, including XMill, XGrind, and XPress.

XMill was the first to propose separating out structural information during XML document compression. Using the structural information already present in the XML document, XMill reorganizes the file structure for compression so that the structure is exploited as fully as possible to improve the compression ratio. XMill's strategy achieves excellent compression in most cases, but when the XML document is very small or contains very many kinds of tags, the extra structures introduced by steps such as dictionary encoding of tag names reduce the compression effect. Experiments show that XMill's compression effect is not significant for files smaller than 20K. As a result, XMill is poorly suited to compressing collections of many small, structurally simple XML documents.

Summary of the invention

The object of the invention is to propose XDrill, an XML compression algorithm based on file difference that achieves good compression both for a single XML document and for a collection of documents with similar structure and content.

The steps of the method are:

a. dividing the XML file into XML document segments of 64K;

b. calculating the differences between the XML document segments;

c. compressing the differences between the document segments.

The decompression steps are the reverse of this process.

The advantage of the invention is that, by dividing the XML document tree structure, the algorithm exploits redundant information within a document and across documents and achieves good compression. With extension, the proposed method can also support incremental storage of compressed XML documents and reduce the system overhead of update operations.

About the XML document tree:

In the invention, an XML document tree is T = {V, E, Root}, where V = {V_T, V_C}; V_T denotes the structure nodes, V_C denotes the content nodes, E denotes the edges of the XML document tree, and Root denotes the document root node.

About the division of the XML document tree and XML document segments:

Let W = {d_1, d_2, ..., d_i, ..., d_n} be the set of children of Root in T, and let the subtree rooted at d_i be

T_{d_{i}} = \{V', E', d_{i}\} \quad (i \in [1, n]).

The division of the XML document tree T is defined as

S = \{Root, W, \{T_{d_{i}} \mid i \in [1, n]\}\}.

The XML document corresponding to T_{d_i} is defined as the XML document segment

D_{d_{i}}.
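The division defined above can be sketched in Python as follows. This is a minimal illustration that assumes Python's standard `xml.etree.ElementTree` parser and cuts only at the children of Root; the 64K size rule of step a is not modelled, and the function name `divide` is hypothetical.

```python
import xml.etree.ElementTree as ET

def divide(xml_text):
    """Return (Root tag, segments): one serialized segment D_{d_i} per child d_i of Root.

    Assumption: every child of Root becomes its own segment; the 64K
    size rule of the method is not applied in this sketch.
    """
    root = ET.fromstring(xml_text)
    # Iterating over an element yields its direct children, i.e. the set W.
    segments = [ET.tostring(child, encoding="unicode") for child in root]
    return root.tag, segments

doc = "<lib><book><t>A</t></book><book><t>B</t></book></lib>"
root_tag, segments = divide(doc)
```

For the sample document, `segments` holds the two `<book>` subtrees as separate strings, which are the units later used as reference files and target files.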

About content information segments and structural information segments:

The content information segment is the character string obtained by traversing

T_{d_{i}} = \{V_{s}^{'}, V_{t}^{'}, E', d_{i}\}

in preorder and joining the contents of the content nodes, in the order in which they are encountered, with the connector "^" (assuming that the symbol "^" does not occur in the XML document).

The structural information segment is the character string obtained by traversing

T_{d_{i}} = \{V', E', d_{i}\}

in preorder and joining the structural content in that order. Each content node encountered during the traversal is replaced by the symbol "^".
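The two definitions above can be illustrated with a short Python sketch. It assumes that structure nodes are element tags, that content nodes are non-blank text nodes, and that attributes are ignored; none of these choices is stated in the text, and the helper name `split_segment` is hypothetical.

```python
import xml.etree.ElementTree as ET

SEP = "^"  # connector; assumed never to occur in the XML document

def split_segment(elem):
    """Preorder traversal returning (structural segment, content segment)."""
    structure, contents = [], []

    def visit(node):
        structure.append("<%s>" % node.tag)
        if node.text and node.text.strip():
            structure.append(SEP)          # content node replaced by "^"
            contents.append(node.text)
        for child in node:
            visit(child)
            if child.tail and child.tail.strip():
                structure.append(SEP)      # text following a child element
                contents.append(child.tail)
        structure.append("</%s>" % node.tag)

    visit(elem)
    return "".join(structure), SEP.join(contents)

segment = ET.fromstring("<book><t>A</t><y>2008</y></book>")
structure_str, content_str = split_segment(segment)
```

Separating the two strings means that segments sharing a structure but differing in content still produce identical structural segments.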

About reference files and target files:

Here both reference files and target files are XML document segments. A reference file serves as the reference for compressing other files. A target file is a file that is compressed using a reference file.

About the similarity between XML document segments:

To eliminate the effect of storing structure and content information together, the similarity of structural information and of content information is defined separately, using the traditional tree edit distance D_T and the text edit distance D_C. The similarity between two XML document segments is defined as the minimum number of basic operations needed to transform document segment A into document segment B:

{Diff}_{A \rightarrow B} = D_T + D_C

Theorem 1. Diff_{A → B} = D_T + D_C and Diff_{B → A} = D_T + D_C are equal. By the definitions of D_T and D_C, the basic operations of both edit distances are reversible; for example, an insertion when transforming A into B corresponds to a deletion when transforming B into A. The values of both edit distances are therefore the same in either direction.
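The symmetry stated in Theorem 1 can be checked for the text part D_C with a small sketch. It assumes that D_C is the classic Levenshtein distance with unit-cost insertion, deletion, and substitution; the tree edit distance D_T is not implemented here, and the function name is hypothetical.

```python
def text_edit_distance(a, b):
    """Levenshtein distance between strings a and b (assumed form of D_C)."""
    prev = list(range(len(b) + 1))        # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                         # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete ca
                           cur[j - 1] + 1,            # insert cb
                           prev[j - 1] + (ca != cb))) # substitute or match
        prev = cur
    return prev[-1]

d_ab = text_edit_distance("A^2008", "B^2009")
d_ba = text_edit_distance("B^2009", "A^2008")
```

Both calls return the same value because every insertion in one direction is matched by a deletion in the other.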

Corollary. Because Diff_{A → B} and Diff_{B → A} have the same value, the segment set

\pi = \{D_{d_{1}}, D_{d_{2}}, \ldots, D_{d_{i}}, \ldots, D_{d_{n}}\}

can be represented by an undirected weighted graph G = {V, E, W}, where

W_{ij} = {Diff}_{D_{d_{i}} \rightarrow D_{d_{j}}}.

About the reference file relation graph of the XDrill compression algorithm:

The reference file relation graph R = {V, E, W} of the XDrill compression algorithm is a directed, weighted, acyclic graph. An edge e_ij indicates that document segment i is used as the reference file to compress document segment j; j is then said to depend on i. First, the reference file relation graph must guarantee that each file is compressed only once. Second, if R contained a cycle, the compressed files could not be decompressed from the compressed information alone. For example, if D_i, D_j, and D_k depend on one another in a cycle, the three files cannot be decompressed without external information, as shown in Fig. 1.
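The acyclicity requirement can be illustrated with a short check. This sketch assumes Python 3.9 or later for the standard `graphlib` module; the mapping `depends_on` and the function name `decompression_order` are hypothetical.

```python
from graphlib import TopologicalSorter, CycleError

def decompression_order(depends_on):
    """Return an order in which segments can be decompressed, or None.

    depends_on maps each segment to the set of reference segments it was
    compressed against (edge e_ij means j depends on i).
    """
    try:
        # static_order() yields every reference before the segments using it.
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError:
        return None  # a cycle: no segment can be restored first

chain = decompression_order({"Dj": {"Di"}, "Dk": {"Dj"}})
cycle = decompression_order({"Di": {"Dk"}, "Dj": {"Di"}, "Dk": {"Dj"}})
```

For the chain, Di is restored first, then Dj, then Dk; for the circular case of Fig. 1 no valid order exists.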

Theorem 2. When each file is restricted to a single reference file (every node in the XDrill reference file relation graph has in-degree at most 1), the problem of making XDrill's compression optimal is equivalent to the minimum spanning tree problem on the undirected weighted graph.
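The reduction in Theorem 2 can be sketched with Prim's algorithm. The weight matrix below is made-up illustrative data, not measured Diff values, and the choice of Prim's algorithm and the function name are assumptions; any minimum spanning tree algorithm would serve.

```python
import heapq

def min_spanning_tree(weights):
    """Prim's algorithm on a symmetric weight matrix W[i][j] = Diff(i, j).

    Returns edges (i, j, w): segment j is compressed using segment i as its
    single reference file. Segment 0 is the tree root and has no reference.
    """
    n = len(weights)
    in_tree = [False] * n
    edges = []
    heap = [(0, 0, -1)]                    # (weight, node, parent)
    while heap:
        w, j, i = heapq.heappop(heap)
        if in_tree[j]:
            continue                       # already reached more cheaply
        in_tree[j] = True
        if i >= 0:
            edges.append((i, j, w))
        for k in range(n):
            if not in_tree[k]:
                heapq.heappush(heap, (weights[j][k], k, j))
    return edges

W = [[0, 2, 9],
     [2, 0, 3],
     [9, 3, 0]]
tree = min_spanning_tree(W)
```

Orienting the tree edges away from the root gives every non-root segment in-degree 1, matching the restriction in the theorem.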

Theorem 3. When each file uses two or more reference files, the problem of making XDrill's compression optimal can be formulated as a branching problem on a hypergraph, which has been proved to be NP-complete.

Because these methods require the relation between every pair of documents to be computed in advance, they consume a great deal of time and resources and are therefore not suitable for practical use.

Description of drawings

Fig. 1: Circular dependency graph.

Fig. 2: XDrill system framework.

Embodiment

About the XDrill compression system framework:

Fig. 2 shows the architecture of the XDrill compression system. The system consists mainly of two parts. The first is the SAX parser part, which reads the source document and generates the corresponding reference files and target files by calling the segmentation module. The second is the compressor module (Compressor), which calls the underlying zdelta compressor to compress the source document.

Here, an XML segment is a segment generated by the segmentation module.

About the XDrill compression algorithm and its flow:

The flow of the XDrill compression algorithm is shown in Table 1. The XDrill compression system maintains six system buffers. ref1_structure and ref2_structure store the structural segment information of the two reference files; ref1_contents and ref2_contents store their content segment information. tar_structure and tar_contents hold the information of the current XML file segment. The SAX parser continuously reads the XML document content and writes data into tar_structure and tar_contents. When the current SAX event reaches a cut point under the cutting rule, the zdelta compression tool is called to compress the target file using the corresponding structural and content segments (lines 7 and 8 of the code). Lines 9 to 14 of the code exchange data between the buffers: the data in ref2_structure and ref2_contents replaces the data in ref1_structure and ref1_contents, and the data in tar_structure and tar_contents then replaces the data in ref2_structure and ref2_contents. Finally, tar_structure and tar_contents are cleared, and the SAX parser continues reading the following data.

The decompression process is the reverse and is not described further.
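The two-reference buffer rotation, together with its reversal on decompression, can be sketched as follows. zdelta is not available in the Python standard library, so this sketch substitutes `zlib` with a preset dictionary (`zdict`) built from the two most recent segments; that substitution and the function names are assumptions, not the method's actual compressor.

```python
import zlib

def compress_segments(segments):
    """Compress each segment against the two previous segments."""
    ref1, ref2 = b"", b""
    blobs = []
    for target in segments:
        zdict = ref1 + ref2                # stand-in for the two reference files
        comp = (zlib.compressobj(level=9, zdict=zdict) if zdict
                else zlib.compressobj(level=9))
        blobs.append(comp.compress(target) + comp.flush())
        ref1, ref2 = ref2, target          # rotate buffers (lines 9 to 14)
    return blobs

def decompress_segments(blobs):
    """Reverse process: rebuild each segment, then rotate the same buffers."""
    ref1, ref2 = b"", b""
    segments = []
    for blob in blobs:
        zdict = ref1 + ref2
        dec = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
        target = dec.decompress(blob) + dec.flush()
        segments.append(target)
        ref1, ref2 = ref2, target
    return segments

structure_segments = [b"<book><t>^</t><y>^</y></book>"] * 3
restored = decompress_segments(compress_segments(structure_segments))
```

The same routine would be run once over the structural segments and once over the content segments, mirroring the separate `*_structure` and `*_contents` buffers.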

To reduce the operation granularity of XML documents, the XDrill system divides the XML document tree. Each generated XML document segment can still be compressed with an existing XML compression algorithm such as XMill. In general, however, an existing XML compression algorithm can exploit only the information redundancy inside a single XML document segment, whereas XDrill exploits the segment's own redundancy and additionally predicts part of the redundant information from the reference files.

About updating and incremental storage of XML documents:

Because XDrill compresses an XML document after dividing it, the operation granularity on the file is reduced accordingly. In the XDrill system, updating an XML document first requires locating the XML segment to be updated; zdelta is then used to compress the difference between the new document segment and the old information, and that difference is transmitted.

When the user needs incremental storage, the new XML information only has to be compressed using the corresponding original XML document segment as the reference file.

Table 1: XDrill algorithm

1: if (SAX-Event is a start-tag event)
2:     write the start-tag characters to tar_structure;
3: else if (SAX-Event is an end-tag event)
4:     if (the division rule is not satisfied)
5:         write the end-tag characters to tar_structure;
6:     else
7:         zdelta(ref1_structure, ref2_structure, tar_structure)
8:         zdelta(ref1_contents, ref2_contents, tar_contents)
9:         move the data in ref2_structure to ref1_structure;
10:        move the data in ref2_contents to ref1_contents;
11:        move the data in tar_structure to ref2_structure;
12:        clear the contents of tar_structure;
13:        move the data in tar_contents to ref2_contents;
14:        clear the contents of tar_contents;
15: else if (SAX-Event is a text event)
16:     write the symbol "^" to tar_structure;
17:     write the text content to tar_contents;
18:     write the symbol "^" to tar_contents;
19: End.
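The event handling in Table 1 can be sketched with Python's standard `xml.sax` module. The cutting rule used here (cut when a child of Root closes and the buffers hold at least `limit` characters) is a hypothetical stand-in, since the table does not spell out the division rule. The sketch also writes the end tag before cutting, which is an assumption, and it collects the segments instead of calling a compressor.

```python
import xml.sax

class Segmenter(xml.sax.ContentHandler):
    """Fills tar_structure / tar_contents and cuts segments at breakpoints."""

    def __init__(self, limit=64 * 1024):
        super().__init__()
        self.limit = limit                 # assumed size threshold (64K)
        self.depth = 0
        self.tar_structure, self.tar_contents = [], []
        self.segments = []                 # (structure, contents) pairs

    def startElement(self, name, attrs):   # table lines 1-2
        self.depth += 1
        self.tar_structure.append("<%s>" % name)

    def endElement(self, name):            # table lines 3-14
        self.tar_structure.append("</%s>" % name)
        self.depth -= 1
        size = sum(map(len, self.tar_structure)) + sum(map(len, self.tar_contents))
        if self.depth == 1 and size >= self.limit:
            self._cut()

    def characters(self, text):            # table lines 15-18
        if text.strip():
            self.tar_structure.append("^")
            self.tar_contents.append(text + "^")

    def endDocument(self):
        if self.tar_structure:
            self._cut()                    # flush whatever remains

    def _cut(self):
        # A full implementation would compress the buffers here.
        self.segments.append(("".join(self.tar_structure),
                              "".join(self.tar_contents)))
        self.tar_structure, self.tar_contents = [], []

handler = Segmenter(limit=1)               # tiny limit so every child is cut
xml.sax.parseString(b"<lib><book><t>A</t></book><book><t>B</t></book></lib>", handler)
```

With the tiny limit, each `<book>` element ends one segment; a realistic limit would group many children into each 64K segment.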

Claims

1. An XML document compression method based on file difference, comprising the steps of:

a. dividing the XML file into XML document segments of 64K;

b. calculating the differences between the XML document segments;

c. compressing the differences between the document segments;

wherein the decompression steps are the reverse of this process.