Summary of the invention
The present invention overcomes the deficiencies of the prior art by providing a compression method that replaces the elements and attributes of an XML document with variable-length identification codes, so that XML documents containing a large number of elements and attributes can be compressed efficiently and with low overhead.
To achieve the above object, the invention provides an XML compression method based on variable-length identification codes, the method comprising:
For each element in the XML document, defining a corresponding variable-length identification code in a data dictionary; and for each attribute in the XML document, defining a corresponding variable-length identification code in the data dictionary;
The sender replacing, one by one, the elements and attributes in the XML document with the corresponding variable-length identification codes defined in the data dictionary, thereby achieving replacement compression of the XML document;
The receiver replacing the variable-length identification codes in the received replacement-compressed XML document with the corresponding elements and attributes defined in the data dictionary, thereby decompressing the XML document;
Defining a corresponding variable-length identification code in the data dictionary for each element and for each attribute in the XML document comprises:
For each element in the XML document, using either an 8-bit or a 16-bit identification code. Of the high 4 bits, the 1st bit indicates whether the code is in the XML format, the 2nd bit indicates whether it is an element, the 3rd bit indicates whether it is an end element, and the 4th bit indicates whether two 8-bit bytes are needed to represent the same element; the remaining bits identify the element;
For each attribute in the XML document, using either an 8-bit or a 16-bit identification code. Of the high 3 bits, the 1st bit indicates whether the code is in the XML format, the 2nd bit indicates whether it is an attribute, and the 3rd bit indicates whether two 8-bit bytes are needed to represent the same attribute; the remaining bits identify the attribute, and the value of the attribute is represented as a character string.
In the data dictionary, for each element in the XML document, a usage-frequency analysis method is used to decide whether the element is represented by an 8-bit or a 16-bit identification code.
In the data dictionary, for each element in the XML document, a byte-cost analysis method (based on the number of bytes the element consumes) is used to decide whether the element is represented by an 8-bit or a 16-bit identification code.
In the data dictionary, for each attribute in the XML document, a usage-frequency analysis method is used to decide whether the attribute is represented by an 8-bit or a 16-bit identification code.
In the data dictionary, for each attribute in the XML document, a byte-cost analysis method (based on the number of bytes the attribute consumes) is used to decide whether the attribute is represented by an 8-bit or a 16-bit identification code.
The present invention also provides an XML compression device, comprising: an XML reading module, a compression data dictionary storage module, a tag replacement compression module, and a general compression module; wherein:
The XML reading module is configured to read the XML byte stream data;
The compression data dictionary storage module is configured to store the data dictionary;
In the data dictionary, a corresponding variable-length identification code is defined for each element in the XML document, and a corresponding variable-length identification code is defined for each attribute in the XML document;
The tag replacement compression module is configured to replace, one by one, the elements and attributes in the XML document with the corresponding variable-length identification codes defined in the data dictionary held by the compression data dictionary storage module, generating the replacement-compressed XML document;
The general compression module is configured to further compress the data dictionary and the replacement-compressed XML document with a general-purpose compression algorithm, generating the compressed data.
The present invention further provides an XML decompression device, comprising: a general decompression module, a tag replacement decompression module, and a decompression data dictionary storage module, wherein:
The general decompression module is configured to decompress the received compressed data with a general-purpose decompression algorithm;
The decompression data dictionary storage module is configured to store the data dictionary;
In the data dictionary, a corresponding variable-length identification code is defined for each element in the XML document, and a corresponding variable-length identification code is defined for each attribute in the XML document;
The tag replacement decompression module is configured to use the data dictionary held by the decompression data dictionary storage module to replace, one by one, the variable-length identification codes in the compressed XML document with the corresponding elements and attributes, obtaining the original XML document.
The data dictionary used by the sender has the same content as the data dictionary used by the receiver. The sender replaces, one by one, the elements and attributes in the XML document with the variable-length identification codes defined in the data dictionary, achieving replacement compression of the XML document. After receiving the compressed data, the receiver replaces, one by one, the variable-length identification codes in the received XML document with the corresponding elements and attributes defined in the data dictionary, achieving replacement decompression; the result of decompression is the original XML document. The invention thus solves the problem that XML documents containing a large number of elements and attributes produce a large volume of data during storage and transmission.
Embodiment
Fig. 1 is a flowchart of the XML compression method based on variable-length identification codes according to the present invention. As shown in Fig. 1, the method comprises the following steps:
Step 101: for each element in the XML document, define a corresponding variable-length identification code in the data dictionary; and for each attribute in the XML document, define a corresponding variable-length identification code in the data dictionary;
Step 102: the sender replaces, one by one, the elements and attributes in the XML document with the corresponding variable-length identification codes defined in the data dictionary, achieving replacement compression of the XML document and generating the compressed XML document;
Step 103: the sender compresses the compressed XML document and the data dictionary with a general-purpose compression algorithm such as Deflate or LZW;
Step 104: the receiver decompresses the compressed XML document and data dictionary with the matching general-purpose decompression algorithm, such as Deflate or LZW, obtaining the data dictionary and the compressed XML document;
Step 105: the receiver replaces the variable-length identification codes in the received compressed XML document with the corresponding elements and attributes defined in the data dictionary, thereby decompressing the XML document.
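Steps 103 and 104 can be illustrated with a minimal Python sketch. It uses the standard zlib module, which implements Deflate. The JSON serialization of the data dictionary, the 4-byte length prefix, and the function names pack and unpack are assumptions made for illustration only; the description does not specify how the dictionary and the document are bundled.

```python
import json
import zlib


def pack(dictionary: dict, replaced_body: bytes) -> bytes:
    """Step 103 (sketch): bundle dictionary and replaced document, then deflate."""
    # Assumption: the dictionary is serialized as JSON.
    header = json.dumps(dictionary, sort_keys=True).encode("utf-8")
    # Assumption: a 4-byte big-endian length prefix separates header and body.
    payload = len(header).to_bytes(4, "big") + header + replaced_body
    return zlib.compress(payload)


def unpack(blob: bytes) -> tuple:
    """Step 104 (sketch): inflate, then split dictionary from replaced document."""
    payload = zlib.decompress(blob)
    header_len = int.from_bytes(payload[:4], "big")
    dictionary = json.loads(payload[4:4 + header_len].decode("utf-8"))
    return dictionary, payload[4 + header_len:]
```

The receiver then applies step 105 to the returned body using the returned dictionary.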
To describe the technical solution of the present invention more clearly, the invention is described below with reference to the drawings and embodiments.
In concrete applications such as digital book compression, the geometric and logical layout of each page must be expressed, so more than a hundred tags are needed to represent information such as characters, words, lines, paragraphs, line reading direction, alignment, character direction, language, font, font size, chapters, sections, and titles. To replace each tag name with an identification code that is as short as possible, the present invention defines the identification code corresponding to each element and attribute according to the following rules.
Each element in the XML document can be represented by an 8-bit or a 16-bit identification code. Of the high 4 bits: the 1st bit indicates whether the code is in the XML format (1 if so, otherwise 0); the 2nd bit indicates whether it is an element (1 if so, otherwise 0); the 3rd bit indicates whether it is an end element (1 if so, otherwise 0); and the 4th bit indicates whether two 8-bit bytes are needed to represent the same element (1 if so, otherwise 0). The remaining bits identify the element. The element format is shown in Table 1: if the is-two-byte bit is 1, the next byte is also used to represent this element; if the is-two-byte bit is 0, the element is represented by a single byte. A two-byte code leaves 12 bits for the identifier, so 4096 different elements can be represented in total, which meets the needs of most applications.
Table 1
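The element code layout can be sketched in Python as follows. This is an illustration under stated assumptions, not the definitive encoding: the function name encode_element is hypothetical, the way the 12-bit index is split across the two bytes (high 4 bits in the first byte) is assumed, and the third bit is set for end elements as the paragraph above states. Note that the Table 3 summary later in the text assigns the start and end prefixes the other way round, so the polarity of that bit should be treated as an assumption.

```python
XML_FLAG = 0x80       # 1st bit: the code is in the XML format
ELEMENT_FLAG = 0x40   # 2nd bit: the code denotes an element
END_FLAG = 0x20       # 3rd bit: end element (polarity assumed, see lead-in)
TWO_BYTE_FLAG = 0x10  # 4th bit: is-two-byte


def encode_element(index: int, is_end: bool, two_byte: bool) -> bytes:
    """Build a 1-byte (4-bit index) or 2-byte (12-bit index) element code."""
    limit = 1 << 12 if two_byte else 1 << 4
    if not 0 <= index < limit:
        raise ValueError(f"index {index} does not fit in this code width")
    head = XML_FLAG | ELEMENT_FLAG
    if is_end:
        head |= END_FLAG
    if two_byte:
        # Assumed split: top 4 index bits in the first byte, low 8 in the second.
        head |= TWO_BYTE_FLAG | (index >> 8)
        return bytes([head, index & 0xFF])
    return bytes([head | index])
```

With this layout a one-byte code can address 16 elements and a two-byte code 4096, matching the counts given in the text.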
As shown in Table 2, each attribute in the XML document can be represented by an 8-bit or a 16-bit identification code. Of the high 3 bits: the 1st bit indicates whether the code is in the XML format (1 if so, otherwise 0); the 2nd bit indicates whether it is an attribute (0 if so, otherwise 1); and the 3rd bit indicates whether two 8-bit bytes are needed to represent the same attribute (1 if so, otherwise 0). The remaining bits identify the attribute. The value of the attribute is represented as a character string terminated by a specific character. In an embodiment of the present invention, 0x00 is used as the terminator of the attribute value string, so that the individual strings can be told apart. If the is-two-byte bit is 1, the next byte is also used to represent this attribute; if the is-two-byte bit is 0, the attribute is represented by a single byte. A two-byte code leaves 13 bits for the identifier, so 8192 different attributes can be represented in total, which meets the needs of most applications.
Table 2
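A corresponding sketch for attributes, again in Python and under assumptions: the function name encode_attribute is hypothetical, and the split of the 13-bit index across the two bytes (high 5 bits in the first byte) is assumed. The 0x00 terminator after the attribute value follows the embodiment described above.

```python
XML_FLAG = 0x80       # 1st bit: the code is in the XML format
TWO_BYTE_FLAG = 0x20  # 3rd bit: is-two-byte (the 2nd bit stays 0 for attributes)


def encode_attribute(index: int, value: str, two_byte: bool) -> bytes:
    """Build an attribute code followed by its 0x00-terminated value string."""
    limit = 1 << 13 if two_byte else 1 << 5
    if not 0 <= index < limit:
        raise ValueError(f"index {index} does not fit in this code width")
    if two_byte:
        # Assumed split: top 5 index bits in the first byte, low 8 in the second.
        code = bytes([XML_FLAG | TWO_BYTE_FLAG | (index >> 8), index & 0xFF])
    else:
        code = bytes([XML_FLAG | index])
    data = value.encode("utf-8")
    if b"\x00" in data:
        raise ValueError("attribute value must not contain the terminator 0x00")
    return code + data + b"\x00"
```

The first byte of a one-byte code falls in the range 0x80 to 0x9F and that of a two-byte code in 0xA0 to 0xBF.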
Text in the XML document is represented as a character string, with a specific character marking the beginning and the end of the text string. Specifically, the text is encoded in UTF-8, and the resulting string begins with 0x00 and ends with 0x00.
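The text framing just described can be sketched in a few lines of Python. The function names frame_text and unframe_text are hypothetical. UTF-8 only produces a 0x00 byte for the character U+0000, which XML does not allow in documents, so the delimiter cannot collide with ordinary text; the sketch still checks for it.

```python
def frame_text(text: str) -> bytes:
    """Encode text as UTF-8 and wrap it in 0x00 start and end markers."""
    data = text.encode("utf-8")
    if b"\x00" in data:
        raise ValueError("text must not contain the delimiter 0x00")
    return b"\x00" + data + b"\x00"


def unframe_text(stream: bytes, pos: int) -> tuple:
    """Read one framed text string at pos; return (text, next position)."""
    if stream[pos] != 0x00:
        raise ValueError("no text string starts at this position")
    end = stream.index(b"\x00", pos + 1)  # locate the closing marker
    return stream[pos + 1:end].decode("utf-8"), end + 1
```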
From Table 1 and Table 2, the conclusions shown in Table 3 follow. When an element is represented with two bytes, the high-order byte has the hexadecimal form F* or D*, where F* denotes a start element and D* denotes an end element. When an element is represented with one byte, the byte has the hexadecimal form E* or C*, where E* denotes a start element and C* denotes an end element. When an attribute is represented with two bytes, the high-order byte has the hexadecimal form A* or B*. When an attribute is represented with one byte, the byte has the hexadecimal form 9* or 8*.
Table 3
For the value of an element in the XML file, the text is converted into a character string using UTF-8 encoding and inserted into the output character stream with 0x00 marking the beginning and the end of the text string. For the value of an attribute in the XML file, the value is converted into a character string using UTF-8 encoding and inserted into the output character stream with 0x00 marking the end of the attribute value string. In this way, in step 105, the receiver can determine by reading the character stream which of the following the current character belongs to: a text string, an attribute string, a two-byte start element, a two-byte end element, a one-byte start element, a one-byte end element, a two-byte attribute, or a one-byte attribute. It can then look up the corresponding identification code in the data dictionary and perform the replacement decompression, restoring the original XML file.
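The discrimination performed in step 105 can be sketched as a check on the first byte of each token in the stream. This Python example is an illustration only: the function name classify is hypothetical, and it reports the kind and width of a code without decoding the start/end bit of elements. It applies to the byte at a token boundary; the bytes of an attribute value that follow an attribute code are read up to the 0x00 terminator rather than classified.

```python
def classify(first_byte: int) -> str:
    """Classify the token that begins with first_byte."""
    if first_byte == 0x00:
        return "text string"          # text is framed by 0x00 ... 0x00
    if not first_byte & 0x80:
        raise ValueError("not an identification code")
    if first_byte & 0x40:             # 2nd bit set: element
        width = "two-byte" if first_byte & 0x10 else "one-byte"
        return f"{width} element"
    # 2nd bit clear: attribute, whose is-two-byte flag is the 3rd bit
    width = "two-byte" if first_byte & 0x20 else "one-byte"
    return f"{width} attribute"
```

The result agrees with the hexadecimal prefixes listed for Table 3: F* and D* are two-byte elements, E* and C* one-byte elements, A* and B* two-byte attributes, 9* and 8* one-byte attributes.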
For each element in the XML document, a usage-frequency analysis method can be used to decide whether the element is represented by an 8-bit or a 16-bit identification code. The implementation uses a two-pass scan: the first pass counts the number of times each element is used; the 16 most frequently used elements are represented by 8-bit identification codes, and all other elements are represented by 16-bit identification codes. The usage counts obtained for the elements with this method in the embodiment are shown in Table 4.
Table 4
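The counting pass and the resulting split can be sketched in Python with the standard library. The function name split_by_frequency and the short_slots parameter are hypothetical; the default of 16 follows the text. Ties are broken by first appearance in the document, which is a property of Counter.most_common and not something the description specifies.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def split_by_frequency(xml_text: str, short_slots: int = 16) -> tuple:
    """First pass: count element usage, then split names into 8-bit and 16-bit groups."""
    counts = Counter(el.tag for el in ET.fromstring(xml_text).iter())
    ranked = [name for name, _ in counts.most_common()]
    return ranked[:short_slots], ranked[short_slots:]
```

For example, in a document where w occurs four times, p twice and b once, calling the function with short_slots=2 gives w and p the short codes and leaves b with a long code.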
Besides deciding by usage frequency whether an element is represented by an 8-bit or a 16-bit identification code, the number of bytes each element consumes can be calculated instead; that is, a byte-cost analysis method is used to make the decision, where the number of bytes consumed is calculated by formula (1).
Bytes consumed = number of occurrences of the element × byte length of the element    (1)
All elements in the XML document are sorted by the number of bytes they consume. The 16 largest values correspond to the 16 elements that consume the most bytes in the original XML document; these elements are represented by 8-bit identification codes, and the remaining elements are represented by 16-bit identification codes. The number of bytes consumed by each element, as obtained with this method in the embodiment, is shown in Table 5.
Table 5
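Formula (1) and the selection that follows it can be sketched in Python. The function name split_by_byte_cost is hypothetical, and taking the byte length of an element to be the UTF-8 length of its name is an assumption; the counts in the usage note are invented for illustration and are not the values of Table 5.

```python
def split_by_byte_cost(counts: dict, short_slots: int = 16) -> tuple:
    """Rank names by occurrences x name length and split into 8-bit and 16-bit groups."""
    # Formula (1); the byte length is assumed to be the UTF-8 length of the name.
    cost = {name: n * len(name.encode("utf-8")) for name, n in counts.items()}
    ranked = sorted(cost, key=lambda name: cost[name], reverse=True)
    return ranked[:short_slots], ranked[short_slots:]
```

With counts of 100 for w, 20 for paragraph and 5 for title, the products are 100, 180 and 25, so paragraph ranks first even though w is used more often. This is the case where the two analysis methods give different results.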
For each attribute in the XML document, a usage-frequency analysis method can be used to decide whether the attribute is represented by an 8-bit or a 16-bit identification code. The implementation uses a two-pass scan: the first pass counts the number of times each attribute is used; the 16 most frequently used attributes are represented by 8-bit identification codes, and all other attributes are represented by 16-bit identification codes. No separate example is given here.
Besides deciding by usage frequency whether an attribute is represented by an 8-bit or a 16-bit identification code, the number of bytes each attribute consumes can be calculated instead; that is, a byte-cost analysis method is used to make the decision, where the number of bytes consumed is calculated by formula (2).
Bytes consumed = number of occurrences of the attribute × byte length of the attribute    (2)
All attributes in the XML document are sorted by the number of bytes they consume. The 32 largest values correspond to the 32 attributes that consume the most bytes in the original XML document; these attributes are represented by 8-bit identification codes, and the remaining attributes are represented by 16-bit identification codes. No separate example is given here.
The present invention also provides an XML compression device. As shown in Fig. 2, the device comprises: an XML reading module 201, a compression data dictionary storage module 202, a tag replacement compression module 203, and a general compression module 204; wherein:
The XML reading module 201 is configured to read the XML byte stream data;
The compression data dictionary storage module 202 is configured to store the data dictionary;
In the data dictionary, a corresponding variable-length identification code is defined for each element in the XML document, and a corresponding variable-length identification code is defined for each attribute in the XML document;
In the data dictionary, for each element in the XML document, a usage-frequency analysis method is used to decide whether the element is represented by an 8-bit or a 16-bit identification code.
In the data dictionary, for each element in the XML document, a byte-cost analysis method is used to decide whether the element is represented by an 8-bit or a 16-bit identification code.
In the data dictionary, for each attribute in the XML document, a usage-frequency analysis method is used to decide whether the attribute is represented by an 8-bit or a 16-bit identification code.
In the data dictionary, for each attribute in the XML document, a byte-cost analysis method is used to decide whether the attribute is represented by an 8-bit or a 16-bit identification code.
The tag replacement compression module 203 is configured to replace, one by one, the elements and attributes in the XML document with the corresponding variable-length identification codes defined in the data dictionary held by the compression data dictionary storage module, generating the replacement-compressed XML document;
The general compression module 204 is configured to further compress the data dictionary and the replacement-compressed XML document with a general-purpose compression algorithm such as Deflate or LZW, generating the compressed data.
The tag replacement compression module 203 represents the text in the XML document as character strings, with a specific character marking the beginning and the end of each text string.
The present invention also provides an XML decompression device. As shown in Fig. 3, the device comprises: a general decompression module 301, a decompression data dictionary storage module 302, and a tag replacement decompression module 303, wherein:
The general decompression module 301 is configured to decompress the received compressed data with a general-purpose decompression algorithm such as Deflate or LZW;
The decompression data dictionary storage module 302 is configured to store the data dictionary;
In the data dictionary, a corresponding variable-length identification code is defined for each element in the XML document, and a corresponding variable-length identification code is defined for each attribute in the XML document;
In the data dictionary, for each element in the XML document, a usage-frequency analysis method is used to decide whether the element is represented by an 8-bit or a 16-bit identification code.
In the data dictionary, for each element in the XML document, a byte-cost analysis method is used to decide whether the element is represented by an 8-bit or a 16-bit identification code.
In the data dictionary, for each attribute in the XML document, a usage-frequency analysis method is used to decide whether the attribute is represented by an 8-bit or a 16-bit identification code.
In the data dictionary, for each attribute in the XML document, a byte-cost analysis method is used to decide whether the attribute is represented by an 8-bit or a 16-bit identification code.
The tag replacement decompression module 303 is configured to use the data dictionary held by the decompression data dictionary storage module to replace, one by one, the variable-length identification codes in the replacement-compressed XML document with the corresponding elements and attributes, obtaining the original XML document without any loss of information.
In summary, XML documents in the prior art generate heavy data traffic during transmission and contain many kinds of elements and attributes. The technical solution provided by the invention uses a data dictionary based on variable-length identification codes to replace the elements and attributes in the XML document to be transmitted, achieving replacement compression of the XML document. The sender transmits the replacement-compressed XML document to the receiver. After receiving it, the receiver uses the same data dictionary to replace each variable-length identification code in the received document with the corresponding element or attribute, thereby decompressing the XML document and obtaining the original XML document. This solution is not constrained by the content of the XML document being compressed, and the compression and decompression algorithms are simple and place little demand on the CPU. It therefore addresses well the problem that XML documents are large, and contain many kinds of elements and attributes, during data transmission.
The above XML compression method based on variable-length identification codes is also applicable to compressing multiple XML documents that share the same DTD (Document Type Definition) or the same XML Schema. By collecting the usage frequency, or the number of bytes consumed, of the elements and attributes across the multiple XML documents, one common data dictionary is defined for all of them, and the documents share this common data dictionary for replacement compression and replacement decompression of their elements and attributes. As shown in Fig. 4, the advantage of compressing and decompressing multiple XML documents with a common data dictionary is as follows. When N different XML documents are compressed and N is large, defining a separate data dictionary for each document consumes a large amount of storage space; if the documents share one common data dictionary, the storage space needed for data dictionaries is saved. The sender only needs to keep this shared common data dictionary and uses the variable-length identification codes defined in it to replace, one by one, the elements and attributes in each XML document, achieving replacement compression of the multiple XML documents. The receiver likewise only needs to keep this shared common data dictionary and uses it to replace each variable-length identification code in the received XML documents with the corresponding element or attribute, achieving replacement decompression of the multiple XML documents.
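Building one common dictionary from several documents can be sketched in Python by pooling the usage counts before ranking. The function name build_shared_dictionary is hypothetical, the sketch covers elements only, and the returned form (name mapped to an index and an is-two-byte flag) is an assumed in-memory representation, not a format given in the description.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def build_shared_dictionary(documents: list, short_slots: int = 16) -> dict:
    """Pool element counts over all documents and assign one code index per name."""
    totals = Counter()
    for doc in documents:
        totals.update(el.tag for el in ET.fromstring(doc).iter())
    ranked = [name for name, _ in totals.most_common()]
    # The most used names get one-byte indices; the rest get two-byte indices.
    return {
        name: (i, False) if i < short_slots else (i - short_slots, True)
        for i, name in enumerate(ranked)
    }
```

Sender and receiver would each hold the single dictionary returned here instead of one dictionary per document.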
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.