CN104156990A - Lossless compressed encoding method and system supporting oversize data window

Info

Publication number: CN104156990A
Application number: CN201410317732.3A
Authority: CN
Inventors: 覃健诚; 钟宇; 陆以勤
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2014-07-03
Filing date: 2014-07-03
Publication date: 2014-11-19
Anticipated expiration: 2034-07-03
Also published as: CN104156990B

Abstract

The invention discloses a lossless compression encoding method supporting an oversize data window. The method comprises the following steps: performing string matching on uncompressed data with a string matching encoder to generate multi-level length codes and multi-level index codes, which include single-character codes and control instruction codes; dividing the multi-level length codes and multi-level index codes into code segments of 8 binary bits or another fixed length with a segment cutter; extracting code category information from the multi-level length codes and multi-level index codes; collecting dynamic statistics on the fixed-length code segments in classification statistics tables and deriving a probability prediction table from those statistics; compressing the fixed-length code segments with an entropy encoder, according to the probability prediction table and a compression coding algorithm based on probability statistics; and outputting binary compressed data. The method and system can support oversize data windows from the GB level to the EB level, with the advantage that the compression ratio is not reduced when the index code is lengthened.

Description

Lossless compression encoding method and system supporting an oversize data window

Technical field

The present invention relates to the field of information coding for lossless data compression, and in particular to a lossless compression encoding method and system supporting an oversize data window.

Background art

With the rapid development of cloud computing, data volume is growing at an astonishing rate. Big data is becoming increasingly important as a trend in the information industry. This raises a problem: how can big data at the TB level, the PB level, or beyond be processed efficiently? Such data must be stored and transmitted in a network environment, which challenges storage space, network bandwidth, and computing resources alike.

Data compression is a sensible way to save data storage and transmission costs, but traditional compression and encryption technologies appear inadequate when faced with big data. For example, the software WinRAR has only a small 4MB data window, which limits the compression ratio, while its compression speed is fast enough.

A larger data window can improve the compression ratio. However, enlarging the data window is difficult, because the index length grows and that in turn reduces the compression ratio.

Lossless compression is one class of data compression technology; its defining feature is that decompression restores data identical to the original. Software such as WinZip, WinRAR, and 7-zip uses lossless compression. The other class of data compression technology is lossy compression, which is usually applied to multimedia data such as sound, pictures, and video; the decompressed data differs from the original data, but the difference is not obvious to human perception. For example, JPG pictures and DVD video use lossy compression. Every lossy compression coding method needs a lossless compression encoding component at the end of its compression pipeline to complete the compression, so this patent applies equally to lossy compression.

The entropy encoder is an important component of lossless compression technology. Its principle is to determine the code length of each character from the probability with which the character occurs: characters with high probability receive short codes and characters with low probability receive long codes, so that the encoded output is as short as possible and the data is compressed. Common algorithms used by entropy encoders include arithmetic coding, range coding, and Huffman coding; for example, WinZip uses Huffman coding and 7-zip uses range coding. This patent applies equally when other entropy coding algorithms are used.
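The relationship between character probability and code length described above can be illustrated with a short sketch. It is not part of the patent text: the function name `ideal_code_lengths` is hypothetical, and it computes only the theoretical code length of -log2(p) bits per character, which a real arithmetic, range, or Huffman coder approximates.

```python
import math
from collections import Counter


def ideal_code_lengths(data: bytes) -> dict[int, float]:
    """Return the ideal code length in bits, -log2(p), for each byte value in data.

    Illustrative helper (not from the patent): p is the relative frequency
    of the byte value within `data`.
    """
    counts = Counter(data)
    total = len(data)
    return {symbol: -math.log2(n / total) for symbol, n in counts.items()}


# 'a' occurs 7 times out of 8, 'b' once out of 8.
lengths = ideal_code_lengths(b"aaaaaaab")
print(lengths[ord("a")], lengths[ord("b")])
```

In this input the frequent character gets a code of well under one bit while the rare character gets three bits, which is the behaviour the paragraph attributes to an entropy encoder.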

In terms of theoretical classification, current mathematical models and methods for lossless compression fall into the following 3 types:

1) Compression based on probability statistics, such as Huffman coding and arithmetic coding. Within this type, the PPM (Prediction by Partial Matching) algorithm, based on a Markov chain model, achieves a good compression ratio.

2) Compression based on dictionary indexing, such as the LZ77/LZSS algorithms and the LZ78/LZW algorithms. The LZ family of compression models has an advantage in speed.

3) Compression based on symbol order and repetition, such as run-length coding and BWT (Burrows-Wheeler transform) coding.

Currently popular compression software applies the above basic compression theories in combination. Each program usually integrates several compression models and methods to achieve a better result. The characteristics of some popular compression software are listed below:

1) Software name: WinZip

Compression format: Deflate;

Basic algorithm: LZSS & Huffman coding;

Maximum data window size: 512KB;

Weaknesses: small data window; low compression ratio; weak support for big data.

2) Software name: WinRAR

Compression format: RAR;

Basic algorithm: LZSS & Huffman coding;

Maximum data window size: 4MB;

3) Software name: Bzip2

Compression format: BZ2;

Basic algorithm: BWT & Huffman coding;

Maximum data window (data block) size: 900KB;

Weaknesses: small BWT data block; low compression ratio; weak support for big data.

4) Software name: 7-zip

Compression format: 7z;

Basic algorithm: LZSS & arithmetic coding (range coding is essentially the same as arithmetic coding);

Maximum data window size: 4GB;

Weaknesses: relatively small data window; limited support for big data.

There are also other compression programs, such as PAQ and WinUDA. They may achieve a higher compression ratio, but they are slower and are not suitable for compressing big data.

In summary, existing lossless data compression technologies are either slow, and thus unsuitable for compressing big data at the GB level, the TB level, or beyond, or have small data windows, which leads to a low compression ratio. Simply enlarging the data window does not effectively improve the compression ratio, because a large data window requires a longer index, and the longer index reduces the compression ratio, so the loss outweighs the gain unless an effective compression encoding method and compressed format can be found.
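The cost of a longer index can be made concrete with a rough calculation. The sketch below is only an illustration under simplifying assumptions that are not in the patent: the index is stored at a fixed width, the length code and any entropy coding are ignored, and an unmatched character costs 8 bits. The function names are hypothetical.

```python
def fixed_index_bits(window_bytes: int) -> int:
    """Bits needed by a fixed-width index that can address every offset in the window."""
    return (window_bytes - 1).bit_length()


def min_profitable_match(window_bytes: int) -> int:
    """Shortest match (in bytes) whose 8 bits per byte exceed the index cost.

    Assumption: the match token costs only the index bits; the length code
    and entropy coding are deliberately ignored.
    """
    return fixed_index_bits(window_bytes) // 8 + 1


for name, size in [("4MB", 4 * 2**20), ("4GB", 2**32), ("16EB", 2**64)]:
    print(name, fixed_index_bits(size), min_profitable_match(size))
```

Under these assumptions a 4MB window needs 22 index bits, so matches of 3 bytes already pay off, whereas a 16EB window needs 64 index bits and only matches of 9 bytes or more save space. This is the effect that a level-dependent index code is intended to avoid.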

Summary of the invention

The object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a lossless compression encoding method supporting an oversize data window.

Another object of the present invention is to provide a lossless compression encoding system supporting an oversize data window.

The object of the present invention is achieved through the following technical solution:

A lossless compression encoding method supporting an oversize data window, comprising the following steps in order:

S1. The string matching encoder performs string matching on the uncompressed data and generates multi-level length codes and multi-level index codes, which include single-character codes and control instruction codes;

S2. The segment cutter divides the multi-level length codes and multi-level index codes into code segments of 8 binary bits or another fixed length and provides them to the classification statistics table and the entropy encoder; at the same time, it extracts code category information from the multi-level length codes and multi-level index codes and provides it to the classification statistics table;

S3. According to the code category information, the classification statistics table first places the fixed-length code segments into different statistics tables for dynamic statistics, and then provides a probability prediction table to the entropy encoder according to the values in the statistics tables;

S4. According to the probability prediction table, the entropy encoder applies a compression coding algorithm based on probability statistics to the fixed-length code segments and outputs binary compressed data.
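The segment cutting in step S2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `cut_segments` is hypothetical, the most-significant-first order is an assumption, and codes whose bit length is not a multiple of the segment size are simply rejected because the text does not say how they are padded.

```python
def cut_segments(code: int, bit_length: int, segment_bits: int = 8) -> list[int]:
    """Split a code of `bit_length` bits into fixed-length segments.

    Assumptions (not stated in the source): segments are emitted most
    significant first, and bit_length must be a multiple of segment_bits.
    """
    if bit_length % segment_bits != 0:
        raise ValueError("bit_length must be a multiple of segment_bits")
    count = bit_length // segment_bits
    mask = (1 << segment_bits) - 1
    return [(code >> (segment_bits * (count - 1 - i))) & mask for i in range(count)]


# A 24-bit code is cut into three 8-bit segments.
segments = cut_segments(0x123456, 24)
print([hex(s) for s in segments])
```

Each resulting segment has the same small alphabet (256 values for 8-bit segments), which is what allows the statistics tables in step S3 to stay the same size regardless of how long the original code was.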

In step S1, the multi-level length codes are divided into levels according to the length of the string to be encoded, and follow a specific length code format.

Dividing into levels according to the length of the string to be encoded specifically comprises: using length codes of at least 3 levels, where all codes of the same level have the same code length; a lower-level length code is shorter than a higher-level length code; shorter strings use lower-level length codes and longer strings use higher-level length codes; the exact division into levels depends on the specific length code format;

The multi-level length code format specifically comprises: the L0 length code, representing a single character; the L1 length code, representing string lengths 2 to 11; the L2 length code, representing string lengths 12 to 75; the optional L3 length code, representing string lengths 76 to 325; and the optional L4 length code, which comprises the L4 linear length code and the L4 exponential length code, can support string lengths up to 16EB, and follows a specific format that combines linear length coding with exponential length coding.
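The level boundaries listed above can be expressed as a small lookup. This is a sketch only: the function name `length_level` is hypothetical, the bit layout of each level is not modelled, and 16EB is taken to mean 2^64 bytes.

```python
MAX_LENGTH = 2**64  # 16EB, the largest string length the L4 code is said to support


def length_level(length: int) -> int:
    """Return the length code level (0 to 4) for a string of `length` bytes."""
    if length < 1 or length > MAX_LENGTH:
        raise ValueError("length out of the supported range")
    if length == 1:
        return 0  # L0: single character
    if length <= 11:
        return 1  # L1: lengths 2 to 11
    if length <= 75:
        return 2  # L2: lengths 12 to 75
    if length <= 325:
        return 3  # L3: lengths 76 to 325
    return 4      # L4: linear or exponential length code


print([length_level(n) for n in (1, 2, 11, 12, 75, 76, 325, 326)])
```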

In step S1, the multi-level index codes are divided into levels according to the distance between the string to be encoded and the matched string, and follow a specific index code format.

Dividing into levels according to the distance between the string to be encoded and the matched string specifically comprises: using index codes of at least 3 levels, where all codes of the same level have the same code length; a lower-level index code is shorter than a higher-level index code; the index value equals the distance between the string to be encoded and the matched string; smaller index values use lower-level index codes and larger index values use higher-level index codes; the exact division into levels depends on the specific index code format;

The multi-level index code format specifically comprises: the L1 index code, representing index values X1 to X1+Y1-1, where X1 = 1 and Y1 = 2^8; the L2 index code, representing index values X2 to X2+Y2-1, where X2 = X1+Y1 and Y2 = 2^16; the L3 index code, representing index values X3 to X3+Y3-1, where X3 = X2+Y2 and Y3 = 2^24; the optional L4 index code, representing index values X4 to X4+Y4-1, where X4 = X3+Y3 and Y4 = 2^32; the optional L5 index code, representing index values X5 to X5+Y5-1, where X5 = X4+Y4 and Y5 = 2^40; the optional L6 index code, representing index values X6 to X6+Y6-1, where X6 = X5+Y5 and Y6 = 2^48; the optional L7 index code, representing index values X7 to X7+Y7-1, where X7 = X6+Y6 and Y7 = 2^56; and the optional L8 index code, representing index values X8 to X8+Y8-1, where X8 = X7+Y7 and Y8 = 2^64. The data window sizes supported by the L1, L2, L3, L4, L5, L6, L7, and L8 index codes are 256B, 64KB, 16MB, 4GB, 1TB, 256TB, 64PB, and 16EB respectively.
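The recurrence X1 = 1, Yk = 2^(8k), X(k+1) = Xk + Yk can be evaluated directly. The sketch below uses hypothetical function names; returning the offset within the level (index value minus Xk) is an assumption about what would be stored in the 8k-bit code, since the text only states which index values each level represents.

```python
def index_level_ranges(max_level: int = 8) -> list[tuple[int, int, int]]:
    """Return (level, first index value, last index value) for levels 1..max_level."""
    ranges = []
    first = 1  # X1 = 1
    for level in range(1, max_level + 1):
        size = 1 << (8 * level)  # Yk = 2^(8k)
        ranges.append((level, first, first + size - 1))
        first += size  # X(k+1) = Xk + Yk
    return ranges


def index_level(distance: int, max_level: int = 8) -> tuple[int, int]:
    """Return (level, offset within that level) for a match distance."""
    if distance < 1:
        raise ValueError("distance must be at least 1")
    for level, first, last in index_level_ranges(max_level):
        if distance <= last:
            return level, distance - first
    raise ValueError("distance exceeds the largest supported data window")


print(index_level(1), index_level(256), index_level(257), index_level(65793))
```

Because each level starts where the previous one ends, a short distance never pays for the bits of a long one: a distance of 256 still fits in the 8-bit L1 code, and 257 is the first value that needs the 16-bit L2 code.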

In step S2, a category in the classification statistics table comprises the codes of more than one level.

In step S3, the statistics tables are designed according to an order-0, order-1, or order-2 PPM probability statistics model.

In step S4, the compression coding algorithm based on probability statistics is one of arithmetic coding, range coding, or Huffman coding.

The other object of the present invention is achieved through the following technical solution:

A lossless compression encoding system supporting an oversize data window, comprising a string matching encoder, a segment cutter, a classification statistics table, and an entropy encoder connected in sequence, wherein

the string matching encoder performs string matching on the uncompressed data, generates multi-level length codes and multi-level index codes, which include single-character codes and control instruction codes, and provides them to the segment cutter;

the segment cutter divides the multi-level length codes and multi-level index codes into code segments of 8 binary bits or another fixed length and provides them to the classification statistics table and the entropy encoder; at the same time, it extracts code category information from the multi-level length codes and multi-level index codes and provides it to the classification statistics table; a category may be the codes of a single level or the codes of more than one level;

the classification statistics table places the fixed-length code segments into different statistics tables for dynamic statistics according to the code category information; each statistics table may be designed according to an order-0, order-1, or order-2 PPM probability statistics model; at the same time, it provides a probability prediction table to the entropy encoder according to the values in the statistics tables;

the entropy encoder, according to the probability prediction table, applies arithmetic coding, range coding, Huffman coding, or another compression coding algorithm based on probability statistics to the fixed-length code segments and outputs binary compressed data.
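A classification statistics table of the kind described above might be sketched as one adaptive count table per category. This is an illustrative order-0 model only: the class name `ClassifiedStatistics`, the category keys, and the choice to start every count at 1 (so that no segment value has zero probability) are assumptions rather than details from the patent.

```python
from collections import defaultdict


class ClassifiedStatistics:
    """One order-0 adaptive count table per code category (illustrative sketch)."""

    def __init__(self, alphabet_size: int = 256):
        self.alphabet_size = alphabet_size
        # Counts start at 1 (an assumption) so every segment value stays encodable.
        self.tables = defaultdict(lambda: [1] * alphabet_size)

    def predict(self, category) -> list[float]:
        """Return the probability prediction table for one category."""
        counts = self.tables[category]
        total = sum(counts)
        return [c / total for c in counts]

    def update(self, category, segment: int) -> None:
        """Record one observed code segment in the table of its category."""
        self.tables[category][segment] += 1


stats = ClassifiedStatistics()
for _ in range(3):
    stats.update(("index", 1), 0x41)
index_prediction = stats.predict(("index", 1))
length_prediction = stats.predict(("length", 1))
print(index_prediction[0x41], length_prediction[0x41])
```

Keeping the tables separate means that segments of, say, L1 index codes do not distort the statistics of L2 length codes; each category adapts to its own distribution before the entropy encoder consumes the prediction.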

Compared with the prior art, the present invention has the following advantages and beneficial effects:

With the method of the present invention, index codes can support oversize data windows reaching the PB and EB levels, far larger than the MB-level and GB-level data windows of the currently common compression software WinRAR and 7-zip, and the compression ratio is not reduced by the larger data window or the longer index; linear length coding and exponential length coding are both supported, so string lengths up to 16EB can be encoded; these index and length coding advantages allow the present invention to achieve a higher compression ratio than currently common compression software when compressing massive data at the GB level, the TB level, or above.

Brief description of the drawings

Fig. 1 is a flowchart of the lossless compression encoding method supporting a large data window according to the present invention;

Fig. 2 is a structural diagram of the lossless compression encoding system supporting a large data window according to the present invention;

Fig. 3 is a schematic diagram of the multi-level length code and multi-level index code formats output by the string matching encoder of the system shown in Fig. 2.

Embodiment

The present invention is described in further detail below with reference to an embodiment and the accompanying drawings, but the embodiments of the present invention are not limited to this.

The multi-level length codes are divided into levels according to the length of the string to be encoded, and follow a specific length code format.

Dividing into levels according to the length of the string to be encoded specifically comprises: using length codes of at least 3 levels, where all codes of the same level have the same code length; a lower-level length code is shorter than a higher-level length code; shorter strings use lower-level length codes and longer strings use higher-level length codes; the exact division into levels depends on the specific length code format.

The multi-level length code format specifically comprises: the L0 length code, representing a single character; the L1 length code, representing string lengths 2 to 11; the L2 length code, representing string lengths 12 to 75; the optional L3 length code, representing string lengths 76 to 325; and the optional L4 length code, which comprises the L4 linear length code and the L4 exponential length code, can support string lengths up to 16EB, and follows a specific format that combines linear length coding with exponential length coding;

Dividing into levels according to the distance between the string to be encoded and the matched string specifically comprises: using index codes of at least 3 levels, where all codes of the same level have the same code length; a lower-level index code is shorter than a higher-level index code; the index value equals the distance between the string to be encoded and the matched string; smaller index values use lower-level index codes and larger index values use higher-level index codes; the exact division into levels depends on the specific index code format.

The multi-level index code format specifically comprises: the L1 index code, representing index values X1 to X1+Y1-1, where X1 = 1 and Y1 = 2^8; the L2 index code, representing index values X2 to X2+Y2-1, where X2 = X1+Y1 and Y2 = 2^16; the L3 index code, representing index values X3 to X3+Y3-1, where X3 = X2+Y2 and Y3 = 2^24; the optional L4 index code, representing index values X4 to X4+Y4-1, where X4 = X3+Y3 and Y4 = 2^32; the optional L5 index code, representing index values X5 to X5+Y5-1, where X5 = X4+Y4 and Y5 = 2^40; the optional L6 index code, representing index values X6 to X6+Y6-1, where X6 = X5+Y5 and Y6 = 2^48; the optional L7 index code, representing index values X7 to X7+Y7-1, where X7 = X6+Y6 and Y7 = 2^56; and the optional L8 index code, representing index values X8 to X8+Y8-1, where X8 = X7+Y7 and Y8 = 2^64. The data window sizes supported by the L1, L2, L3, L4, L5, L6, L7, and L8 index codes are 256B, 64KB, 16MB, 4GB, 1TB, 256TB, 64PB, and 16EB respectively;

A category in the classification statistics table comprises the codes of more than one level;

The statistics tables are designed according to an order-0, order-1, or order-2 PPM probability statistics model;

S4. According to the probability prediction table, the entropy encoder applies a compression coding algorithm based on probability statistics to the fixed-length code segments and outputs binary compressed data;

The compression coding algorithm based on probability statistics is one of arithmetic coding, range coding, or Huffman coding.

The method is described further below:

As shown in Fig. 1, a lossless compression encoding method supporting an oversize data window comprises the following steps in order:

S301: initialize the index segment statistics tables tabI[i], where i ranges from 1 to 8, giving 8 groups of index segment statistics tables;

S302: initialize the length segment statistics tables tabL[i], where i ranges from 0 to 4, giving 5 groups of length segment statistics tables;

S303: determine whether no uncompressed data remains; if so, the procedure ends; otherwise continue to step S304;

S304: determine whether a special function is to be executed; if so, go to step S305; otherwise go to step S306;

S305: set variable C to the control class, I to the control instruction code, and L to the control operand, then go to step S311;

S306: read new uncompressed data;

S307: perform string matching within the data window;

S308: determine whether the string matching succeeded; if so, go to step S309; otherwise go to step S310;

S309: set variable C to the dictionary class, I to the index value of the match, and L to the string length of the match, then go to step S311;

S310: set variable C to the single-character class and I to the ASCII code of the character, then go to step S311;

S311: merge the three classes of codes represented by C into a unified code, forming the multi-level index code from I and the multi-level length code from L;

S312: set variable Idx to the multi-level index code, S1 to the index level, Len to the multi-level length code, and S2 to the length level;

S313: cut the index code Idx into fixed-length code segments Idx1, Idx2, Idx3, and so on; the number of code segments depends on the number of bits in the index code Idx;

S314: cut the length code Len into fixed-length code segments Len1, Len2, Len3, and so on; the number of code segments depends on the number of bits in the length code Len;

S315: entropy-encode the code segments Len1, Len2, Len3, and so on using the order-0, order-1, or order-2 PPM probability statistics table tabL[S2];

S316: entropy-encode the code segments Idx1, Idx2, Idx3, and so on using the order-0, order-1, or order-2 PPM probability statistics table tabI[S1];

S317: output the compressed data produced by the entropy coding;

S318: update the statistics table tabI[S1] according to the code segments Idx1, Idx2, Idx3, and so on, update the statistics table tabL[S2] according to the code segments Len1, Len2, Len3, and so on, and jump back to step S303.
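The matching branch of this loop (steps S303 and S306 to S310) can be sketched as a tokenizer that emits (C, I, L) triples. It is a simplified illustration, not the patented encoder: the search is a naive scan over all earlier data, the control class (S304, S305) is omitted, single characters are given L = 1 as an assumed stand-in for the L0 length, and level coding, segment cutting, and entropy coding (S311 to S318) are left out. The helper `detokenize` exists only to check that the tokens are lossless.

```python
def find_match(data: bytes, pos: int, min_len: int = 2):
    """Naive longest-match search in everything before pos (the data window)."""
    best_len, best_dist = 0, 0
    for start in range(pos):
        n = 0
        while pos + n < len(data) and data[start + n] == data[pos + n]:
            n += 1
        if n > best_len:
            best_len, best_dist = n, pos - start
    return (best_dist, best_len) if best_len >= min_len else None


def tokenize(data: bytes):
    """Yield (C, I, L) triples: dictionary matches or single characters."""
    pos = 0
    while pos < len(data):               # S303: stop when no data remains
        match = find_match(data, pos)    # S306 to S308
        if match:
            distance, length = match
            yield ("dict", distance, length)   # S309: I = index value, L = length
            pos += length
        else:
            yield ("char", data[pos], 1)       # S310: I = character code
            pos += 1


def detokenize(tokens) -> bytes:
    """Rebuild the original bytes from the triples (round-trip check only)."""
    out = bytearray()
    for cls, i, length in tokens:
        if cls == "char":
            out.append(i)
        else:
            for _ in range(length):
                out.append(out[-i])  # copy from `i` bytes back; overlap is allowed
    return bytes(out)


tokens = list(tokenize(b"abcabcabcabd"))
print(tokens)
```

The quadratic search is only for readability; an encoder working with windows of gigabytes or more would need an indexed match finder, which the text above does not specify.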

As shown in Fig. 2, a lossless compression encoding system supporting an oversize data window comprises a string matching encoder, a segment cutter, a classification statistics table, and an entropy encoder connected in sequence, wherein

the string matching encoder 101 performs string matching on the uncompressed data, generates multi-level length codes and multi-level index codes, which include single-character codes and control instruction codes, and provides them to the segment cutter 102;

the segment cutter 102 divides the multi-level length codes and multi-level index codes into code segments of 8 binary bits or another fixed length and provides them to the classification statistics table 103 and the entropy encoder 104; at the same time, it extracts code category information from the multi-level length codes and multi-level index codes and provides it to the classification statistics table 103; a category may be the codes of a single level or the codes of more than one level;

the classification statistics table 103 places the fixed-length code segments into different statistics tables for dynamic statistics according to the code category information; each statistics table may be designed according to an order-0, order-1, or order-2 PPM probability statistics model; at the same time, it provides a probability prediction table to the entropy encoder 104 according to the values in the statistics tables;

the entropy encoder 104, according to the probability prediction table, applies arithmetic coding, range coding, Huffman coding, or another compression coding algorithm based on probability statistics to the fixed-length code segments and outputs binary compressed data.

As shown in Fig. 3, the multi-level length code and multi-level index code formats output by the string matching encoder comprise:

String matching coding is performed on each section of uncompressed data, and the result is a multi-level length code followed by a multi-level index code; the multi-level length code format starts from the length code branch 201 below, and the multi-level index code format starts from the index code branch 212;

Length code branch 201: depending on the case, a single character is encoded as the L0 length code 202, lengths 2 to 11 are encoded as the L1 length code 203, lengths 12 to 75 are encoded as the L2 length code 204, and all other cases are encoded starting from the L3 length branch 205;

L3 length branch 205: depending on the case, a control word is encoded as the control instruction code 210, lengths 76 to 325 are encoded as the L3 linear length code 206, and lengths of 326 and above are encoded starting from the L4 length branch 207;

L4 length branch 207: depending on the case, lengths 326 to 65535 are encoded as the L4 linear length code 208, and lengths from 2^16 to 2^64 are encoded as the L4 exponential length code 209;

Control instruction code 210: followed by the control operand code 211;

Index code branch 212: depending on the case, a match position within data window 1 (a range of 256 bytes) is encoded as the L1 index code 213; a match position within data window 2 (a range of 64KB) is encoded as the L2 index code 214; a match position within data window 3 (a range of 16MB) is encoded as the L3 index code 215; a match position within data window 4 (a range of 4GB) is encoded as the L4 index code 216; a match position within data window 5 (a range of 1TB) is encoded as the L5 index code 217; a match position within data window 6 (a range of 256TB) is encoded as the L6 index code 218; a match position within data window 7 (a range of 64PB) is encoded as the L7 index code 219; and a match position within data window 8 (a range of 16EB) is encoded as the L8 index code 220.

The embodiment described above is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited to it. Any other change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be regarded as an equivalent replacement and falls within the protection scope of the present invention.

Claims

1. A lossless compression encoding method supporting an oversize data window, characterized in that it comprises the following steps in order:

2. The lossless compression encoding method supporting an oversize data window according to claim 1, characterized in that: in step S1, the multi-level length codes are divided into levels according to the length of the string to be encoded, and follow a specific length code format.

3. The lossless compression encoding method supporting an oversize data window according to claim 2, characterized in that dividing into levels according to the length of the string to be encoded specifically comprises: using length codes of at least 3 levels, where all codes of the same level have the same code length; a lower-level length code is shorter than a higher-level length code; shorter strings use lower-level length codes and longer strings use higher-level length codes; the exact division into levels depends on the specific length code format;

4. The lossless compression encoding method supporting an oversize data window according to claim 1, characterized in that: in step S1, the multi-level index codes are divided into levels according to the distance between the string to be encoded and the matched string, and follow a specific index code format.

5. The lossless compression encoding method supporting an oversize data window according to claim 4, characterized in that dividing into levels according to the distance between the string to be encoded and the matched string specifically comprises: using index codes of at least 3 levels, where all codes of the same level have the same code length; a lower-level index code is shorter than a higher-level index code; the index value equals the distance between the string to be encoded and the matched string; smaller index values use lower-level index codes and larger index values use higher-level index codes; the exact division into levels depends on the specific index code format;

6. The lossless compression encoding method supporting an oversize data window according to claim 1, characterized in that: in step S2, a category in the classification statistics table comprises the codes of more than one level.

7. The lossless compression encoding method supporting an oversize data window according to claim 1, characterized in that: in step S3, the statistics tables are designed according to an order-0, order-1, or order-2 PPM probability statistics model.

8. The lossless compression encoding method supporting an oversize data window according to claim 1, characterized in that: in step S4, the compression coding algorithm based on probability statistics is one of arithmetic coding, range coding, or Huffman coding.

9. A lossless compression encoding system supporting an oversize data window, characterized in that it comprises a string matching encoder, a segment cutter, a classification statistics table, and an entropy encoder connected in sequence, wherein