CN104156990A - Lossless compressed encoding method and system supporting oversize data window - Google Patents

Lossless compressed encoding method and system supporting oversize data window

Info

Publication number
CN104156990A
Authority
CN
China
Prior art keywords
coding
index
length
string
data window
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410317732.3A
Other languages
Chinese (zh)
Other versions
CN104156990B (en)
Inventor
覃健诚
钟宇
陆以勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN201410317732.3A
Publication of CN104156990A
Application granted
Publication of CN104156990B
Current legal status: Active

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a lossless compression encoding method supporting an oversize data window. The method comprises the following steps: a string matching encoder performs string matching on uncompressed data to generate multi-level length codes and multi-level index codes, which include single-character codes and control instruction codes; a segment cutter divides the multi-level length codes and multi-level index codes into code segments of 8 binary bits or another fixed length; code classification information is extracted from the multi-level length codes and multi-level index codes; a classification statistics table keeps dynamic statistics on the fixed-length code segments, and a probability prediction table is calculated from those statistics; an entropy encoder compresses the fixed-length code segments according to the probability prediction table, using a compression coding algorithm based on probability statistics; and binary compressed data is output. The method and system can support oversize data windows from the GB level to the EB level, and have the advantage that the compression ratio is not reduced when the index code is lengthened.

Description

A lossless compression encoding method and system supporting an oversize data window
Technical field
The present invention relates to the field of information coding for lossless data compression, and in particular to a lossless compression encoding method and system supporting an oversize data window.
Background
With the rapid development of cloud computing, data volumes are growing at an astonishing rate. Big data is becoming increasingly important as a trend in the information industry. This raises a problem: how can big data at the TB level, the PB level, or beyond be processed efficiently? Such data must be stored and transmitted in a network environment, which challenges storage space, network bandwidth, and computing resources alike.
Data compression is a sensible way to cut the cost of storing and transmitting data, but traditional compression and encryption techniques fall short when faced with big data. For example, WinRAR has a data window of only 4MB, which limits its compression ratio, although its compression speed is fast enough.
A large data window may improve the compression ratio. However, the data window is hard to enlarge, because the index length grows with it and reduces the compression ratio.
Lossless compression, also called distortion-free compression, is one class of data compression techniques; its defining feature is that decompression restores exactly the original data. Software such as WinZip, WinRAR, and 7-zip uses lossless compression. The other class of data compression is lossy compression, which usually targets multimedia data such as sound, pictures, and video; the decompressed data differs from the original data, but the difference is not obvious to human perception. JPG pictures and DVD video, for example, use lossy compression. Every lossy compression method needs a lossless coding component at the end of its compression pipeline to complete the compression, so this patent applies equally to lossy compression.
The entropy encoder is an important component of lossless compression. It determines the code length of each character from the probability with which the character occurs: characters with high probability get short codes and characters with low probability get long codes, so that the encoded output is as short as possible and the data is compressed. Common entropy coding algorithms include arithmetic coding, range coding, and Huffman coding; for example, WinZip uses Huffman coding and 7-zip uses range coding. This patent applies equally when other entropy coding algorithms are used.
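The relation between probability and code length described above can be illustrated with a short calculation. The following is a minimal sketch, not part of the patent: it computes the ideal code length in bits, -log2(p), for each byte value of a sample input, and the function name ideal_code_lengths is a hypothetical helper chosen for illustration.

```python
import math
from collections import Counter

def ideal_code_lengths(data):
    """Return the ideal code length in bits, -log2(p), for each byte value in data."""
    counts = Counter(data)
    total = len(data)
    return {symbol: -math.log2(count / total) for symbol, count in counts.items()}

# In this sample "a" is frequent and "b" is rare, so "a" gets the shorter code.
lengths = ideal_code_lengths(b"aaaaaaab")
for symbol, bits in sorted(lengths.items()):
    print(chr(symbol), round(bits, 3))
```

A practical entropy encoder such as a Huffman or arithmetic coder approaches these ideal lengths; the sketch only shows the target it aims for.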
From the standpoint of theoretical classification, current lossless compression models and methods fall into the following 3 types:
1) Compression based on probability statistics, such as Huffman coding and arithmetic coding. Within this type, the PPM (Prediction by Partial Matching) algorithm, which is based on a Markov chain model, achieves a good compression ratio.
2) Compression based on dictionary indexing, such as the LZ77/LZSS and LZ78/LZW algorithms. The LZ family of compression models has an advantage in speed.
3) Compression based on the order and repetition of symbols, such as run-length coding and BWT (Burrows-Wheeler transform) coding.
Currently popular compression software combines the basic compression theories above. Each program usually integrates several compression models and methods to achieve a better result. The features of some popular compression programs are listed below:
1) Software name: WinZip
Compressed format: Deflate;
Basic algorithm: LZSS & Huffman coding;
Maximum data window size: 512KB;
Weaknesses: small data window; low compression ratio; weak support for big data.
2) Software name: WinRAR
Compressed format: RAR;
Basic algorithm: LZSS & Huffman coding;
Maximum data window size: 4MB;
Weaknesses: small data window; low compression ratio; weak support for big data.
3) Software name: Bzip2
Compressed format: BZ2;
Basic algorithm: BWT & Huffman coding;
Maximum data window (data block) size: 900KB;
Weaknesses: small BWT data block; low compression ratio; weak support for big data.
4) Software name: 7-zip
Compressed format: 7z;
Basic algorithm: LZSS & arithmetic coding (range coding is essentially the same as arithmetic coding);
Maximum data window size: 4GB;
Weaknesses: relatively small data window; limited support for big data.
Other compression programs also exist, such as PAQ and WinUDA. They may achieve higher compression ratios but are slow, which makes them unsuitable for compressing big data.
In summary, existing lossless data compression techniques are either too slow to compress big data at the GB level, the TB level, or beyond, or use a data window so small that the compression ratio is low. Simply enlarging the data window does not effectively improve the compression ratio, because a large data window requires a longer index, and the longer index lowers the compression ratio, so the loss outweighs the gain unless an effective compression encoding method and compressed format can be found.
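The growth in index length can be made concrete with a small calculation. The sketch below assumes, purely for illustration, a fixed-width index that must be able to address every distance in the window; the function name raw_index_bits is hypothetical and the patent does not describe existing software in these terms.

```python
def raw_index_bits(window_bytes):
    """Bits needed by a fixed-width index to address distances 1..window_bytes."""
    return (window_bytes - 1).bit_length()

# Every match pays the full index width, however close the matched string is.
for name, size in [("4MB", 4 * 2**20), ("4GB", 4 * 2**30), ("1TB", 2**40)]:
    print(name, raw_index_bits(size), "bits per index")
```

Under this assumption a 1TB window spends 40 bits on every index, nearly twice the 22 bits of a 4MB window, which is the cost that an effective encoding has to avoid.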
Summary of the invention
The object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a lossless compression encoding method supporting an oversize data window.
Another object of the present invention is to provide a lossless compression encoding system supporting an oversize data window.
The object of the present invention is achieved by the following technical solution:
A lossless compression encoding method supporting an oversize data window, comprising the following steps in order:
S1. A string matching encoder performs string matching on uncompressed data and generates multi-level length codes and multi-level index codes, which include single-character codes and control instruction codes;
S2. A segment cutter divides the multi-level length codes and multi-level index codes into code segments of 8 binary bits or another fixed length and supplies them to a classification statistics table and an entropy encoder; at the same time, it extracts code classification information from the multi-level length codes and multi-level index codes and supplies it to the classification statistics table;
S3. According to the code classification information, the classification statistics table first places the fixed-length code segments into different statistics tables for dynamic statistics, and then supplies a probability prediction table to the entropy encoder according to the values in the statistics tables;
S4. According to the probability prediction table, the entropy encoder compresses the fixed-length code segments using a compression coding algorithm based on probability statistics and outputs binary compressed data.
In step S1, the multi-level length codes are graded according to the length of the string to be encoded and follow a specific length code format.
Grading according to the length of the string to be encoded specifically comprises: using length codes of at least 3 levels, where codes of the same level have the same code length; a lower-level length code is shorter than a higher-level length code; shorter strings use lower-level length codes and longer strings use higher-level length codes, and the exact division into levels depends on the specific length code format;
The multi-level length code format specifically comprises: the L0 length code, representing a single character; the L1 length code, representing string lengths 2 to 11; the L2 length code, representing string lengths 12 to 75; the optional L3 length code, representing string lengths 76 to 325; and the optional L4 length code, which comprises an L4 linear length code and an L4 exponential length code, supports string lengths up to 16EB, and follows a specific code format that combines linear length and exponential length.
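The division of string lengths into levels L0 to L4 can be expressed as a simple lookup. The following is a minimal sketch based only on the ranges stated above; the function name length_level is hypothetical, and the bit layout of each level is not modelled because the text does not specify it.

```python
def length_level(n):
    """Return the length code level ("L0".."L4") for a string of length n."""
    if n < 1:
        raise ValueError("string length must be at least 1")
    if n == 1:
        return "L0"   # single character
    if n <= 11:
        return "L1"   # lengths 2 to 11
    if n <= 75:
        return "L2"   # lengths 12 to 75
    if n <= 325:
        return "L3"   # lengths 76 to 325
    return "L4"       # lengths 326 and above

print([length_level(n) for n in (1, 2, 11, 12, 75, 76, 325, 326)])
```

Short matches, which are the most frequent in typical data, land in the lowest levels and therefore receive the shortest length codes.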
In step S1, the multi-level index codes are graded according to the distance between the string to be encoded and the matched string and follow a specific index code format.
Grading according to the distance between the string to be encoded and the matched string specifically comprises: using index codes of at least 3 levels, where codes of the same level have the same code length; a lower-level index code is shorter than a higher-level index code; the index value equals the distance between the string to be encoded and the matched string; smaller index values use lower-level index codes and larger index values use higher-level index codes, and the exact division into levels depends on the specific index code format;
The multi-level index code format specifically comprises: the L1 index code, representing index values X1 to X1+Y1-1, where X1=1 and Y1=2^8; the L2 index code, representing index values X2 to X2+Y2-1, where X2=X1+Y1 and Y2=2^16; the L3 index code, representing index values X3 to X3+Y3-1, where X3=X2+Y2 and Y3=2^24; the optional L4 index code, representing index values X4 to X4+Y4-1, where X4=X3+Y3 and Y4=2^32; the optional L5 index code, representing index values X5 to X5+Y5-1, where X5=X4+Y4 and Y5=2^40; the optional L6 index code, representing index values X6 to X6+Y6-1, where X6=X5+Y5 and Y6=2^48; the optional L7 index code, representing index values X7 to X7+Y7-1, where X7=X6+Y6 and Y7=2^56; the optional L8 index code, representing index values X8 to X8+Y8-1, where X8=X7+Y7 and Y8=2^64; the data window sizes supported by the L1, L2, L3, L4, L5, L6, L7, and L8 index codes are, in order, 256B, 64KB, 16MB, 4GB, 1TB, 256TB, 64PB, and 16EB.
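The recurrence for X and Y can be evaluated directly to find the level of a given distance. The sketch below follows the stated definitions (X1 = 1, the next X equals the previous X plus the previous Y, and level k spans 2^(8k) values). Returning the offset of the distance within its level is an assumption made for illustration, and the names index_levels and index_payload are hypothetical.

```python
def index_levels(max_level=8):
    """Return (level, first index value, last index value) for levels 1..max_level."""
    levels = []
    first = 1                      # X1 = 1
    for k in range(1, max_level + 1):
        size = 2 ** (8 * k)        # Yk = 2^(8k)
        levels.append((k, first, first + size - 1))
        first += size              # next X = this X + this Y
    return levels

def index_payload(distance, max_level=8):
    """Return (level, offset within the level) for a match distance."""
    for k, first, last in index_levels(max_level):
        if first <= distance <= last:
            return k, distance - first
    raise ValueError("distance is outside the supported data window")

print(index_levels(3))
print(index_payload(257))
```

Because the ranges are consecutive rather than nested, the offset within level k always fits in 8k bits, so nearby matches never pay for the width that a distant match would need.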
In step S2, a classification in the classification statistics table consists of codes of more than one level.
In step S3, the statistics tables are designed according to an order-0, order-1, or order-2 PPM probability statistics model.
In step S4, the compression coding algorithm based on probability statistics is one of arithmetic coding, range coding, or Huffman coding.
Another object of the present invention is achieved by the following technical solution:
A lossless compression encoding system supporting an oversize data window, comprising a string matching encoder, a segment cutter, a classification statistics table, and an entropy encoder connected in sequence, wherein
The string matching encoder performs string matching on uncompressed data, generates multi-level length codes and multi-level index codes, which include single-character codes and control instruction codes, and supplies them to the segment cutter;
The segment cutter divides the multi-level length codes and multi-level index codes into code segments of 8 binary bits or another fixed length and supplies them to the classification statistics table and the entropy encoder; at the same time, it extracts code classification information from the multi-level length codes and multi-level index codes and supplies it to the classification statistics table; a classification may consist of codes of a single level or of codes of more than one level;
The classification statistics table places the fixed-length code segments into different statistics tables for dynamic statistics according to the code classification information, where each statistics table may be designed according to an order-0, order-1, or order-2 PPM probability statistics model; at the same time, it supplies a probability prediction table to the entropy encoder according to the values in the statistics tables;
The entropy encoder compresses the fixed-length code segments according to the probability prediction table, using arithmetic coding, range coding, Huffman coding, or another compression coding algorithm based on probability statistics, and outputs binary compressed data.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
With the method of the present invention, the index codes can support oversize data windows at the PB and EB levels, far larger than the MB-level and GB-level data windows of the currently common compression programs WinRAR and 7-zip, and the compression ratio does not drop as the data window and the index length grow; linear length codes and exponential length codes are both supported, so string lengths of up to 16EB can be encoded; these advantages in index and length coding allow the present invention to achieve a higher compression ratio than currently common compression software when compressing massive data at the GB level, the TB level, or above.
Brief description of the drawings
Fig. 1 is a flowchart of the lossless compression encoding method supporting an oversize data window according to the present invention;
Fig. 2 is a structural diagram of the lossless compression encoding system supporting an oversize data window according to the present invention;
Fig. 3 is a schematic diagram of the multi-level length code and multi-level index code formats output by the string matching encoder of the system shown in Fig. 2.
Detailed description of the embodiments
The present invention is described in further detail below with reference to an embodiment and the accompanying drawings, but embodiments of the present invention are not limited thereto.
A lossless compression encoding method supporting an oversize data window, comprising the following steps in order:
S1. A string matching encoder performs string matching on uncompressed data and generates multi-level length codes and multi-level index codes, which include single-character codes and control instruction codes;
The multi-level length codes are graded according to the length of the string to be encoded and follow a specific length code format.
Grading according to the length of the string to be encoded specifically comprises: using length codes of at least 3 levels, where codes of the same level have the same code length; a lower-level length code is shorter than a higher-level length code; shorter strings use lower-level length codes and longer strings use higher-level length codes, and the exact division into levels depends on the specific length code format.
The multi-level length code format specifically comprises: the L0 length code, representing a single character; the L1 length code, representing string lengths 2 to 11; the L2 length code, representing string lengths 12 to 75; the optional L3 length code, representing string lengths 76 to 325; and the optional L4 length code, which comprises an L4 linear length code and an L4 exponential length code, supports string lengths up to 16EB, and follows a specific code format that combines linear length and exponential length;
In step S1, the multi-level index codes are graded according to the distance between the string to be encoded and the matched string and follow a specific index code format.
Grading according to the distance between the string to be encoded and the matched string specifically comprises: using index codes of at least 3 levels, where codes of the same level have the same code length; a lower-level index code is shorter than a higher-level index code; the index value equals the distance between the string to be encoded and the matched string; smaller index values use lower-level index codes and larger index values use higher-level index codes, and the exact division into levels depends on the specific index code format.
The multi-level index code format specifically comprises: the L1 index code, representing index values X1 to X1+Y1-1, where X1=1 and Y1=2^8; the L2 index code, representing index values X2 to X2+Y2-1, where X2=X1+Y1 and Y2=2^16; the L3 index code, representing index values X3 to X3+Y3-1, where X3=X2+Y2 and Y3=2^24; the optional L4 index code, representing index values X4 to X4+Y4-1, where X4=X3+Y3 and Y4=2^32; the optional L5 index code, representing index values X5 to X5+Y5-1, where X5=X4+Y4 and Y5=2^40; the optional L6 index code, representing index values X6 to X6+Y6-1, where X6=X5+Y5 and Y6=2^48; the optional L7 index code, representing index values X7 to X7+Y7-1, where X7=X6+Y6 and Y7=2^56; the optional L8 index code, representing index values X8 to X8+Y8-1, where X8=X7+Y7 and Y8=2^64; the data window sizes supported by the L1, L2, L3, L4, L5, L6, L7, and L8 index codes are, in order, 256B, 64KB, 16MB, 4GB, 1TB, 256TB, 64PB, and 16EB;
S2. A segment cutter divides the multi-level length codes and multi-level index codes into code segments of 8 binary bits or another fixed length and supplies them to a classification statistics table and an entropy encoder; at the same time, it extracts code classification information from the multi-level length codes and multi-level index codes and supplies it to the classification statistics table;
A classification in the classification statistics table consists of codes of more than one level;
S3. According to the code classification information, the classification statistics table first places the fixed-length code segments into different statistics tables for dynamic statistics, and then supplies a probability prediction table to the entropy encoder according to the values in the statistics tables;
The statistics tables are designed according to an order-0, order-1, or order-2 PPM probability statistics model;
S4. According to the probability prediction table, the entropy encoder compresses the fixed-length code segments using a compression coding algorithm based on probability statistics and outputs binary compressed data;
The compression coding algorithm based on probability statistics is one of arithmetic coding, range coding, or Huffman coding.
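The segment cutting of step S2 can be sketched as follows. This is an illustrative sketch rather than the patented implementation: the code is treated as an unsigned integer of a known bit length, and both the most-significant-first order and the zero padding at the high end are assumptions, since the text does not fix them. The function name cut_segments is hypothetical.

```python
def cut_segments(code, bit_length, segment_bits=8):
    """Split a code of bit_length bits into fixed-length segments, most significant first."""
    if code < 0 or code >= (1 << bit_length):
        raise ValueError("code does not fit in bit_length bits")
    count = -(-bit_length // segment_bits)      # ceiling division
    mask = (1 << segment_bits) - 1
    return [(code >> (segment_bits * i)) & mask for i in range(count - 1, -1, -1)]

# A 24-bit code yields three 8-bit segments.
print([hex(s) for s in cut_segments(0x012345, 24)])
# A 12-bit code is padded with zero bits at the high end (assumed behaviour).
print([hex(s) for s in cut_segments(0xABC, 12)])
```

Cutting every code into segments of the same width lets the statistics tables and the entropy encoder work on a small fixed alphabet (256 values for 8-bit segments) regardless of how long the index or length code is.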
The method is described further below:
As shown in Fig. 1, a lossless compression encoding method supporting an oversize data window comprises the following steps in order:
S301: Initialize the index segment statistics tables tabI[i], where i runs from 1 to 8, giving 8 groups of index segment statistics tables;
S302: Initialize the length segment statistics tables tabL[i], where i runs from 0 to 4, giving 5 groups of length segment statistics tables;
S303: Determine whether no uncompressed data remains; if so, the flow ends; otherwise continue to step S304;
S304: Determine whether a special function is to be executed; if so, go to step S305; otherwise go to step S306;
S305: Set variable C to the control class, I to the control instruction code, and L to the control operand, then go to step S311;
S306: Read new uncompressed data;
S307: Perform string matching within the data window;
S308: Determine whether the string match succeeded; if so, go to step S309; otherwise go to step S310;
S309: Set variable C to the dictionary class, I to the index value of the match, and L to the string length of the match, then go to step S311;
S310: Set variable C to the single-character class and I to the ASCII code of the character, then go to step S311;
S311: Merge the three classes of codes represented by C into a unified code, generating the multi-level index code from I and the multi-level length code from L;
S312: Set variable Idx to the multi-level index code, S1 to the index level, Len to the multi-level length code, and S2 to the length level;
S313: Cut the index code Idx into fixed-length code segments Idx1, Idx2, Idx3, and so on; the number of code segments depends on the number of bits in the index code Idx;
S314: Cut the length code Len into fixed-length code segments Len1, Len2, Len3, and so on; the number of code segments depends on the number of bits in the length code Len;
S315: Entropy-encode the code segments Len1, Len2, Len3, and so on using the order-0, order-1, or order-2 PPM probability statistics table tabL[S2];
S316: Entropy-encode the code segments Idx1, Idx2, Idx3, and so on using the order-0, order-1, or order-2 PPM probability statistics table tabI[S1];
S317: Output the compressed data produced by entropy coding;
S318: Update the statistics table tabI[S1] according to the code segments Idx1, Idx2, Idx3, and so on, update the statistics table tabL[S2] according to the code segments Len1, Len2, Len3, and so on, and jump to step S303.
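The interaction between the per-level statistics tables and the entropy coding in steps S301, S316, and S318 can be sketched as below. Several simplifications are assumptions made for illustration: only the index tables tabI are modelled (the length tables tabL would work the same way), each table is a plain order-0 adaptive frequency table whose counts start at 1, and the ideal cost -log2(p) stands in for a real arithmetic or range coder. The names Order0Table and estimate_bits are hypothetical.

```python
import math

class Order0Table:
    """Adaptive order-0 frequency table over 8-bit code segments."""
    def __init__(self):
        self.counts = [1] * 256    # start at 1 so every value has nonzero probability
        self.total = 256

    def probability(self, segment):
        return self.counts[segment] / self.total

    def update(self, segment):
        self.counts[segment] += 1
        self.total += 1

def estimate_bits(tokens, levels=8):
    """tokens: iterable of (index level S1, list of 8-bit segments). Returns estimated bits."""
    tab_i = {level: Order0Table() for level in range(1, levels + 1)}   # S301
    bits = 0.0
    for level, segments in tokens:
        table = tab_i[level]                 # table chosen by classification (level S1)
        for seg in segments:                 # S316: code each segment with that table
            bits += -math.log2(table.probability(seg))
        for seg in segments:                 # S318: update the same table afterwards
            table.update(seg)
    return bits

repeated = [(2, [0x01, 0x40])] * 50
print(round(estimate_bits(repeated), 1), "bits versus", 50 * 16, "raw bits")
```

Keeping a separate table per level means that the byte distribution of, say, two-segment L2 index codes never dilutes the statistics of the single-segment L1 codes, which is the purpose of extracting the classification information before entropy coding.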
As shown in Fig. 2, a lossless compression encoding system supporting an oversize data window comprises a string matching encoder, a segment cutter, a classification statistics table, and an entropy encoder connected in sequence, wherein
String matching encoder 101 performs string matching on uncompressed data, generates multi-level length codes and multi-level index codes, which include single-character codes and control instruction codes, and supplies them to segment cutter 102;
Segment cutter 102 divides the multi-level length codes and multi-level index codes into code segments of 8 binary bits or another fixed length and supplies them to classification statistics table 103 and entropy encoder 104; at the same time, it extracts code classification information from the multi-level length codes and multi-level index codes and supplies it to classification statistics table 103; a classification may consist of codes of a single level or of codes of more than one level;
Classification statistics table 103 places the fixed-length code segments into different statistics tables for dynamic statistics according to the code classification information, where each statistics table may be designed according to an order-0, order-1, or order-2 PPM probability statistics model; at the same time, it supplies a probability prediction table to entropy encoder 104 according to the values in the statistics tables;
Entropy encoder 104 compresses the fixed-length code segments according to the probability prediction table, using arithmetic coding, range coding, Huffman coding, or another compression coding algorithm based on probability statistics, and outputs binary compressed data.
As shown in Fig. 3, the multi-level length code and multi-level index code formats output by the string matching encoder are as follows:
String matching encoding is performed on each section of uncompressed data, and the result is a multi-level length code immediately followed by a multi-level index code, where the multi-level length code format starts from length code branch 201 below and the multi-level index code format starts from index code branch 212;
Length code branch 201: depending on the case, a single character is encoded as L0 length code 202, a length of 2 to 11 is encoded as L1 length code 203, a length of 12 to 75 is encoded as L2 length code 204, and all other cases continue at L3 length branch 205;
L3 length branch 205: depending on the case, a control word is encoded as control instruction code 210, a length of 76 to 325 is encoded as L3 linear length code 206, and lengths of 326 and above continue at L4 length branch 207;
L4 length branch 207: depending on the case, a length of 326 to 65535 is encoded as L4 linear length code 208, and a length from 2^16 to 2^64 is encoded as L4 exponential length code 209;
Control instruction code 210: immediately followed by control operand code 211;
Index code branch 212: depending on the case, a match position within the 256-byte range of data window 1 is encoded as L1 index code 213, a match position within the 64KB range of data window 2 is encoded as L2 index code 214, a match position within the 16MB range of data window 3 is encoded as L3 index code 215, a match position within the 4GB range of data window 4 is encoded as L4 index code 216, a match position within the 1TB range of data window 5 is encoded as L5 index code 217, a match position within the 256TB range of data window 6 is encoded as L6 index code 218, a match position within the 64PB range of data window 7 is encoded as L7 index code 219, and a match position within the 16EB range of data window 8 is encoded as L8 index code 220.
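The decision made at L4 length branch 207 depends only on the string length. The following minimal sketch returns the Fig. 3 reference numeral of the L4 length code that applies; the function name l4_length_code is hypothetical, and the bit layout of codes 208 and 209 is not modelled because the text does not give it.

```python
def l4_length_code(n):
    """Return the Fig. 3 reference numeral of the L4 length code used for string length n."""
    if 326 <= n <= 65535:
        return 208   # L4 length code for lengths 326 to 65535 (linear form)
    if 2**16 <= n <= 2**64:
        return 209   # L4 length code for lengths from 2^16 to 2^64
    raise ValueError("length is not handled by L4 length branch 207")

print(l4_length_code(1000), l4_length_code(10**9))
```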
The embodiment described above is a preferred embodiment of the present invention, but embodiments of the present invention are not limited to it; any other change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be regarded as an equivalent replacement and falls within the scope of protection of the present invention.

Claims (9)

1. a lossless compression-encoding method of supporting super-huge data window, is characterized in that the step that comprises following order:
S1. string matching scrambler carries out string matching to packed data not, generates the multiple-length coding, the multiple index coding that have comprised individual character coding and control command code;
S2. segmentation cutter becomes multiple-length coding, multiple index code division the encoded segment of 8 of scale-of-two or other regular lengths, offers categorised statistical form and entropy coder; Meanwhile, from multiple-length coding, multiple index coding, extract coding specification information, offer categorised statistical form;
S3. according to coding specification information, first categorised statistical form is put into the encoded segment of regular length in different statistical forms and carries out dynamic statistics, then according to the numerical value of statistical form, provides probabilistic forecasting table to entropy coder;
S4. according to probabilistic forecasting table, entropy coder uses the Coding Compression Algorithm based on probability statistics, and fixed-length code (FLC) segmentation is carried out to compressed encoding, exports binary packed data.
2. the lossless compression-encoding method of the super-huge data window of support according to claim 1, is characterized in that: in step S1, described multiple-length coding, carrys out classification according to string length to be encoded, and follow specific length coding form.
3. the lossless compression-encoding method of the super-huge data window of support according to claim 2, it is characterized in that, the described string length according to band coding is carried out classification, specifically comprises: adopt the length coding of at least 3 grades, the code length of same rank is identical; More low-level length coding, advanced other length coding is short; Shorter character string adopts more low-level length coding, and longer character string adopts the length coding of higher level, and concrete partition of the level is relevant to specific length coding form;
Described multiple-length coded format, specifically comprises: L0 length coding, represents single character; L1 length coding, represents string length 2 to 11; L2 length coding, represents string length 12 to 75; Optional L3 length coding, represents string length 76 to 325; Optional L4 length coding, comprises L4 lineal measure coding and L4 index length coding, can support the string length of maximum 16EB, and follows the specific lineal measure form combining with index length coding of encoding.
4. The lossless compressed encoding method supporting an oversize data window according to claim 1, characterized in that: in step S1, the multi-level index codes are graded according to the distance between the string to be encoded and the matched string, and follow a specific index coding format.
5. The lossless compressed encoding method supporting an oversize data window according to claim 4, characterized in that grading by the distance between the string to be encoded and the matched string specifically comprises: using index codes of at least 3 levels, where all codes of the same level have the same code length; a lower-level index code is shorter than a higher-level index code; the index value equals the distance between the string to be encoded and the matched string; smaller index values use lower-level index codes and larger index values use higher-level index codes, with the specific division into levels depending on the specific index coding format;
The multi-level index coding format specifically comprises: an L1 index code, representing index values X1 to X1+Y1-1, where X1 = 1 and Y1 = 2^8; an L2 index code, representing index values X2 to X2+Y2-1, where X2 = X1+Y1 and Y2 = 2^16; an L3 index code, representing index values X3 to X3+Y3-1, where X3 = X2+Y2 and Y3 = 2^24; an optional L4 index code, representing index values X4 to X4+Y4-1, where X4 = X3+Y3 and Y4 = 2^32; an optional L5 index code, representing index values X5 to X5+Y5-1, where X5 = X4+Y4 and Y5 = 2^40; an optional L6 index code, representing index values X6 to X6+Y6-1, where X6 = X5+Y5 and Y6 = 2^48; an optional L7 index code, representing index values X7 to X7+Y7-1, where X7 = X6+Y6 and Y7 = 2^56; and an optional L8 index code, representing index values X8 to X8+Y8-1, where X8 = X7+Y7 and Y8 = 2^64. The data window sizes supported by the L1, L2, L3, L4, L5, L6, L7 and L8 index codes are, in order, 256 B, 64 KB, 16 MB, 4 GB, 1 TB, 256 TB, 64 PB and 16 EB.
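The recurrence in claim 5 (X1 = 1, Xk = X(k-1) + Y(k-1), Yk = 2^(8k)) can be tabulated directly. The sketch below only computes the index ranges and looks up the level for a given match distance; the function names `index_levels` and `index_level` are hypothetical, and the bit layout of the codes themselves is not specified in the claim and is not modelled.

```python
def index_levels(max_level=8):
    """Build (name, first_index, last_index) for levels L1..L<max_level>."""
    levels = []
    start = 1                        # X1 = 1
    for k in range(1, max_level + 1):
        size = 2 ** (8 * k)          # Yk = 2^(8k)
        levels.append(("L%d" % k, start, start + size - 1))
        start += size                # X(k+1) = Xk + Yk
    return levels


def index_level(distance, max_level=8):
    """Return the lowest index-code level whose range contains distance."""
    for name, first, last in index_levels(max_level):
        if first <= distance <= last:
            return name
    raise ValueError("distance is outside the supported data window")
```

Because each level starts where the previous one ends, the ranges do not overlap: L1 covers 1 to 256, L2 covers 257 to 65792, and L3 starts at 65793.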
6. The lossless compressed encoding method supporting an oversize data window according to claim 1, characterized in that: in step S2, a category in the classification statistical table is the codes of more than one level.
7. The lossless compressed encoding method supporting an oversize data window according to claim 1, characterized in that: in step S3, the statistical table is designed according to an order-0, order-1 or order-2 PPM probability statistical model.
8. The lossless compressed encoding method supporting an oversize data window according to claim 1, characterized in that: in step S4, the compression coding algorithm based on probability statistics is one of arithmetic coding, range coding or Huffman coding.
9. A lossless compressed encoding system supporting an oversize data window, characterized in that it comprises a string matching encoder, a segment cutter, a classification statistical table and an entropy encoder connected in sequence, wherein
the string matching encoder is configured to perform string matching on uncompressed data, generate multi-level length codes and multi-level index codes, which include single-character codes and control instruction codes, and supply them to the segment cutter;
the segment cutter is configured to divide the multi-level length codes and multi-level index codes into code segments of 8 binary bits or another fixed length and supply them to the classification statistical table and the entropy encoder, and at the same time to extract coding category information from the multi-level length codes and multi-level index codes and supply it to the classification statistical table, where one category may be the codes of a single level or the codes of more than one level;
the classification statistical table is configured to place the fixed-length code segments into different statistical tables according to the coding category information for dynamic statistics, where each statistical table used may be designed according to an order-0, order-1 or order-2 PPM probability statistical model, and at the same time to supply a probability prediction table to the entropy encoder according to the values in the statistical tables;
the entropy encoder is configured to compress the fixed-length code segments according to the probability prediction table, using arithmetic coding, range coding, Huffman coding or another compression coding algorithm based on probability statistics, and to output binary compressed data.
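The data flow between the segment cutter, the classification statistical table and the entropy encoder in claim 9 can be sketched as follows. Several simplifications are assumptions, not part of the claim: each code is modelled as a `(category, value, n_bytes)` tuple, only an order-0 adaptive table is kept per category, every count starts at 1, and the entropy encoder is replaced by its ideal cost of -log2(p) bits per segment instead of a real arithmetic or range coder. The names `cut_segments`, `ClassifiedStats` and `estimated_bits` are hypothetical.

```python
import math
from collections import defaultdict


def cut_segments(value, n_bytes):
    """Segment cutter: split a fixed-width code into 8-bit segments."""
    return list(value.to_bytes(n_bytes, "big"))


class ClassifiedStats:
    """One adaptive order-0 frequency table per coding category."""

    def __init__(self):
        # Assumption: counts start at 1 so no segment has zero probability.
        self.tables = defaultdict(lambda: [1] * 256)

    def probability(self, category, segment):
        table = self.tables[category]
        return table[segment] / sum(table)

    def update(self, category, segment):
        self.tables[category][segment] += 1


def estimated_bits(codes):
    """Ideal compressed size in bits for (category, value, n_bytes) codes."""
    stats = ClassifiedStats()
    bits = 0.0
    for category, value, n_bytes in codes:
        for segment in cut_segments(value, n_bytes):
            # Predict first, then update, as a decoder would have to.
            bits -= math.log2(stats.probability(category, segment))
            stats.update(category, segment)
    return bits
```

Keeping a separate table per category means that segments of, say, long index codes do not dilute the statistics of short length codes, so lengthening one class of code need not worsen the predictions for another.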
CN201410317732.3A 2014-07-03 2014-07-03 Lossless compressed encoding method and system supporting oversize data window Active CN104156990B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410317732.3A CN104156990B (en) 2014-07-03 2014-07-03 Lossless compressed encoding method and system supporting oversize data window

Publications (2)

Publication Number Publication Date
CN104156990A true CN104156990A (en) 2014-11-19
CN104156990B CN104156990B (en) 2018-02-27

Family

ID=51882478

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410317732.3A Active CN104156990B (en) 2014-07-03 2014-07-03 Lossless compressed encoding method and system supporting oversize data window

Country Status (1)

Country Link
CN (1) CN104156990B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0878914A2 (en) * 1997-05-12 1998-11-18 Lexmark International, Inc. Data compression method and apparatus
CN1251449A (en) * 1998-10-18 2000-04-26 华强 Combined use with reference of two category dictionary compress algorithm in data compaction
CN1447603A (en) * 2003-01-10 2003-10-08 李春林 Data compress method based on higher order entropy of message source
US20070152853A1 (en) * 2005-12-30 2007-07-05 Vtech Telecommunications Limited Dictionary-based compression of melody data and compressor/decompressor for the same
CN101090501A (en) * 2006-06-13 2007-12-19 财团法人工业技术研究院 Mould search type variable-length code-decode method and device
CN103067022A (en) * 2012-12-19 2013-04-24 中国石油天然气集团公司 Nondestructive compressing method, uncompressing method, compressing device and uncompressing device for integer data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QIN JIANCHENG ET AL: "Design of new format for mass data compression", 《THE JOURNAL OF CHINA UNIVERSITIES OF POSTS AND TELECOMMUNICATIONS》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109478893A (en) * 2016-07-25 2019-03-15 株式会社高速屋 Data compression coding method, coding/decoding method, its device and its program
CN110377288A (en) * 2018-04-13 2019-10-25 赛灵思公司 Neural network compresses compiler and its compiling compression method
CN110868222A (en) * 2019-11-29 2020-03-06 中国人民解放军战略支援部队信息工程大学 LZSS compressed data error code detection method and device
CN110868222B (en) * 2019-11-29 2023-12-15 中国人民解放军战略支援部队信息工程大学 LZSS compressed data error code detection method and device
CN111177432A (en) * 2019-12-23 2020-05-19 北京航空航天大学 Large-scale image retrieval method based on hierarchical depth hash
CN112380196A (en) * 2020-10-28 2021-02-19 安擎(天津)计算机有限公司 Server for data compression transmission
CN112380196B (en) * 2020-10-28 2023-03-21 安擎(天津)计算机有限公司 Server for data compression transmission
CN117238504A (en) * 2023-11-01 2023-12-15 江苏亿通高科技股份有限公司 Smart city CIM data optimization processing method
CN117238504B (en) * 2023-11-01 2024-04-09 江苏亿通高科技股份有限公司 Smart city CIM data optimization processing method

Also Published As

Publication number Publication date
CN104156990B (en) 2018-02-27

Similar Documents

Publication Publication Date Title
CN104156990A (en) Lossless compressed encoding method and system supporting oversize data window
CN112953550B (en) Data compression method, electronic device and storage medium
CN106407285A (en) RLE and LZW-based optimized bit file compression and decompression method
CN101095284B (en) Device and data method for selective compression and decompression and data format for compressed data
CN110518917B (en) LZW data compression method and system based on Huffman coding
EP2455853A2 (en) Data compression method
CN107565971B (en) Data compression method and device
CN104125475B (en) Multi-dimensional quantum data compressing and uncompressing method and apparatus
CN103236847A (en) Multilayer Hash structure and run coding-based lossless compression method for data
US20190140657A1 (en) Data compression coding method, apparatus therefor, and program therefor
CN103067022A (en) Nondestructive compressing method, uncompressing method, compressing device and uncompressing device for integer data
Bhattacharjee et al. Comparison study of lossless data compression algorithms for text data
CN103258030A (en) Mobile device memory compression method based on dictionary encoding and run-length encoding
CN107565970B (en) Hybrid lossless compression method and device based on feature recognition
Spiegel et al. A comparative experimental study of lossless compression algorithms for enhancing energy efficiency in smart meters
CN105306951A (en) Pipeline parallel acceleration method for data compression encoding and system architecture thereof
CN116016606B (en) Sewage treatment operation and maintenance data efficient management system based on intelligent cloud
CN104467868A (en) Chinese text compression method
CN101534124B (en) Compression algorithm for short natural language
CN103428498A (en) Lossless image compression system
CN113312325A (en) Track data transmission method, device, equipment and storage medium
KR102068383B1 (en) Entropy modifier and method
Mahmood et al. An Efficient 6 bit Encoding Scheme for Printable Characters by table look up
JP2022048930A (en) Data compression method, data compression device, data compression program, data decompression method, data decompression device, and data decompression program
CN104682966A (en) Non-destructive compressing method for list data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
OL01 Intention to license declared