CN104156990B

CN104156990B - Lossless compression encoding method and system supporting an ultra-large data window

Info

Publication number: CN104156990B
Application number: CN201410317732.3A
Authority: CN
Inventors: 覃健诚; 钟宇; 陆以勤
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2014-07-03
Filing date: 2014-07-03
Publication date: 2018-02-27
Anticipated expiration: 2034-07-03
Also published as: CN104156990A

Abstract

The invention discloses a lossless compression encoding method supporting an ultra-large data window, comprising the following steps: a string matching encoder performs string matching on uncompressed data and generates multi-level length codes and multi-level index codes that include single-character codes and control instruction codes; a segment cutter divides the multi-level length codes and multi-level index codes into code segments of 8 binary bits or another fixed number of binary bits, and extracts coding classification information from them; classified statistics tables keep dynamic statistics on the fixed-length code segments, and a probability prediction table is computed from the values in the statistics tables; according to the probability prediction table, an entropy coder applies a compression coding algorithm based on probability statistics to the fixed-length code segments and outputs binary compressed data. The invention can support ultra-large data windows from the GB level up to the EB level, and the compression ratio is not reduced by the longer index codes that such windows require.

Description

Lossless compression encoding method and system supporting an ultra-large data window

Technical field

The present invention relates to the field of information coding for lossless data compression, and more particularly to a lossless compression encoding method and system supporting an ultra-large data window.

Background technology

With the rapid development of cloud computing, data volumes are growing at an astonishing rate. As a development trend of the information industry, big data is becoming more and more important. At the same time a problem arises: how can data at the TB level, the PB level or even larger be handled efficiently? This big data has to be stored and transmitted in a network environment, which is a challenge for storage space, network bandwidth and computing resources alike.

Data compression is a sensible way to reduce the cost of storing and transmitting data, but when faced with big data, traditional compression and encryption technologies fall short. For example, the software WinRAR has a small data window of only 4 MB, which limits the compression ratio, and its compression speed is not fast enough.

A large data window can improve the compression ratio. Enlarging the data window is difficult, however, because the index length grows with it, which in turn lowers the compression ratio.
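The growth in index length can be illustrated with a short calculation. This is a minimal sketch that assumes a plain fixed-width index (one bit pattern per byte offset in the window); the function name `index_bits` is hypothetical and not part of the patent.

```python
def index_bits(window_bytes: int) -> int:
    """Bits needed by a fixed-width index that can address every offset in the window."""
    if window_bytes < 1:
        raise ValueError("window size must be positive")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, using integers only.
    return (window_bytes - 1).bit_length()


if __name__ == "__main__":
    for name, size in [("4 MB", 4 * 2**20), ("4 GB", 4 * 2**30), ("1 TB", 2**40)]:
        print(f"{name} window: {index_bits(size)} index bits per match")
```

Under this assumption, going from a 4 MB window to a 1 TB window nearly doubles the cost of every index, which is the overhead that a graded (multi-level) index code is meant to reduce.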

Lossless compression, also called distortion-free compression, is one kind of data compression technology; its defining feature is that decompression restores the data exactly. Software such as WinZip, WinRAR and 7-zip uses lossless compression. The other kind of data compression technology is called lossy compression. It is generally applied to multimedia data such as sound, pictures and video, and its defining feature is that the decompressed data differs from the original data, but the difference is not noticeable to human perception. JPG pictures and DVD video, for example, use lossy compression. Every lossy compression coding method needs a lossless compression coding component at the end of its compression pipeline to complete the compression, so this patent applies equally to lossy compression.

The entropy coder is an important component of lossless compression. Its principle is to determine the length of each character's code from the probability with which the character occurs: characters with high probability get short codes and characters with low probability get long codes, so that the output code is as short as possible and the data is compressed. Common algorithms used by entropy coders include arithmetic coding, range coding and Huffman coding; for example, WinZip uses Huffman coding and 7-zip uses range coding. This patent applies equally when other entropy coding algorithms are used.
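The relation between probability and code length can be made concrete with the standard information-theoretic bound: a symbol with probability p ideally costs -log2(p) bits. The sketch below only evaluates this bound; it is not an entropy coder, and the function names are illustrative.

```python
import math


def ideal_code_length(p: float) -> float:
    """Ideal code length in bits for a symbol that occurs with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(p)


def expected_bits(probs) -> float:
    """Average bits per symbol when every symbol gets its ideal code length."""
    return sum(p * ideal_code_length(p) for p in probs if p > 0)


if __name__ == "__main__":
    print(ideal_code_length(0.5))            # frequent symbol, short code
    print(ideal_code_length(1 / 256))        # rare symbol, long code
    print(expected_bits([0.5, 0.25, 0.25]))  # average for a skewed 3-symbol source
```

Arithmetic and range coders can approach these fractional lengths closely, while Huffman coding rounds each code to a whole number of bits.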

In terms of theoretical classification, current lossless compression mathematical models and methods fall into the following 3 types:

1) Compression based on probability statistics, such as Huffman coding and arithmetic coding. Within this type, the PPM (Prediction by Partial Matching) algorithm, which is based on a Markov chain model, achieves a good compression ratio.

2) Compression based on dictionary indexes, such as the LZ77/LZSS algorithms and the LZ78/LZW algorithms. The LZ family of compression models has an advantage in speed.

3) Compression based on the order and repetition of symbols, such as run-length coding and BWT (Burrows-Wheeler transform) coding.

Currently popular compression software applies these basic compression theories in combination. Each program usually integrates several compression models and methods to achieve a better result. The characteristics of some popular compression software are as follows:

1) Software name: WinZip

Compressed format: Deflate;

Basic algorithm: LZSS & Huffman coding;

Maximum data window size: 512 KB;

Shortcomings: small data window; low compression ratio; weak support for big data.

2) Software name: WinRAR

Compressed format: RAR;

Basic algorithm: LZSS & Huffman coding;

Maximum data window size: 4 MB;

3) Software name: Bzip2

Compressed format: BZ2;

Basic algorithm: BWT & Huffman coding;

Maximum data window (data block) size: 900 KB;

Shortcomings: small BWT data block; low compression ratio; weak support for big data.

4) Software name: 7-zip

Compressed format: 7z;

Basic algorithm: LZSS & arithmetic coding (range coding is essentially the same as arithmetic coding);

Maximum data window size: 4 GB;

Shortcomings: relatively small data window; limited support for big data.

There are other compression programs as well, such as PAQ and WinUDA. They may achieve a higher compression ratio, but they are slower and not suitable for compressing big data.

In summary, existing lossless data compression technologies are either too slow, and therefore unsuitable for compressing data at the GB level, the TB level or above, or have a small data window, which leads to a low compression ratio. Simply enlarging the data window, however, does not effectively raise the compression ratio, because a large data window needs longer indexes, and longer indexes lower the compression ratio. The loss outweighs the gain unless an effective compression coding method and compressed format can be found.

Summary of the invention

An object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a lossless compression encoding method supporting an ultra-large data window.

Another object of the present invention is to provide a lossless compression encoding system supporting an ultra-large data window.

The first object of the present invention is achieved by the following technical solution:

A lossless compression encoding method supporting an ultra-large data window, comprising the following steps in order:

S1. A string matching encoder performs string matching on uncompressed data and generates multi-level length codes and multi-level index codes that include single-character codes and control instruction codes;

S2. A segment cutter divides the multi-level length codes and multi-level index codes into code segments of 8 binary bits or another fixed number of binary bits, and provides them to the classified statistics tables and the entropy coder; at the same time, it extracts coding classification information from the multi-level length codes and multi-level index codes and provides it to the classified statistics tables;

S3. According to the coding classification information, the classified statistics tables first place the fixed-length code segments into different statistics tables for dynamic statistics, and then, according to the values in the statistics tables, provide a probability prediction table to the entropy coder;

S4. According to the probability prediction table, the entropy coder applies a compression coding algorithm based on probability statistics to the fixed-length code segments and outputs binary compressed data.
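Step S2 can be sketched as follows. This is an illustration under stated assumptions, not the patented bit layout: segments are taken most significant first, the payload of a level-k index code is assumed to be an 8k-bit offset within that level, and the classification information is modelled as a `("index", level)` tuple. The names `cut_segments` and `cut_index_code` are hypothetical.

```python
def cut_segments(value: int, n_bits: int, seg_bits: int = 8) -> list:
    """Split an n_bits-wide code into fixed-length segments, most significant first."""
    if n_bits <= 0 or n_bits % seg_bits != 0:
        raise ValueError("code width must be a positive multiple of the segment width")
    if not 0 <= value < (1 << n_bits):
        raise ValueError("value does not fit in n_bits")
    mask = (1 << seg_bits) - 1
    return [(value >> shift) & mask
            for shift in range(n_bits - seg_bits, -1, -seg_bits)]


def cut_index_code(level: int, offset: int):
    """Return (classification info, segments) for a level-`level` index code.

    Assumption: the payload is the 8*level-bit offset of the index value
    within its level; the real format may carry additional prefix bits.
    """
    return ("index", level), cut_segments(offset, 8 * level)


if __name__ == "__main__":
    print(cut_segments(0x012345, 24))   # three 8-bit segments
    print(cut_index_code(2, 0x00FF))    # class label plus two 8-bit segments
```

The class label travels to the statistics tables, while the segments go to both the statistics tables and the entropy coder, which matches the data flow described in steps S2 and S3.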

In step S1, the multi-level length codes are graded according to the length of the string to be encoded and follow a specific length coding format.

Grading according to the length of the string to be encoded specifically comprises: using at least 3 levels of length codes, where codes of the same level have the same code length; a lower-level length code is shorter than a higher-level length code; shorter strings use lower-level length codes and longer strings use higher-level length codes; the specific level boundaries depend on the specific length coding format;

The multi-level length coding format specifically comprises: the L0 length code, representing a single character; the L1 length code, representing string lengths 2 to 11; the L2 length code, representing string lengths 12 to 75; the optional L3 length code, representing string lengths 76 to 325; and the optional L4 length code, comprising an L4 linear length code and an L4 exponential length code, which supports string lengths of up to 16 EB and follows a specific format combining linear length coding with exponential length coding.
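The level boundaries listed above map directly onto a small lookup. The sketch below only selects the level for a given match length; the bit layout of each level is not specified in this passage and is not modelled. `LENGTH_LEVELS` and `length_level` are illustrative names.

```python
# (level name, smallest length, largest length), taken from the format above.
LENGTH_LEVELS = [
    ("L0", 1, 1),      # single character
    ("L1", 2, 11),     # 10 distinct lengths
    ("L2", 12, 75),    # 64 distinct lengths
    ("L3", 76, 325),   # 250 distinct lengths
]

MAX_LENGTH = 2**64     # 16 EB, the largest length the L4 code is stated to support


def length_level(n: int) -> str:
    """Return the name of the length-code level used for a string of length n."""
    if not 1 <= n <= MAX_LENGTH:
        raise ValueError("length out of the supported range")
    for name, lo, hi in LENGTH_LEVELS:
        if lo <= n <= hi:
            return name
    return "L4"        # 326 and above: linear or exponential L4 code


if __name__ == "__main__":
    for n in (1, 2, 11, 12, 75, 76, 325, 326, 10**9):
        print(n, length_level(n))
```

Because short matches are by far the most frequent in typical data, giving them the shortest level keeps the average length code small even though very long matches remain representable.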

In step S1, the multi-level index codes are graded according to the distance between the string to be encoded and the matched string, and follow a specific index coding format.

Grading according to the distance between the string to be encoded and the matched string specifically comprises: using at least 3 levels of index codes, where codes of the same level have the same code length; a lower-level index code is shorter than a higher-level index code; the index value equals the distance between the string to be encoded and the matched string; smaller index values use lower-level index codes and larger index values use higher-level index codes; the specific level boundaries depend on the specific index coding format;

The multi-level index coding format specifically comprises: the L1 index code, representing index values X1 to X1+Y1-1, where X1 = 1 and Y1 = 2^8; the L2 index code, representing index values X2 to X2+Y2-1, where X2 = X1+Y1 and Y2 = 2^16; the L3 index code, representing index values X3 to X3+Y3-1, where X3 = X2+Y2 and Y3 = 2^24; the optional L4 index code, representing index values X4 to X4+Y4-1, where X4 = X3+Y3 and Y4 = 2^32; the optional L5 index code, representing index values X5 to X5+Y5-1, where X5 = X4+Y4 and Y5 = 2^40; the optional L6 index code, representing index values X6 to X6+Y6-1, where X6 = X5+Y5 and Y6 = 2^48; the optional L7 index code, representing index values X7 to X7+Y7-1, where X7 = X6+Y6 and Y7 = 2^56; the optional L8 index code, representing index values X8 to X8+Y8-1, where X8 = X7+Y7 and Y8 = 2^64. The data window sizes supported by the L1, L2, L3, L4, L5, L6, L7 and L8 index codes are, in order, 256 B, 64 KB, 16 MB, 4 GB, 1 TB, 256 TB, 64 PB and 16 EB.
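The recurrence X1 = 1, Xk = X(k-1) + Y(k-1), Yk = 2^(8k) can be evaluated directly. The sketch below computes the value range of each level and maps a match distance to its level and to its offset within that level; how the level itself is signalled in the bit stream is not covered here. The function names are hypothetical.

```python
def index_level_ranges(levels: int = 8):
    """Return [(k, X_k, X_k + Y_k - 1)] with X_1 = 1 and Y_k = 2**(8*k)."""
    ranges = []
    x = 1
    for k in range(1, levels + 1):
        y = 2 ** (8 * k)
        ranges.append((k, x, x + y - 1))
        x += y                      # X_{k+1} = X_k + Y_k
    return ranges


def index_level(distance: int, levels: int = 8):
    """Return (level k, offset within level k) for a match distance."""
    for k, lo, hi in index_level_ranges(levels):
        if lo <= distance <= hi:
            return k, distance - lo
    raise ValueError("distance is outside every index level")


if __name__ == "__main__":
    for k, lo, hi in index_level_ranges(3):
        print(f"L{k}: {lo} .. {hi}")
    print(index_level(300))         # falls in L2
```

Since each level starts where the previous one ends, no index value is representable twice, and the offset within level k always fits in exactly 8k bits.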

In step S2, the classification used by the classified statistics tables divides the codes into N levels, where N >= 1.

In step S3, the statistics tables are designed according to an order-0, order-1 or order-2 PPM probability statistics model.
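A minimal sketch of classified statistics tables, using only the order-0 model (the simplest of the three options named above). Two details are assumptions made for the illustration and are not taken from the patent: every count starts at 1 so that no segment value has zero probability, and a class is identified by an arbitrary hashable label such as `("index", 2)`.

```python
from collections import defaultdict


class Order0Table:
    """Adaptive order-0 frequency table over fixed-length segment values."""

    def __init__(self, symbols: int = 256):
        self.counts = [1] * symbols     # assumed initial count of 1 per symbol
        self.total = symbols

    def probability(self, segment: int) -> float:
        return self.counts[segment] / self.total

    def update(self, segment: int) -> None:
        self.counts[segment] += 1
        self.total += 1


class ClassifiedTables:
    """One independent statistics table per coding class."""

    def __init__(self):
        self.tables = defaultdict(Order0Table)

    def predict(self, cls, segment: int) -> float:
        return self.tables[cls].probability(segment)

    def update(self, cls, segment: int) -> None:
        self.tables[cls].update(segment)


if __name__ == "__main__":
    tabs = ClassifiedTables()
    for _ in range(100):
        tabs.update(("index", 1), 7)    # segment value 7 is common in this class
    print(tabs.predict(("index", 1), 7))
    print(tabs.predict(("length", 1), 7))   # other classes are unaffected
```

Keeping the classes separate means that the skewed distribution of, say, the first segment of a short index does not dilute the statistics of length codes or of higher-level indexes.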

In step S4, the compression coding algorithm based on probability statistics is one of arithmetic coding, range coding or Huffman coding.

The other object of the present invention is achieved by the following technical solution:

A lossless compression encoding system supporting an ultra-large data window, comprising a string matching encoder, a segment cutter, classified statistics tables and an entropy coder connected in sequence, wherein

the string matching encoder is configured to perform string matching on uncompressed data, generate multi-level length codes and multi-level index codes that include single-character codes and control instruction codes, and provide them to the segment cutter;

the segment cutter is configured to divide the multi-level length codes and multi-level index codes into code segments of 8 binary bits or another fixed number of binary bits and provide them to the classified statistics tables and the entropy coder, and at the same time to extract coding classification information from the multi-level length codes and multi-level index codes and provide it to the classified statistics tables, where one class comprises codes of M levels, M >= 1;

the classified statistics tables are configured to place the fixed-length code segments into different statistics tables for dynamic statistics according to the coding classification information, where each statistics table used can be designed according to an order-0, order-1 or order-2 PPM probability statistics model, and at the same time to provide a probability prediction table to the entropy coder according to the values in the statistics tables;

the entropy coder is configured to compress the fixed-length code segments according to the probability prediction table, using arithmetic coding, range coding, Huffman coding or another compression coding algorithm based on probability statistics, and to output binary compressed data.

Compared with the prior art, the present invention has the following advantages and beneficial effects:

In the method of the present invention, the index coding supports ultra-large data windows at the PB and EB level, far larger than the MB-level and GB-level data windows of the current common compression software WinRAR and 7-zip, and the compression ratio is not reduced by the longer indexes that a larger data window requires. Linear length coding and exponential length coding are both supported, so string lengths of up to 16 EB can be encoded. With these advantages in index and length coding, the present invention can obtain a higher compression ratio than current common compression software when compressing massive data at the GB level, the TB level or above.

Brief description of the drawings

Fig. 1 is a flow chart of the lossless compression encoding method supporting a large data window according to the present invention;

Fig. 2 is a structural diagram of the lossless compression encoding system supporting a large data window according to the present invention;

Fig. 3 is a schematic diagram of the multi-level length coding and multi-level index coding formats output by the string matching encoder of the system shown in Fig. 2.

Detailed description of the embodiments

The present invention is described in further detail below with reference to an embodiment and the accompanying drawings, but the embodiments of the present invention are not limited thereto.

The multi-level length codes are graded according to the length of the string to be encoded and follow a specific length coding format.

Grading according to the length of the string to be encoded specifically comprises: using at least 3 levels of length codes, where codes of the same level have the same code length; a lower-level length code is shorter than a higher-level length code; shorter strings use lower-level length codes and longer strings use higher-level length codes; the specific level boundaries depend on the specific length coding format.

The multi-level length coding format specifically comprises: the L0 length code, representing a single character; the L1 length code, representing string lengths 2 to 11; the L2 length code, representing string lengths 12 to 75; the optional L3 length code, representing string lengths 76 to 325; and the optional L4 length code, comprising an L4 linear length code and an L4 exponential length code, which supports string lengths of up to 16 EB and follows a specific format combining linear length coding with exponential length coding.

Grading according to the distance between the string to be encoded and the matched string specifically comprises: using at least 3 levels of index codes, where codes of the same level have the same code length; a lower-level index code is shorter than a higher-level index code; the index value equals the distance between the string to be encoded and the matched string; smaller index values use lower-level index codes and larger index values use higher-level index codes; the specific level boundaries depend on the specific index coding format.

The multi-level index coding format specifically comprises: the L1 index code, representing index values X1 to X1+Y1-1, where X1 = 1 and Y1 = 2^8; the L2 index code, representing index values X2 to X2+Y2-1, where X2 = X1+Y1 and Y2 = 2^16; the L3 index code, representing index values X3 to X3+Y3-1, where X3 = X2+Y2 and Y3 = 2^24; the optional L4 index code, representing index values X4 to X4+Y4-1, where X4 = X3+Y3 and Y4 = 2^32; the optional L5 index code, representing index values X5 to X5+Y5-1, where X5 = X4+Y4 and Y5 = 2^40; the optional L6 index code, representing index values X6 to X6+Y6-1, where X6 = X5+Y5 and Y6 = 2^48; the optional L7 index code, representing index values X7 to X7+Y7-1, where X7 = X6+Y6 and Y7 = 2^56; the optional L8 index code, representing index values X8 to X8+Y8-1, where X8 = X7+Y7 and Y8 = 2^64. The data window sizes supported by the L1, L2, L3, L4, L5, L6, L7 and L8 index codes are, in order, 256 B, 64 KB, 16 MB, 4 GB, 1 TB, 256 TB, 64 PB and 16 EB.

The classification used by the classified statistics tables divides the codes into N levels, where N >= 1.

The statistics tables are designed according to an order-0, order-1 or order-2 PPM probability statistics model.

S4. According to the probability prediction table, the entropy coder applies a compression coding algorithm based on probability statistics to the fixed-length code segments and outputs binary compressed data.

The compression coding algorithm based on probability statistics is one of arithmetic coding, range coding or Huffman coding.

The method is described further below:

As shown in Fig. 1, a lossless compression encoding method supporting an ultra-large data window comprises the following steps in order:

S301: Initialize the index segment statistics tables tabI[i], where i runs from 1 to 8, giving 8 groups of index segment statistics tables;

S302: Initialize the length segment statistics tables tabL[i], where i runs from 0 to 4, giving 5 groups of length segment statistics tables;

S303: Determine whether any uncompressed data remains; if none remains, the flow ends; otherwise continue with step S304;

S304: Determine whether a special function is to be performed; if so, go to step S305; otherwise go to step S306;

S305: Set variable C to the control class, I to the control instruction code and L to the control operand; go to step S311;

S306: Read new uncompressed data;

S307: Perform string matching within the data window;

S308: Determine whether string matching succeeded; if so, go to step S309; otherwise go to step S310;

S309: Set variable C to the dictionary class, I to the index value of the match and L to the string length of the match; go to step S311;

S310: Set variable C to the single-character class and I to the ASCII code of the character; go to step S311;

S311: Merge the three classes of codes represented by C into a unified coding; generate the multi-level index code from I and the multi-level length code from L;

S312: Set variable Idx to the multi-level index code, S1 to the index level, Len to the multi-level length code and S2 to the length level;

S313: Cut the index code Idx into fixed-length code segments Idx1, Idx2, Idx3 and so on; the number of segments depends on the number of bits in the index code Idx;

S314: Cut the length code Len into fixed-length code segments Len1, Len2, Len3 and so on; the number of segments depends on the number of bits in the length code Len;

S315: Entropy-code the segments Len1, Len2, Len3 and so on using the order-0, order-1 or order-2 PPM probability statistics table tabL[S2];

S316: Entropy-code the segments Idx1, Idx2, Idx3 and so on using the order-0, order-1 or order-2 PPM probability statistics table tabI[S1];

S317: Output the compressed data produced by entropy coding;

S318: Update statistics table tabI[S1] according to the segments Idx1, Idx2, Idx3 and so on, update statistics table tabL[S2] according to the segments Len1, Len2, Len3 and so on, and jump to step S303.
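The matching part of this flow (steps S303 and S306 to S310) can be sketched with a deliberately naive encoder. This is an illustration under simplifying assumptions: the control class of S304/S305 is omitted, the search is a brute-force scan rather than whatever search structure a real implementation would use, the minimum match length of 2 is taken from the lower bound of the L1 length code, and grading, segment cutting and entropy coding (S311 to S318) are left out. `tokenize` and `detokenize` are hypothetical names; the decoder exists only to check that the tokens are lossless.

```python
def tokenize(data: bytes, window: int = 1 << 16, min_len: int = 2):
    """Turn data into (C, I, L) triples: ("char", byte, 1) or ("dict", distance, length)."""
    tokens = []
    pos = 0
    while pos < len(data):                              # S303: data remaining?
        best_len, best_dist = 0, 0
        start = max(0, pos - window)
        for cand in range(pos - 1, start - 1, -1):      # S307: nearest candidates first
            n = 0
            while pos + n < len(data) and data[cand + n] == data[pos + n]:
                n += 1
            if n > best_len:                            # ties keep the smaller distance
                best_len, best_dist = n, pos - cand
        if best_len >= min_len:                         # S308 -> S309: dictionary class
            tokens.append(("dict", best_dist, best_len))
            pos += best_len
        else:                                           # S308 -> S310: single-character class
            tokens.append(("char", data[pos], 1))
            pos += 1
    return tokens


def detokenize(tokens) -> bytes:
    """Rebuild the original bytes from the triples produced by tokenize."""
    out = bytearray()
    for kind, i, length in tokens:
        if kind == "char":
            out.append(i)
        else:
            for _ in range(length):
                out.append(out[-i])     # copy from `i` bytes back; overlap is allowed
    return bytes(out)


if __name__ == "__main__":
    sample = b"abcabcabc"
    toks = tokenize(sample)
    print(toks)
    print(detokenize(toks) == sample)
```

In the full method, the I and L values of each triple would then be graded into levels, cut into fixed-length segments and entropy-coded with the table selected by the level, as steps S311 to S318 describe.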

As shown in Fig. 2, a lossless compression encoding system supporting an ultra-large data window comprises a string matching encoder, a segment cutter, classified statistics tables and an entropy coder connected in sequence, wherein

the string matching encoder 101 is configured to perform string matching on uncompressed data, generate multi-level length codes and multi-level index codes that include single-character codes and control instruction codes, and provide them to the segment cutter 102;

the segment cutter 102 is configured to divide the multi-level length codes and multi-level index codes into code segments of 8 binary bits or another fixed number of binary bits and provide them to the classified statistics tables 103 and the entropy coder 104, and at the same time to extract coding classification information from the multi-level length codes and multi-level index codes and provide it to the classified statistics tables 103, where one class comprises codes of M levels, M >= 1;

the classified statistics tables 103 are configured to place the fixed-length code segments into different statistics tables for dynamic statistics according to the coding classification information, where each statistics table used can be designed according to an order-0, order-1 or order-2 PPM probability statistics model, and at the same time to provide a probability prediction table to the entropy coder 104 according to the values in the statistics tables;

the entropy coder 104 is configured to compress the fixed-length code segments according to the probability prediction table, using arithmetic coding, range coding, Huffman coding or another compression coding algorithm based on probability statistics, and to output binary compressed data.

As shown in Fig. 3, the multi-level length coding and multi-level index coding formats output by the string matching encoder are as follows:

String matching encoding of each section of uncompressed data yields a multi-level length code followed by a multi-level index code. The multi-level length coding format starts from the length coding branch point 201, and the multi-level index coding format starts from the index coding branch point 212;

Length coding branch point 201: depending on the case, a single character is encoded as the L0 length code 202, a length of 2 to 11 is encoded as the L1 length code 203, a length of 12 to 75 is encoded as the L2 length code 204, and all other cases proceed to the L3 length branch point 205;

L3 length branch point 205: depending on the case, a control word is encoded as the control instruction code 210, a length of 76 to 325 is encoded as the L3 linear length code 206, and a length of 326 or more proceeds to the L4 length branch point 207;

L4 length branch point 207: depending on the case, a length of 326 to 65535 is encoded as the L4 linear length code 208, and a length from 2^16 to 2^64 is encoded as the L4 exponential length code 209;

Control instruction code 210: followed by the control operand code 211;

Index coding branch point 212: depending on the case, a match position within data window 1 of 256 bytes is encoded as the L1 index code 213, a match position within data window 2 of 64 KB is encoded as the L2 index code 214, a match position within data window 3 of 16 MB is encoded as the L3 index code 215, a match position within data window 4 of 4 GB is encoded as the L4 index code 216, a match position within data window 5 of 1 TB is encoded as the L5 index code 217, a match position within data window 6 of 256 TB is encoded as the L6 index code 218, a match position within data window 7 of 64 PB is encoded as the L7 index code 219, and a match position within data window 8 of 16 EB is encoded as the L8 index code 220.
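The decision made at the L4 length branch point 207 can be sketched as a range check. Only the two ranges are taken from the text; how the linear and exponential codes are laid out in bits is not described in this passage, so the function returns just the name of the chosen code. `l4_length_kind` is a hypothetical name.

```python
L4_LINEAR_MIN = 326
L4_LINEAR_MAX = 65535          # 2**16 - 1
L4_EXP_MIN = 2**16
L4_EXP_MAX = 2**64             # 16 EB


def l4_length_kind(n: int) -> str:
    """Choose between the L4 linear (208) and L4 exponential (209) length code."""
    if L4_LINEAR_MIN <= n <= L4_LINEAR_MAX:
        return "L4-linear"
    if L4_EXP_MIN <= n <= L4_EXP_MAX:
        return "L4-exponential"
    raise ValueError("length is not handled at the L4 branch point")


if __name__ == "__main__":
    for n in (326, 65535, 65536, 2**40):
        print(n, l4_length_kind(n))
```

The two ranges are adjacent (65535 and 65536), so every length of 326 or more up to 2^64 is covered by exactly one of the two codes.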

The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited by it. Any other change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be regarded as an equivalent replacement and is included within the scope of protection of the present invention.

Claims

1. A lossless compression encoding method supporting an ultra-large data window, characterized by comprising the following steps in order:

S1. A string matching encoder performs string matching on uncompressed data and generates multi-level length codes and multi-level index codes that include single-character codes and control instruction codes;

the multi-level length codes are graded according to the length of the string to be encoded and follow a specific length coding format;

grading according to the length of the string to be encoded specifically comprises: using at least 3 levels of length codes, where codes of the same level have the same code length; a lower-level length code is shorter than a higher-level length code; shorter strings use lower-level length codes and longer strings use higher-level length codes; the specific level boundaries depend on the specific length coding format;

the multi-level length coding format specifically comprises: the L0 length code, representing a single character; the L1 length code, representing string lengths 2 to 11; the L2 length code, representing string lengths 12 to 75; the L3 length code, representing string lengths 76 to 325; and the L4 length code, comprising an L4 linear length code and an L4 exponential length code, which supports string lengths of up to 16 EB and follows a specific format combining linear length coding with exponential length coding;

the multi-level index codes are graded according to the distance between the string to be encoded and the matched string, and follow a specific index coding format;

grading according to the distance between the string to be encoded and the matched string specifically comprises: using at least 3 levels of index codes, where codes of the same level have the same code length; a lower-level index code is shorter than a higher-level index code; the index value equals the distance between the string to be encoded and the matched string; smaller index values use lower-level index codes and larger index values use higher-level index codes; the specific level boundaries depend on the specific index coding format;

the multi-level index coding format specifically comprises: the L1 index code, representing index values X1 to X1+Y1-1, where X1 = 1 and Y1 = 2^8; the L2 index code, representing index values X2 to X2+Y2-1, where X2 = X1+Y1 and Y2 = 2^16; the L3 index code, representing index values X3 to X3+Y3-1, where X3 = X2+Y2 and Y3 = 2^24; the L4 index code, representing index values X4 to X4+Y4-1, where X4 = X3+Y3 and Y4 = 2^32; the L5 index code, representing index values X5 to X5+Y5-1, where X5 = X4+Y4 and Y5 = 2^40; the L6 index code, representing index values X6 to X6+Y6-1, where X6 = X5+Y5 and Y6 = 2^48; the L7 index code, representing index values X7 to X7+Y7-1, where X7 = X6+Y6 and Y7 = 2^56; the L8 index code, representing index values X8 to X8+Y8-1, where X8 = X7+Y7 and Y8 = 2^64; the data window sizes supported by the L1, L2, L3, L4, L5, L6, L7 and L8 index codes are, in order, 256 B, 64 KB, 16 MB, 4 GB, 1 TB, 256 TB, 64 PB and 16 EB;

S2. A segment cutter divides the multi-level length codes and multi-level index codes into code segments of 8 binary bits or another fixed number of binary bits, and provides them to the classified statistics tables and the entropy coder; at the same time, it extracts coding classification information from the multi-level length codes and multi-level index codes and provides it to the classified statistics tables;

S3. According to the coding classification information, the classified statistics tables first place the fixed-length code segments into different statistics tables for dynamic statistics, and then, according to the values in the statistics tables, provide a probability prediction table to the entropy coder;

S4. According to the probability prediction table, the entropy coder applies a compression coding algorithm based on probability statistics to the fixed-length code segments and outputs binary compressed data.
2. The lossless compression encoding method supporting an ultra-large data window according to claim 1, characterized in that: in step S2, the classification used by the classified statistics tables divides the codes into N levels, where N >= 1.
3. The lossless compression encoding method supporting an ultra-large data window according to claim 1, characterized in that: in step S3, the statistics tables are designed according to an order-0, order-1 or order-2 PPM probability statistics model.
4. The lossless compression encoding method supporting an ultra-large data window according to claim 1, characterized in that: in step S4, the compression coding algorithm based on probability statistics is one of arithmetic coding, range coding or Huffman coding.
5. A lossless compression encoding system supporting an ultra-large data window, for implementing the lossless compression encoding method supporting an ultra-large data window according to any one of claims 1 to 4, characterized by comprising a string matching encoder, a segment cutter, classified statistics tables and an entropy coder connected in sequence, wherein

the string matching encoder is configured to perform string matching on uncompressed data, generate multi-level length codes and multi-level index codes that include single-character codes and control instruction codes, and provide them to the segment cutter;

the segment cutter is configured to divide the multi-level length codes and multi-level index codes into code segments of 8 binary bits or another fixed number of binary bits and provide them to the classified statistics tables and the entropy coder, and at the same time to extract coding classification information from the multi-level length codes and multi-level index codes and provide it to the classified statistics tables, where one class comprises codes of M levels, M >= 1;

the classified statistics tables are configured to place the fixed-length code segments into different statistics tables for dynamic statistics according to the coding classification information, where each statistics table used is designed according to an order-0, order-1 or order-2 PPM probability statistics model, and at the same time to provide a probability prediction table to the entropy coder according to the values in the statistics tables;

the entropy coder is configured to compress the fixed-length code segments according to the probability prediction table, using arithmetic coding, range coding, Huffman coding or another compression coding algorithm based on probability statistics, and to output binary compressed data.