CN106559084B - A lossless data compression coding method based on arithmetic coding

Info

Publication number: CN106559084B
Application number: CN201611026314.4A
Authority: CN
Inventors: 陆成刚
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Hangzhou Markov Technology Co ltd
Priority date: 2016-11-15
Filing date: 2016-11-15
Publication date: 2019-07-30
Anticipated expiration: 2036-11-15
Also published as: CN106559084A

Abstract

A lossless data compression coding method based on arithmetic coding, comprising the following procedure. 1) Encoding: assume the probability distribution of the n source symbols is $\{0 < p_1 \le p_2 \le \dots \le p_n < 1\}$, arranged in monotonically increasing order of subscript; the n symbols are relabeled $\{n-1, n-2, \dots, 1, 0\}$, so that the symbol with probability $p_i$ receives the label $n-i$, $i = 1, 2, \dots, n$. Assume the input is a character string $J_1 J_2 J_3 \dots J_{K-1} J_K$ of length K, where $J_i \in \{0, 1, 2, \dots, n-1\}$. The string is defined to correspond to a rational number τ in the unit interval [0,1]; τ, regarded as a decimal number and proved mathematically to lie in [0,1], is converted to its binary representation, and the resulting binary sequence is the output bit stream of the encoder. 2) Decoding: the bit stream is converted back into the corresponding decimal rational number in the unit interval [0,1]. The present invention requires no division, is an entropy-optimal code, does not rely on block extension, and is flexible to use.

Description

A lossless data compression coding method based on arithmetic coding

Technical field

The invention belongs to the field of data compression and relates to a lossless data compression coding method.

Background art

Arithmetic coding is a classical and well-known lossless data compression coding method. Data compression occupies an extremely important position in information technology, and compression methods are divided into lossy and lossless compression. Lossy compression is generally based on the characteristics of the human perceptual organs (the auditory and visual systems): it removes a large amount of perceptually redundant data without affecting the perceived quality of what people hear and see. For example, MP3 audio coding exploits the time-domain and frequency-domain masking properties and the hearing threshold of the human auditory system, while MPEG video coding exploits the limited frame sampling rate of the human eye and uses inter-frame motion estimation to make effective compensation, removing a large amount of inter-frame redundancy to achieve compression. Lossless compression is usually applied to computer files and medical image data, where the data must be recovered exactly. In addition, lossless compression also serves as the back end of a lossy compression system, further compressing the lossy-compressed data stream it receives without loss, as shown in Fig. 1.

Huffman coding is a widely used entropy coding algorithm in lossless compression. Compared with arithmetic coding, its biggest problem is that it encodes the character stream by block extension: as the block width increases, the coding redundancy decreases and the code approaches the entropy optimum. However, a larger block width makes the code table more complex to design, and a fixed block width scales poorly and is inflexible in practice, so Huffman coding is in fact inferior to arithmetic coding. Arithmetic coding is naturally a streaming code with no notion of block extension: an input character stream of any length can be mapped to a rational number in the unit interval [0,1] (or to mutually disjoint subintervals), so its applicability and flexibility are both higher than those of Huffman coding. Given an accurate estimate of the probability distribution of the character source, arithmetic coding is also an entropy-optimal code: the more accurate the estimate, the lower the redundancy of the arithmetic code and the closer it comes to the entropy optimum. Because of these advantages, arithmetic coding is the most thoroughly studied technique and the subject of the most granted patents; although many of these patents, dating from the 1990s, have expired, many arithmetic-coding patents have again been applied for and granted in recent years. Conversely, because arithmetic coding is so widely patented, considerable room for industrial application has been left to Huffman coding.

The idea of classical arithmetic coding is to map the input character stream to a rational number (in decimal form) in the interval [0,1] and then express this number in binary as a bit stream, which is the coding result. Decoding converts the bit stream back into a decimal fraction in [0,1] and, using the probability distribution of the symbol source, recovers the characters one by one from first to last. The encoding process is as follows. Let the symbol source be $\{1, 2, 3, \dots, n\}$ with probability distribution $\{p_1, p_2, \dots, p_n\}$, and let the input character string $J_1 J_2 J_3 \dots J_{K-1} J_K$ have length K, where $J_i \in \{1, 2, \dots, n\}$. The left and right endpoints of the interval corresponding to the string are, respectively,

$$\tau^l = \sum_{i=1}^{K} \Big( \sum_{j=1}^{J_i - 1} p_j \Big) \prod_{m=1}^{i-1} p_{J_m}, \qquad \tau^r = \tau^l + \prod_{m=1}^{K} p_{J_m},$$

where by convention an empty sum equals 0 and an empty product equals 1. It can be seen that the width of the interval corresponding to the string is $\prod_{m=1}^{K} p_{J_m}$. The rational number τ corresponding to the string may be defined as the midpoint of the interval, $\tau = (\tau^l + \tau^r)/2$.

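The decoding process is as follows. First introduce the notation $F^i = \sum_{j=1}^{i} p_j$, where $F^0 = 0$ and $F^n = 1$. Let $\tau^l_L$ and $\tau^r_L$ denote the left and right endpoints of the interval corresponding to the length-L prefix of the input string, with $\tau^l_0 = 0$ and $\tau^r_0 = 1$. Set L = 1 and decode in the following four steps:

1) Compute $\tau^* = (\tau - \tau^l_{L-1}) / (\tau^r_{L-1} - \tau^l_{L-1})$.

2) Determine the $i^*$ that satisfies $F^{i^*-1} \le \tau^* \le F^{i^*}$; then $J_L = i^*$.

3) Set $\tau^l_L = \tau^l_{L-1} + F^{i^*-1} (\tau^r_{L-1} - \tau^l_{L-1})$ and $\tau^r_L = \tau^l_{L-1} + F^{i^*} (\tau^r_{L-1} - \tau^l_{L-1})$.

4) If L < K, set L = L + 1 and return to step 1); otherwise stop.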

The above is the framework of arithmetic decoding; an actual implementation still involves some details of software coding.
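The classical encoding and the four decoding steps above can be sketched in Python as follows. This is a minimal illustration rather than anything taken from the patent: the function names, the use of exact `fractions.Fraction` arithmetic, and the half-open test $F^{i-1} \le \tau^* < F^i$ (chosen to break ties at interval boundaries) are all assumptions. The division in step 1) is visible in `classic_decode`.

```python
from fractions import Fraction


def cumulative(probs):
    """Return [F^0, F^1, ..., F^n] with F^i = p_1 + ... + p_i."""
    F = [Fraction(0)]
    for p in probs:
        F.append(F[-1] + p)
    return F


def classic_encode(symbols, probs):
    """Map symbols (values in 1..n) to the midpoint of their interval."""
    F = cumulative(probs)
    low, width = Fraction(0), Fraction(1)
    for s in symbols:
        low += F[s - 1] * width      # shift the left endpoint
        width *= probs[s - 1]        # shrink the interval by p_s
    return low + width / 2           # tau = (tau^l + tau^r) / 2


def classic_decode(tau, probs, length):
    """Recover `length` symbols from tau using steps 1) to 4)."""
    F = cumulative(probs)
    low, high = Fraction(0), Fraction(1)
    out = []
    for _ in range(length):
        t = (tau - low) / (high - low)                    # step 1 (division)
        i = next(i for i in range(1, len(F)) if F[i - 1] <= t < F[i])  # step 2
        out.append(i)
        span = high - low
        low, high = low + F[i - 1] * span, low + F[i] * span           # step 3
    return out


# Hypothetical three-symbol source, chosen only for illustration.
probs = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]
message = [3, 2, 1, 2]
tau = classic_encode(message, probs)
assert classic_decode(tau, probs, len(message)) == message
```

Exact rationals keep the sketch free of rounding effects; a practical coder would use fixed-precision integer arithmetic with renormalization instead.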

Existing arithmetic coding methods remain at the level of software implementation details or apply arithmetic coding to the design of other systems.

Summary of the invention

To overcome the reliance on division in existing arithmetic coding methods, the present invention provides a lossless data compression coding method based on arithmetic coding that requires no division. Like existing arithmetic coding, it is an entropy-optimal code; it does not rely on block extension and is flexible to use.

The technical solution adopted by the present invention to solve the technical problems is:

A lossless data compression coding method based on arithmetic coding, comprising the following procedure:

1) Encoding: assume the probability distribution of the n source symbols is $\{0 < p_1 \le p_2 \le \dots \le p_n < 1\}$, arranged in monotonically increasing order of subscript. The n symbols are relabeled $\{n-1, n-2, \dots, 1, 0\}$, so that the symbol with probability $p_i$ receives the label $n-i$, $i = 1, 2, \dots, n$. Assume the input is a character string $J_1 J_2 J_3 \dots J_{K-1} J_K$ of length K, where $J_i \in \{0, 1, 2, \dots, n-1\}$; the string is then defined to correspond to the following rational number τ in the unit interval [0,1]:

$$\tau = \sum_{L=1}^{K} J_L \, p_{n-J_L}^{L};$$

Regarding τ as a decimal number, which can be proved mathematically to lie in the unit interval [0,1], convert it to its binary representation; the resulting binary sequence is the output bit stream of the encoder;

2) Decoding: convert the bit stream into the corresponding decimal rational number in the unit interval [0,1], denoted τ. Set L = 1 and $\tau_1 = \tau$, and decode in the following three steps:

2.1) In the set $\{k \, p_{n-k}^{L} : k = 0, 1, \dots, n-1\}$, find the largest k satisfying $\tau_L \ge k \, p_{n-k}^{L}$ and denote it $k^*$;

2.2) Set $J_L = k^*$;

2.3) If L < K, set $\tau_{L+1} = \tau_L - k^* p_{n-k^*}^{L}$ and L = L + 1, and return to step 2.1); otherwise stop.

Further, in step 2.1) a k satisfying $\tau_L \ge k \, p_{n-k}^{L}$ always exists, since at least k = 0 always qualifies; hence a largest $k^*$ exists every time. The resulting $J_1 J_2 J_3 \dots J_{K-1} J_K$ is the decoded result.
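The encoding and the three decoding steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the mapping $\tau = \sum_{L=1}^{K} J_L \, p_{n-J_L}^{L}$ is inferred from the update rule in step 2.3), the function names are hypothetical, and exact `fractions.Fraction` arithmetic stands in for whatever number representation a real coder would use. The decoder uses only multiplication, comparison and subtraction.

```python
from fractions import Fraction


def encode_labels(labels, probs):
    """tau = sum over positions L of J_L * p_{n-J_L}^L.

    probs is [p_1, ..., p_n] in increasing order, so the symbol with
    label k has probability p_{n-k} = probs[n - k - 1].
    """
    n = len(probs)
    return sum(k * probs[n - k - 1] ** L for L, k in enumerate(labels, start=1))


def decode_labels(tau, probs, length):
    """Steps 2.1) to 2.3): pick the largest admissible k, then subtract."""
    n = len(probs)
    labels = []
    for L in range(1, length + 1):
        # Step 2.1: k = 0 always qualifies, so max() never sees an empty set.
        k_star = max(k for k in range(n) if tau >= k * probs[n - k - 1] ** L)
        labels.append(k_star)                        # step 2.2
        tau -= k_star * probs[n - k_star - 1] ** L   # step 2.3
    return labels


# Source of Example one: probabilities 1/4, 1/4, 1/2 carry labels 2, 1, 0.
probs = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]
tau = encode_labels([0, 1, 2, 1], probs)   # label sequence of "CBAB"
assert tau == Fraction(25, 256)
assert decode_labels(tau, probs, 4) == [0, 1, 2, 1]
```

The string length K is passed to the decoder explicitly here; how K is conveyed alongside the bit stream is not specified in the text above.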

The beneficial effects of the present invention are mainly as follows: it is also a streaming code that does not rely on block extension, so it is flexible to use; in addition, it is proved mathematically to be an entropy-optimal code as well.

Brief description of the drawings

Fig. 1 is a schematic diagram of lossless compression used as the back-end module of a lossy compression system.

Detailed description of the embodiments

The invention will be further described below.

A lossless data compression coding method based on arithmetic coding, which from a certain point of view can be regarded as breaking, in its basic idea, with the framework of the original arithmetic coding method, comprises the following procedure:

1) Encoding: assume the probability distribution of the n source symbols is $\{0 < p_1 \le p_2 \le \dots \le p_n < 1\}$; note that the probabilities are arranged in monotonically increasing order of subscript (classical arithmetic coding imposes no such ordering requirement). The n symbols are relabeled $\{n-1, n-2, \dots, 1, 0\}$, so that the symbol with probability $p_i$ receives the label $n-i$, $i = 1, 2, \dots, n$. Assume the input is a character string $J_1 J_2 J_3 \dots J_{K-1} J_K$ of length K, where $J_i \in \{0, 1, 2, \dots, n-1\}$; the string is then defined to correspond to the rational number $\tau = \sum_{L=1}^{K} J_L \, p_{n-J_L}^{L}$, which can be proved mathematically to lie in the unit interval [0,1]. Regarded as a decimal number, τ is converted to its binary representation, and the resulting binary sequence is the output bit stream of the encoder;

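2) Decoding: convert the above bit stream into the corresponding decimal rational number in the unit interval [0,1], denoted τ. Set L = 1 and $\tau_1 = \tau$, and decode in the following three steps:

2.1) In the set $\{k \, p_{n-k}^{L} : k = 0, 1, \dots, n-1\}$, find the largest k satisfying $\tau_L \ge k \, p_{n-k}^{L}$ and denote it $k^*$;

2.2) Set $J_L = k^*$;

2.3) If L < K, set $\tau_{L+1} = \tau_L - k^* p_{n-k^*}^{L}$ and L = L + 1, and return to step 2.1); otherwise stop.

In step 2.1), a k satisfying $\tau_L \ge k \, p_{n-k}^{L}$ always exists, since at least k = 0 always qualifies; hence a largest $k^*$ exists every time. The resulting $J_1 J_2 J_3 \dots J_{K-1} J_K$ is the decoded result.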

Example one: a lossless data compression coding method based on arithmetic coding, the process of which is as follows:

Symbol	A	B	C
Probability	1/4	1/4	1/2
Number	2	1	0

1) Input the character string CBAB

The corresponding label sequence is 0121, the corresponding rational number is 25/256 = 0.09765625, and the corresponding bit stream is 00011001. To decode, first convert 00011001 into the decimal rational number 25/256. In the set {1/2, 1/4, 0} (listed for k = 2, 1, 0), the largest k satisfying $\tau_L \ge k \, p_{n-k}^{L}$ is 0, so $J_1 = 0$. In the set {1/8, 1/16, 0}, the largest such k is 1, so $J_2 = 1$, leaving 25/256 - 1/16 = 9/256. For 9/256 in the set {1/32, 1/64, 0}, the largest such k is 2, so $J_3 = 2$, leaving 9/256 - 1/32 = 1/256. Finally, for 1/256 in the set {1/128, 1/256, 0}, the largest such k is 1, so $J_4 = 1$. Decoding thus yields $J_1 J_2 J_3 J_4$ = 0121, which corresponds to the character string CBAB.

2) Input the character string CACCB

The corresponding label sequence is 02001, the corresponding rational number is 129/1024 = 0.1259765625, and the corresponding bit stream is 0010000001. Decoding converts 0010000001 back into the rational number 0.1259765625. In the set {1/2, 1/4, 0}, the largest k satisfying $\tau_L \ge k \, p_{n-k}^{L}$ is 0, so $J_1 = 0$. In the set {1/8, 1/16, 0}, the largest k meeting the requirement is 2, so $J_2 = 2$. Similarly, $J_3 = J_4 = 0$ and $J_5 = 1$.
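The conversion between τ and the bit stream in both cases of this example can be checked with a short Python sketch. The helper names `to_bits` and `from_bits` are hypothetical, and the number of bits is passed in by hand (8 and 10 here, matching the bit streams quoted above); the text does not state how the bit length is chosen in general, so that part is an assumption.

```python
from fractions import Fraction


def to_bits(tau, nbits):
    """Binary expansion of a fraction in [0, 1), most significant bit first."""
    bits = []
    for _ in range(nbits):
        tau *= 2
        bit = 1 if tau >= 1 else 0   # next binary digit
        bits.append(str(bit))
        tau -= bit
    return "".join(bits)


def from_bits(bits):
    """Inverse of to_bits: sum of b_i / 2^i over the bit positions."""
    return sum(Fraction(int(b), 2 ** i) for i, b in enumerate(bits, start=1))


assert to_bits(Fraction(25, 256), 8) == "00011001"        # string CBAB
assert from_bits("00011001") == Fraction(25, 256)
assert to_bits(Fraction(129, 1024), 10) == "0010000001"   # string CACCB
assert from_bits("0010000001") == Fraction(129, 1024)
```

Both values have denominators that are powers of two, so their binary expansions terminate; a source with other probabilities would need a rule for truncating the expansion.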

Example two: a lossless data compression coding method based on arithmetic coding, the process of which is as follows:

The input character string ABCCCCE corresponds to the label sequence 2100003, and the corresponding rational number is 0.4400003. Decoding proceeds as follows. In {0.3, 0.4, 0.2, 0}, the largest k meeting the requirement is 2; next, in {0.03, 0.08, 0.04, 0}, the largest k meeting the requirement is 1; in {0.003, 0.016, 0.008, 0} the k is 0, and it remains 0 through the sixth step, giving four consecutive zeros; at the seventh step, in {0.0000003, 0.0000256, 0.0000128, 0}, the k is 3. Decoding finally yields 2100003.
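The decoding trace of this example can be reproduced with the Python sketch below, under two assumptions that are not stated in the text. First, the probabilities 0.1, 0.2, 0.2, 0.5 (for labels 3, 2, 1, 0) are inferred from the candidate sets listed above. Second, the selection rule is read as "the k whose term $k \, p_{n-k}^{L}$ is the largest one not exceeding $\tau_L$": read literally as the largest index k, the first step would admit k = 3, since 0.3 does not exceed 0.4400003, whereas the example selects k = 2. The function name is hypothetical.

```python
from fractions import Fraction


def decode_largest_term(tau, probs, length):
    """Decode by choosing, at each step, the largest term k * p_{n-k}^L <= tau.

    This selection rule is an interpretation chosen to match Example two.
    """
    n = len(probs)
    labels = []
    for L in range(1, length + 1):
        # Candidate terms for this step, indexed by label k.
        terms = {k: k * probs[n - k - 1] ** L for k in range(n)}
        k_star = max((k for k in terms if terms[k] <= tau), key=terms.get)
        labels.append(k_star)
        tau -= terms[k_star]
    return labels


# Probabilities inferred from the sets {0.3, 0.4, 0.2, 0}, {0.03, 0.08, 0.04, 0}, ...
probs = [Fraction(1, 10), Fraction(1, 5), Fraction(1, 5), Fraction(1, 2)]
tau = Fraction(4400003, 10_000_000)
assert decode_largest_term(tau, probs, 7) == [2, 1, 0, 0, 0, 0, 3]
```

For the source of Example one the terms increase with k at every step, so this reading and the literal largest-index reading select the same label there.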

Claims

1. A lossless data compression coding method based on arithmetic coding, characterized in that the coding method comprises the following procedure:

1) Encoding: assume the probability distribution of the n source symbols is $\{0 < p_1 \le p_2 \le \dots \le p_n < 1\}$, arranged in monotonically increasing order of subscript; the n symbols are relabeled $\{n-1, n-2, \dots, 1, 0\}$, so that the symbol with probability $p_i$ receives the label $n-i$, $i = 1, 2, \dots, n$; assume the input is a character string $J_1 J_2 J_3 \dots J_{K-1} J_K$ of length K, where $J_i \in \{0, 1, 2, \dots, n-1\}$; the string is then defined to correspond to the rational number $\tau = \sum_{L=1}^{K} J_L \, p_{n-J_L}^{L}$ in the unit interval [0,1];

Regarding τ as a decimal number, which is proved mathematically to lie in the unit interval [0,1], convert it to its binary representation; the resulting binary sequence is the output bit stream of the encoder;

2) Decoding: convert the bit stream into the corresponding decimal rational number in the unit interval [0,1], denoted τ; set L = 1 and $\tau_1 = \tau$, and decode in the following three steps:

2.1) In the set $\{k \, p_{n-k}^{L} : k = 0, 1, \dots, n-1\}$, find the largest k satisfying $\tau_L \ge k \, p_{n-k}^{L}$, denoted $k^*$; a k satisfying $\tau_L \ge k \, p_{n-k}^{L}$ always exists, since at least k = 0 always qualifies, hence a largest $k^*$ exists every time; the resulting $J_1 J_2 J_3 \dots J_{K-1} J_K$ is the decoded result;

2.2) Set $J_L = k^*$;

2.3) If L < K, set $\tau_{L+1} = \tau_L - k^* p_{n-k^*}^{L}$ and L = L + 1, and return to step 2.1); otherwise stop.