CN101951516A

CN101951516A - Parallel encoding realization circuit and encoding method based on CABAC (Context-based Adaptive Binary Arithmetic Coding) in H.264/AVC (Advanced Video Coding)

Info

Publication number: CN101951516A
Application number: CN 201010291264
Authority: CN
Inventors: 刘振宇; 汪东升
Original assignee: Tsinghua University
Current assignee: CERTUSNET CORP.
Priority date: 2010-09-25
Filing date: 2010-09-25
Publication date: 2011-01-19
Anticipated expiration: 2030-09-25
Also published as: CN101951516B

Abstract

The invention discloses a parallel encoding implementation circuit and an encoding method based on CABAC (Context-based Adaptive Binary Arithmetic Coding) in H.264/AVC (Advanced Video Coding). The circuit comprises a binarization engine, a context model engine, a parallel normalization engine and an RBSP (Raw Byte Sequence Payload) bitstream generation engine, wherein the binarization engine executes a parallel normalization operation; the context model engine executes context read and update operations for two bits per cycle; the parallel normalization engine executes the normalization operation for two bits per cycle; and the RBSP bitstream generation engine generates the RBSP output bitstream. The binarization engine and the context model engine are connected by a 3-write, 2-read first-in, first-out (FIFO) queue; the parallel normalization engine and the RBSP bitstream generation engine are connected by a 2-write, 1-read FIFO queue. The invention matches the processing speeds of the binarization engine, the normalization engine and the RBSP bitstream generation engine, avoids pipeline stalls, and solves both the unbalanced throughput among the processing stages and the computational bottleneck caused by the dependency among coding range normalization, coding low normalization and bitstream generation.

Description

Parallel encoding implementation circuit and encoding method based on CABAC in H.264/AVC

Technical field

The present invention relates to the field of video encoding, and in particular to a parallel encoding implementation circuit and an encoding method based on CABAC in H.264/AVC.

Background technology

The H.264/AVC Main Profile adopts context-based adaptive binary arithmetic coding (CABAC). Tests show that, compared with context-adaptive variable-length coding (CAVLC), CABAC improves picture quality by 0.3-0.6 dB at the same bit rate. In high-definition applications, the weakness of the CABAC coding algorithm is its low throughput.

Fig. 1 shows the processing block diagram of CABAC. In the first step, syntax elements that are not binary-valued are binarized, while syntax elements that are already binary-valued are passed directly to the coding unit. In the second step, each bit of the binarized bit stream is coded by adaptive binary arithmetic coding according to its probability distribution: bits whose probability distribution depends on context are coded in regular mode, whereas bits with a uniform probability distribution are coded in bypass mode. A bit coded in regular mode first fetches its context information from the context model, namely its probability state index pStateIdx[5:0] and its most probable symbol value valMPS; the corresponding context model is then updated according to the value of the bit being coded. pStateIdx[5:0], valMPS and the value binVal of the bit being coded are fed to the regular-mode coding engine, which updates the coding range R[8:0] and the coding low L[9:0]. Here R[8:0] denotes a 9-bit signal whose most significant bit is numbered 8 and whose least significant bit is numbered 0; other variables written in the same form follow the same notation. Normalizing the updated coding range R[8:0] and coding low L[9:0] produces the raw byte sequence payload (RBSP) bitstream. The regular-mode coding flow is shown in Fig. 9-7 of Reference 1 (T. Wiegand, G. Sullivan, and A. Luthra, "Draft ITU-T recommendation and final draft international standard of joint video specification (ITU-T Rec. H.264 - ISO/IEC 14496-10 AVC)," JVT-G050r1, May 2003). Unlike regular mode, bypass mode uses fixed context information and needs neither to look up nor to update the context model.

Implementing the CABAC algorithm of the H.264/AVC standard presents the following difficulties:

1. Fine computation granularity: the normalization operation and the output bitstream generation flow in the H.264/AVC standard are shown in Figs. 9-8, 9-9 and 9-10 of Reference 1, where normalization operates on one bit at a time.

2. During the normalization of each bit, the normalization of the coding range and coding low is tightly coupled with the generation of output bits. If R' denotes the coding range before normalization, then $8 - \lfloor \log_2 R' \rfloor$ iterations are needed to complete its normalization and produce the corresponding output bitstream, where $\lfloor \cdot \rfloor$ denotes rounding down. To handle carry propagation during normalization, the H.264/AVC standard introduces the OutstandingBits variable (OB). If OB is not equal to 0, the output bitstream generation unit needs multiple cycles, which further reduces the efficiency of the CABAC coding engine.

3. The output of syntax element binarization is a variable-length code, and some syntax elements produce only a single bit. This makes it difficult to build a coding engine with a constant rate greater than 1 bit per cycle, so the throughput of the binarization engine and the arithmetic coding engine must be balanced.
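The iteration count in difficulty 2 can be checked numerically. The following is a minimal sketch, assuming that bit-serial normalization shifts the 9-bit coding range left until bit 8 is set (the behaviour of the standard renormalization flow); the function names are illustrative and do not come from the patent.

```python
def renorm_iterations(r: int) -> int:
    """Count the bit-serial normalization loops needed for a 9-bit range r."""
    if not 0 < r < 512:
        raise ValueError("range must be non-zero and fit in 9 bits")
    count = 0
    while r < 256:  # keep shifting until the MSB (bit 8) of the range is set
        r <<= 1
        count += 1
    return count


def renorm_iterations_closed_form(r: int) -> int:
    """Closed form 8 - floor(log2(r)); bit_length avoids floating point."""
    return max(0, 8 - (r.bit_length() - 1))


if __name__ == "__main__":
    for r in (6, 100, 255, 256, 511):
        print(r, renorm_iterations(r), renorm_iterations_closed_form(r))
```

A small range such as 6 costs six serial iterations in a one-bit-per-cycle design, which is the variable latency a combinational normalization stage is meant to remove.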

Summary of the invention

(1) Technical problem to be solved

In view of the defects and deficiencies of the prior art, the purpose of the present invention is to provide a CABAC encoding circuit and encoding method for the H.264/AVC video coding standard with a constant throughput of 2 bits per cycle, which: first, matches the processing speed of the binarization engine to that of the normalization engine and the RBSP bitstream generation engine; second, resolves the throughput imbalance among the processing stages and avoids pipeline stalls; and third, removes the computational bottleneck in the CABAC algorithm caused by the dependency among coding range normalization, coding low normalization and bitstream generation.

(2) Technical solution

To solve the above technical problem, the invention provides a parallel encoding implementation circuit based on CABAC in H.264/AVC, comprising: a first pipeline stage, which is a binarization engine for performing parallel normalization computation; a second pipeline stage, which is a context model engine for performing context read and update operations for two bits per cycle; a third pipeline stage, which is a parallel normalization engine for performing the normalization of two bits per cycle; and a fourth pipeline stage, which is an RBSP bitstream generation engine for producing the raw byte sequence payload (RBSP) output bitstream. The binarization engine and the context model engine are connected by a 3-write, 2-read FIFO queue; the parallel normalization engine and the RBSP bitstream generation engine are connected by a 2-write, 1-read FIFO queue.

The binarization engine is a discrete cosine transform/quantization (DCT/Q) coefficient binarization engine based on a ping-pong storage structure, and performs coefficient scanning and binarization coding in parallel.

The input signals of the binarization engine comprise the value of the syntax element currently being processed, Cur.SE; the values of the neighboring syntax elements related to it, NeighborSEs; the number of free storage units in the 3-write, 2-read FIFO queue, hole_num[2:0]; and the binarization engine control information, Ctrl.Info. The output signals of the binarization engine comprise the 3-bit binarization output {binVali | i ∈ {0, 1, 2}}, the context index of each output bit {ctxIdxi[7:0] | i ∈ {0, 1, 2}}, and the total number w_num[1:0] of binarization output bits and associated context indexes written to the 3-write, 2-read FIFO queue. When w_num[1:0] is not equal to 0, the binarization engine writes {binVali, ctxIdxi[7:0]} for every i less than w_num[1:0] to the downstream FIFO queue, where i is 0, 1 or 2. A variable of the form a[b:c] denotes a signal named a that is b+1 bits wide, with most significant bit numbered b and least significant bit numbered c.

The circuit structure of the binarization engine has the following features: 1) during coefficient scanning, the circuit reads the DCT/Q coefficients of a 4x4 block and writes them into the ping-pong storage structure in linearly increasing address order; 2) during coefficient scanning, it simultaneously records a 15-bit flag vector and the index of the last non-zero coefficient; 3) the flag vector and the index of the last non-zero coefficient are used for binarization coding of the significance map (significant_map), which is generated dynamically during coding from the flag vector register and the index of the last non-zero coefficient; 4) during binarization coding of the 4x4 block DCT/Q coefficients, the circuit generates the read address of each non-zero coefficient in a single step from the flag vector and the index of the last non-zero coefficient.

The context read and update operations of the context model engine are implemented with registers from a standard cell library, using the following design: the context models are classified by the slice type to which they belong; the contexts belonging to the same slice type are stored in a storage unit with 2 read ports and 2 write ports, and the remaining context model information is stored in single-port on-chip memory. When the slice type changes, the contents of the 2-read, 2-write storage unit are updated at a rate of 2 contexts per cycle.

The parallel normalization engine consists of two cascaded single-cycle normalization engines and normalizes two bits per cycle. Its input signals comprise binVal0, valMPS0, pStateIdx0[5:0], valid0, mode0, binVal1, valMPS1, pStateIdx1[5:0], valid1 and mode1, where binVal0 and binVal1 are the values of the bits being processed; valMPS0 and valMPS1 are the most probable symbol values; pStateIdx0[5:0] and pStateIdx1[5:0] are the probability state indexes; valid0 and valid1 indicate whether the bits being processed are valid; and mode0 and mode1 give the coding mode of the bits being processed (0 for regular mode, 1 for bypass mode). The suffixes 0 and 1 of the input signals distinguish the order of the bits being processed. The output signals of the parallel normalization engine are OB0[7:0], β0[2:0], L0[6:0], we0, OB1[7:0], β1[2:0], L1[6:0] and we1. When wei (i = 0 or 1) is 1, the corresponding outputs OBi[7:0], βi[2:0] and Li[6:0] are written to the downstream 2-write, 1-read FIFO queue, from which the next-stage engine generates the RBSP bitstream.

The parallel normalization engine comprises an OB[7:0] register, which stores the current value of the variable OB; an R[8:0] coding range register, which stores the current coding range; and an L[9:0] coding low register, which stores the current coding low. The 2-write, 1-read FIFO queue has a depth of 10, and each entry is 18 bits wide. In regular mode, bit field [17:11] stores the upper 7 bits of the updated coding low; in bypass mode, bit field [17] stores the most significant bit of the updated coding low and bit field [16:11] is unused. Bit field [10:3] stores OB[7:0], and bit field [2:0] stores the variable β[2:0]; the stored OB[7:0] and β[2:0] are the output signals OBi[7:0] and βi[2:0] (i = 0 or 1) of the parallel normalization engine. In each cycle, the parallel normalization engine writes at most two normalization results at the tail of the queue; at the same time, when the queue is not empty, the RBSP bitstream generation engine reads the entry pointed to by the head pointer. Here bit field [b:c] denotes bits c through b, where b and c are integers, b is the most significant bit number and c is the least significant bit number.
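The 18-bit entry layout described above can be illustrated with a small packing helper. This is a sketch only: the function names and the `bypass` argument are hypothetical, and the layout ([17:11] coding low bits, [10:3] OB, [2:0] β) is taken directly from the text.

```python
def pack_entry(low_bits: int, ob: int, beta: int, bypass: bool = False) -> int:
    """Pack one 18-bit queue entry: [17:11] low bits, [10:3] OB, [2:0] beta."""
    if not (0 <= ob < 256 and 0 <= beta < 8):
        raise ValueError("OB is 8 bits wide and beta is 3 bits wide")
    if bypass:
        # Bypass mode: only bit 17 carries the MSB of the updated coding low;
        # bits [16:11] are unused and are left as zero here.
        field = (low_bits & 1) << 6
    else:
        # Regular mode: the upper 7 bits of the updated coding low.
        field = low_bits & 0x7F
    return (field << 11) | (ob << 3) | beta


def unpack_entry(entry: int) -> tuple:
    """Return (low_field, ob, beta) extracted from an 18-bit entry."""
    return (entry >> 11) & 0x7F, (entry >> 3) & 0xFF, entry & 0x7


if __name__ == "__main__":
    entry = pack_entry(0b1010101, ob=3, beta=2)
    print(f"{entry:018b}", unpack_entry(entry))
```

Packing the three fields into one word lets the downstream stage consume a complete normalization result with a single read per cycle.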

The RBSP bitstream generation engine can produce multiple output bits per cycle and comprises a prefix bit output engine and a suffix bit output engine. When the 2-write, 1-read FIFO queue is not empty, the engine generates the RBSP bitstream from the information in the head entry of the queue. The prefix bit output engine takes the most significant bit of the head entry and the stored value of OB[7:0], and generates a bit string consisting of that most significant bit followed by OB[7:0] bits, each equal to the inverse of the most significant bit. The suffix bit output engine generates the bit string formed by input bit field [16:16-β[2:0]+1] and appends it to the RBSP bitstream. Data written to the RBSP bitstream is output byte-aligned.
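The prefix and suffix generation can be sketched in software as follows. This is an illustrative model, not the patent's circuit: the function names are hypothetical, bit strings stand in for hardware buses, β is assumed to be at most 6 so that the suffix stays inside bit field [16:11], and special cases such as the first output bit of a slice are not modelled.

```python
def entry_to_bits(entry: int) -> str:
    """Expand one 18-bit queue entry into the bits it contributes to the RBSP."""
    msb = (entry >> 17) & 1
    ob = (entry >> 3) & 0xFF
    beta = entry & 0x7
    if beta > 6:
        raise ValueError("assumed limit: beta must not exceed 6")
    # Prefix: the MSB followed by OB copies of its inverse.
    prefix = str(msb) + str(1 - msb) * ob
    # Suffix: bit field [16 : 16-beta+1], most significant bit first.
    suffix = "".join(str((entry >> pos) & 1) for pos in range(16, 16 - beta, -1))
    return prefix + suffix


def pack_bytes(bits: str) -> tuple:
    """Split a bit string into whole bytes plus the leftover (unaligned) bits."""
    whole = len(bits) // 8 * 8
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, whole, 8))
    return data, bits[whole:]


if __name__ == "__main__":
    entry = (0b1011000 << 11) | (2 << 3) | 3  # low field, OB = 2, beta = 3
    print(entry_to_bits(entry))
```

Because OB and β are stored alongside the coding low bits, every bit an entry contributes can be derived from that entry alone, without iterating one bit per cycle.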

In addition, the present invention provides a parallel encoding method based on CABAC in H.264/AVC, implemented with the above circuit, comprising the following steps:

the binarization engine performs DCT/Q coefficient scanning and binarization coding of the DCT/Q coefficients in parallel;

the context model engine performs context read and update operations for two bits per cycle according to the output signals of the 3-write, 2-read FIFO queue;

the parallel normalization engine performs the normalization of the coding range and coding low in regular and bypass coding modes respectively;

the RBSP bitstream generation engine produces the RBSP output bitstream according to the output signals of the 2-write, 1-read FIFO queue.

(3) Beneficial effects

Compared with the prior art, the present invention produces the following beneficial effects:

First, a binarization acceleration engine is proposed that produces 1 to 3 bits of binarization output per cycle. Specifically, binarization of the 4x4 block DCT/Q coefficients is based on a ping-pong storage structure in which coefficient scanning and binarization coding work concurrently, thereby sustaining a constant processing speed of 2 bits per cycle. The binarization engine and the following processing engine are connected by a 6-entry, 3-write, 2-read FIFO queue, which balances the processing speeds of the preceding and following stages so that the binarization engine matches the processing speed of the subsequent normalization engine and RBSP bitstream generation engine;

Second, the normalization of every bit is handled by combinational logic, which avoids the pipeline stalls introduced by multi-cycle normalization in the prior art. Parallel normalization and RBSP bitstream generation are split into two pipeline stages connected by a 10-entry, 2-write, 1-read FIFO queue, a structure that effectively avoids pipeline stalls;

Finally, the throughput of the proposed circuit is constant at 2 bits per clock cycle and is independent of the probability of occurrence of the least probable symbol in the processed bit stream. This removes the computational bottleneck in the CABAC algorithm caused by the dependency among coding range normalization, coding low normalization and bitstream generation.

Description of drawings

Fig. 1 is a block diagram of the existing H.264/AVC CABAC system;

Fig. 2 is a block diagram of the overall circuit architecture of an embodiment of the invention;

Fig. 3 is a circuit diagram of the parallel binarization engine based on ping-pong storage in an embodiment of the invention;

Fig. 4 is a circuit diagram of the parallel normalization engine of an embodiment of the invention;

Fig. 5 is a circuit diagram of the regular-mode coding range and coding low update engine of an embodiment of the invention;

Fig. 6 is a circuit diagram of the generation of the variables OB and β in the regular-mode following-bit update engine of an embodiment of the invention.

Embodiment

The specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples illustrate the invention but do not limit its scope.

The present invention is applicable to the design and implementation of real-time CABAC coding engines for H.264/AVC.

Fig. 2 shows the overall circuit architecture of an embodiment of the invention. The top-level pipeline structure of the CABAC encoder and the input/output variables of each pipeline stage are defined first:

The CABAC encoder uses a 4-stage pipeline consisting, in order, of (1) the binarization engine; (2) the context model engine; (3) the parallel normalization engine; and (4) the RBSP bitstream generation engine. To keep throughput balanced between the pipeline engines, the binarization engine and the context model engine are connected by a 3-write, 2-read FIFO queue with 6 storage units, and the parallel normalization engine and the RBSP bitstream generation engine are connected by a 2-write, 1-read FIFO queue with 10 storage units. Each pipeline stage is described in turn below:

The first pipeline stage is the binarization engine. Its input signals from outside the parallel encoding module (i.e. the CABAC encoder) comprise:

1. Cur.SE: the value of the syntax element being processed;

2. NeighborSEs: the values of the neighboring syntax elements related to the syntax element being processed;

3. Ctrl.Info: the binarization engine control information, including the module reset signal and the coding enable signal.

The input signals of the binarization engine from the 3-write, 2-read FIFO queue comprise:

1. hole_num[2:0]: the number of free units currently in the 3-write, 2-read FIFO queue. hole_num[2:0] denotes a 3-bit signal whose most significant bit is numbered 2 and whose least significant bit is numbered 0; other variables written in the same form follow the same notation. For example, ctxIdx[7:0] denotes an 8-bit signal whose most significant bit is numbered 7 and whose least significant bit is numbered 0.

The signals that the binarization engine outputs to the 3-write, 2-read FIFO queue comprise:

1. binVal0: the value of bit 0 generated in this cycle;

2. ctxIdx0[7:0]: the context index of bit 0 generated in this cycle; ctxIdx0[7:0] equal to 255 indicates that binVal0 uses bypass coding mode;

3. binVal1: the value of bit 1 generated in this cycle;

4. ctxIdx1[7:0]: the context index of bit 1 generated in this cycle; ctxIdx1[7:0] equal to 255 indicates that binVal1 uses bypass coding mode;

5. binVal2: the value of bit 2 generated in this cycle;

6. ctxIdx2[7:0]: the context index of bit 2 generated in this cycle; ctxIdx2[7:0] equal to 255 indicates that binVal2 uses bypass coding mode;

7. w_num[1:0]: the number of entries, counted from bit 0, written to the 3-write, 2-read FIFO queue. When w_num[1:0] is 0, no data is written. When w_num[1:0] is 1, binVal0 and ctxIdx0[7:0] are written to the storage unit indicated by the tail pointer. When w_num[1:0] is 2, binVal0 and ctxIdx0[7:0] are written to the storage unit indicated by the tail pointer, and binVal1 and ctxIdx1[7:0] to the storage unit indicated by the tail pointer plus one. When w_num[1:0] is 3, binVal0 and ctxIdx0[7:0] are written to the storage unit indicated by the tail pointer, binVal1 and ctxIdx1[7:0] to the storage unit indicated by the tail pointer plus one, and binVal2 and ctxIdx2[7:0] to the storage unit indicated by the tail pointer plus two.
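The handshake between the binarization engine and the context model engine can be modelled behaviourally. The sketch below assumes a 6-entry queue that accepts up to 3 writes and serves up to 2 reads per cycle; the class and method names are hypothetical, while `hole_num`, `item_num`, `w_num` and `r_num` mirror the signal names in the text.

```python
from collections import deque

BYPASS_CTX = 255  # ctxIdx value that marks a bypass-coded bin


class MultiPortFifo:
    """Behavioural model of a 3-write, 2-read FIFO with 6 storage units."""

    def __init__(self, depth=6, max_write=3, max_read=2):
        self.depth = depth
        self.max_write = max_write
        self.max_read = max_read
        self.items = deque()

    @property
    def hole_num(self):
        """Number of free units, as seen by the binarization engine."""
        return self.depth - len(self.items)

    @property
    def item_num(self):
        """Number of valid entries, as seen by the context model engine."""
        return len(self.items)

    def write(self, bins, w_num):
        """Write the first w_num (binVal, ctxIdx) pairs at the tail."""
        if w_num > min(self.max_write, self.hole_num, len(bins)):
            raise ValueError("w_num exceeds write ports or free units")
        self.items.extend(bins[:w_num])

    def read(self, r_num):
        """Pop r_num entries from the head."""
        if r_num > min(self.max_read, self.item_num):
            raise ValueError("r_num exceeds read ports or valid entries")
        return [self.items.popleft() for _ in range(r_num)]


if __name__ == "__main__":
    fifo = MultiPortFifo()
    fifo.write([(1, 10), (0, BYPASS_CTX), (1, 11)], w_num=3)
    print(fifo.read(2), fifo.hole_num, fifo.item_num)
```

Writing up to three entries while draining two per cycle lets the queue absorb cycles in which the producer emits only one bit, which is how the structure evens out throughput between the stages.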

The second pipeline stage is the context model engine. Its input signals from the 3-write, 2-read FIFO queue comprise:

1. item_num[2:0]: the number of valid bits currently in the 3-write, 2-read FIFO queue;

2. binVal0: the bit value in the unit pointed to by the head pointer of the 3-write, 2-read FIFO queue;

3. ctxIdx0[7:0]: the context index in the unit pointed to by the head pointer of the 3-write, 2-read FIFO queue;

4. binVal1: the bit value in the unit pointed to by the head pointer plus one of the 3-write, 2-read FIFO queue;

5. ctxIdx1[7:0]: the context index in the unit pointed to by the head pointer plus one of the 3-write, 2-read FIFO queue.

The signals that the context model engine outputs to the 3-write, 2-read FIFO queue comprise:

1. r_num[1:0]: the number of units read from the 3-write, 2-read FIFO queue.

The signals with suffix 0 that the context model engine outputs to the parallel normalization engine are produced from the first bit value read and its associated context index, and comprise:

1. valid0: indicates whether the first bit read is valid (1: valid; 0: invalid);

2. binVal0: the value of the first bit;

3. mode0: mode0 = 0 indicates that the first bit is coded in regular mode; mode0 = 1 indicates that the first bit is coded in bypass mode;

4. pStateIdx0[5:0]: the probability state index of the most probable symbol for the first bit;

5. valMPS0: the binary value of the most probable symbol for the first bit.

The signals with suffix 1 that the context model engine outputs to the parallel normalization engine are produced from the second bit value read and its associated context index, and comprise valid1, binVal1, mode1, pStateIdx1[5:0] and valMPS1, with the same meanings as the signals above.

The third pipeline stage is the parallel normalization engine. The signals it outputs to the 2-write, 1-read FIFO queue comprise:

1. we0: when we0 equals 1, normalization of the first bit produces RBSP bits, and the output values OB0[7:0], β0[2:0] and L0[6:0] must be written to the 2-write, 1-read FIFO queue; when we0 equals 0, no RBSP bits are produced, and the output values OB0[7:0], β0[2:0] and L0[6:0] are not written to the queue;

2. L0[6:0]: used to produce the RBSP bitstream;

3. OB0[7:0]: used to produce the RBSP bitstream; it gives the number of bits with value $\overline{L0[6]}$ (the inverse of L0[6], the most significant bit of L0[6:0]) that follow L0[6];

4. β0[2:0]: used to produce the RBSP bitstream; a non-zero value indicates that L0[5:6-β0[2:0]] is to be output to the RBSP bitstream;

5. we1: when we1 equals 1, normalization of the second bit produces RBSP bits, and the output values OB1[7:0], β1[2:0] and L1[6:0] must be written to the 2-write, 1-read FIFO queue; when we1 equals 0, no RBSP bits are produced, and the output values OB1[7:0], β1[2:0] and L1[6:0] are not written to the queue;

6. L1[6:0]: used to produce the RBSP bitstream;

7. OB1[7:0]: used to produce the RBSP bitstream; it gives the number of bits with value $\overline{L1[6]}$ that follow L1[6];

8. β1[2:0]: used to produce the RBSP bitstream; a non-zero value indicates that L1[5:6-β1[2:0]] is also to be output to the RBSP bitstream.

The fourth pipeline stage is the RBSP bitstream generation engine. Its input signals from the 2-write, 1-read FIFO queue comprise:

1. valid: indicates whether the data in the storage unit pointed to by the head pointer of the 2-write, 1-read FIFO queue is valid (1: valid; 0: invalid);

2. L[6:0], OB[7:0] and β[2:0]: the data in the storage unit pointed to by the head pointer.

The signals that the RBSP bitstream generation engine outputs to the 2-write, 1-read FIFO queue comprise:

1. re: read enable signal; when re is 1, the data in the storage unit pointed to by the head pointer is popped from the queue.

The signals that the RBSP bitstream generation engine outputs to the RBSP bitstream are:

1. RBSP[7:0]: the generated byte-aligned RBSP bitstream;

2. RBSP_we: output bitstream write enable signal; when it is 1, the data on output port RBSP[7:0] is valid; otherwise, the data on output port RBSP[7:0] is invalid.

The operating principle of the circuit of the invention is described below with reference to Figs. 3 to 6.

Fig. 3 shows the binarization circuit structure for 4x4 block DCT/Q coefficients. Binarization of the 4x4 block DCT/Q coefficients is divided into two stages: a scanning stage and a binarization coding stage. In the scanning stage, the zig-zag scan address generation circuit generates read addresses for the external 4x4 block DCT/Q coefficient memory in zig-zag order. Logic circuitry derives from each coefficient read its absolute value minus one (abs_minus1[14:0]) and its sign (sign), and abs_minus1[14:0] and sign are written to a ping-pong storage unit (storage unit 0 or storage unit 1) inside the binarization engine. Because the absolute value of a DCT/Q coefficient is less than 2^15, when the borrow from the subtraction is discarded and only the low 15 bits are kept, a coefficient of 0 yields abs_minus1[14:0] equal to 2^15-1. The design uses this to determine whether the original coefficient is 0; the resulting flag bit is written to the 15-bit "flag vector register" and also serves as the clock enable signal of the "last non-zero coefficient index" register.
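The scanning stage can be sketched in software as follows. This is a hedged illustration: the function name is hypothetical, the input is assumed to be already in zig-zag order, and the mapping of coefficient i (for i from 0 to 14) to flag-vector bit 14-i is an assumption chosen to agree with the right-shift addressing scheme described for the coefficient queue; the patent text does not state the bit ordering explicitly.

```python
MASK15 = (1 << 15) - 1  # 2**15 - 1, the value abs_minus1 takes for a zero coefficient


def scan_block(coeffs_zigzag):
    """Model the scan of one 4x4 block (16 coefficients, each |c| < 2**15).

    Returns (entries, flag_vector, last_nonzero_index), where entries holds
    the (abs_minus1, sign) pair stored for each coefficient.
    """
    entries = []
    flag_vector = 0
    last_nonzero = None
    for idx, c in enumerate(coeffs_zigzag):
        abs_minus1 = (abs(c) - 1) & MASK15  # zero wraps around to 2**15 - 1
        sign = 1 if c < 0 else 0
        nonzero = abs_minus1 != MASK15      # the zero test used by the design
        entries.append((abs_minus1, sign))
        if nonzero and idx < 15:
            flag_vector |= 1 << (14 - idx)  # assumed bit ordering
        if nonzero:
            last_nonzero = idx              # acts like the register clock enable
    return entries, flag_vector, last_nonzero


if __name__ == "__main__":
    block = [3, 0, -1, 0, 0, 2] + [0] * 10
    entries, vec, last = scan_block(block)
    print(entries[:3], f"{vec:015b}", last)
```

Reusing the wrapped value 2^15-1 as the zero marker means no extra zero comparator is needed on the coefficient itself.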

After the scan of a 4x4 block of DCT/Q coefficients completes, the binarization coding stage begins. This stage first binarizes the significance map (significant_map). In the present invention, significant_map is generated dynamically during coding from the "flag vector" register and the "last non-zero coefficient index" register (see Fig. 3). The flag vector and the last non-zero coefficient index are fed as input signals to the "significant_map binarization engine", which produces at most 3 bits of binarization output per cycle. It works as follows. The engine uses a "current flag bit index" variable to indicate the index in the flag vector at which this cycle's binarization starts. The flag bit at that index is "flag bit 0", the flag bit at "current flag bit index" plus one is "flag bit 1", and the flag bit at "current flag bit index" plus two is "flag bit 2". The engine also keeps a variable "previous-cycle last non-zero flag pending". When this variable is 1, the last flag bit coded in the previous cycle was 1 and its "last non-zero flag" has not yet been written to the 3-write, 2-read FIFO queue, so that flag must be output in the current cycle. Depending on the values of "previous-cycle last non-zero flag pending", "flag bit 0" and "flag bit 1", the binarization output is handled in five cases:

1. "Previous-cycle last non-zero flag pending" equals 0, "flag bit 0" equals 0 and "flag bit 1" equals 0: outputs binVal0 and ctxIdx0[7:0] are determined by "flag bit 0", outputs binVal1 and ctxIdx1[7:0] by "flag bit 1", and outputs binVal2 and ctxIdx2[7:0] by "flag bit 2";

2. "Previous-cycle last non-zero flag pending" equals 0, "flag bit 0" equals 0 and "flag bit 1" equals 1: outputs binVal0 and ctxIdx0[7:0] are determined by "flag bit 0", and outputs binVal1 and ctxIdx1[7:0] by "flag bit 1"; outputs binVal2 and ctxIdx2[7:0] carry the "last non-zero flag" of "flag bit 1": if "current flag bit index" plus one equals "last non-zero coefficient index", binVal2 equals 1, otherwise binVal2 equals 0;

3. "Previous-cycle last non-zero flag pending" equals 0 and "flag bit 0" equals 1: outputs binVal0 and ctxIdx0[7:0] are determined by "flag bit 0"; outputs binVal1 and ctxIdx1[7:0] carry the "last non-zero flag" of "flag bit 0": if "current flag bit index" equals "last non-zero coefficient index", binVal1 equals 1, otherwise binVal1 equals 0; outputs binVal2 and ctxIdx2[7:0] are determined by "flag bit 1";

4. "Previous-cycle last non-zero flag pending" equals 1 and "flag bit 0" equals 0: outputs binVal0 and ctxIdx0[7:0] carry the "last non-zero flag" of the bit preceding "flag bit 0": if "current flag bit index" minus one equals "last non-zero coefficient index", binVal0 equals 1, otherwise binVal0 equals 0; outputs binVal1 and ctxIdx1[7:0] are determined by "flag bit 0", and outputs binVal2 and ctxIdx2[7:0] by "flag bit 1";

5. "Previous-cycle last non-zero flag pending" equals 1 and "flag bit 0" equals 1: outputs binVal0 and ctxIdx0[7:0] carry the "last non-zero flag" of the bit preceding "flag bit 0": if "current flag bit index" minus one equals "last non-zero coefficient index", binVal0 equals 1, otherwise binVal0 equals 0; outputs binVal1 and ctxIdx1[7:0] are determined by "flag bit 0"; outputs binVal2 and ctxIdx2[7:0] carry the "last non-zero flag" of "flag bit 0": if "current flag bit index" equals "last non-zero coefficient index", binVal2 equals 1, otherwise binVal2 equals 0.

Note that not all of the output signals listed in the five cases above are written to the 3-write, 2-read FIFO queue; the number of binVali and ctxIdxi[7:0] pairs actually written is controlled by w_num[1:0].
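The five cases can be expressed as a small selection function. The sketch below returns only the three candidate bins for one cycle; it does not model the ctxIdx values or the choice of w_num, and the function name and the ("sig", value) / ("last", value) tuples are illustrative conventions rather than anything defined in the patent.

```python
def significant_map_candidates(pending_last, flags, cur_idx, last_idx):
    """Return the three candidate bins for one cycle as (kind, value) pairs.

    pending_last: 1 if a last non-zero flag is left over from the previous cycle.
    flags:        (flag bit 0, flag bit 1, flag bit 2) read from the flag vector.
    cur_idx:      current flag bit index.
    last_idx:     last non-zero coefficient index.
    """
    f0, f1, f2 = flags

    def sig(value):
        return ("sig", value)

    def last(idx):
        return ("last", int(idx == last_idx))

    if not pending_last:
        if f0 == 0 and f1 == 0:
            return [sig(f0), sig(f1), sig(f2)]            # case 1
        if f0 == 0:
            return [sig(f0), sig(f1), last(cur_idx + 1)]  # case 2
        return [sig(f0), last(cur_idx), sig(f1)]          # case 3
    if f0 == 0:
        return [last(cur_idx - 1), sig(f0), sig(f1)]      # case 4
    return [last(cur_idx - 1), sig(f0), last(cur_idx)]    # case 5


if __name__ == "__main__":
    print(significant_map_candidates(0, (0, 1, 0), cur_idx=4, last_idx=5))
```

In every case at least two of the three candidates are available, which is consistent with the stated goal of sustaining two bins per cycle.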

During the binarization of abs_minus1[14:0] and sign, the "indicator vector" register and the "last nonzero coefficient index value" register are used together to generate the read address of the nonzero coefficients stored in the "coefficient queue". The initial value of the "coefficient queue read address" is set to the "last nonzero coefficient index value", so that the read address points to the last nonzero coefficient in the "coefficient queue". In the initial stage, the "indicator vector" register is shifted right by n bits, where n equals 15 minus the "last nonzero coefficient index value". After the binarization of abs_minus1[14:0] and sign has been completed for one coefficient, tz+1 is subtracted from the coefficient queue read address (tz is the number of trailing zeros in the current "indicator vector" register, i.e. the number of consecutive 0 bits at its least significant end), so that the read address points to the next nonzero coefficient in the "coefficient queue". The "indicator vector" register is then shifted right by tz+1 bits. This process continues until the current coefficient queue read address is less than tz+1.
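The address generation described above can be sketched in Python. The bit layout is an assumption inferred from the shift amounts in the text: bit 14-i of the 15-bit indicator vector is taken to flag coefficient i (i = 0..14) as nonzero, and the trailing-zero count of an all-zero vector is taken to be 15. The function names are hypothetical.

```python
def trailing_zeros(vec, width=15):
    """Number of consecutive 0 bits at the least significant end of vec."""
    if vec == 0:
        return width  # assumption: an empty vector reports the full width
    return (vec & -vec).bit_length() - 1

def nonzero_read_addresses(indicator, last_idx, width=15):
    """Read addresses of the nonzero coefficients, last one first."""
    addr = last_idx                       # initial read address
    vec = indicator >> (width - last_idx)  # initial right shift by n = 15 - last_idx
    addrs = [addr]
    while True:
        tz = trailing_zeros(vec, width)
        if addr < tz + 1:                 # termination condition from the text
            break
        addr -= tz + 1                    # jump to the next nonzero coefficient
        vec >>= tz + 1
        addrs.append(addr)
    return addrs

# Nonzero coefficients at indices 0, 3, 4 and 9 (bit 14 - i flags index i).
indicator = (1 << 14) | (1 << 11) | (1 << 10) | (1 << 5)
print(nonzero_read_addresses(indicator, last_idx=9))  # [9, 4, 3, 0]
```

Each iteration needs only one trailing-zero count and one subtraction, which is why the read address of the next nonzero coefficient can be produced without scanning the zero coefficients in between.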

The context model stage reads and updates the contexts of 2 bits per cycle, so the storage circuit that holds the context models must have 2 read ports and 2 write ports. It is implemented with registers from the standard cell library, which has the drawback of relatively high power consumption and chip area. To reduce the power consumption and chip area of the context model circuit, the design proposed by the invention divides the 399 context models into 3 classes according to the slice type (slice mode) they belong to: SI/I, SP/P and B (the slice type names specified in the H.264 standard). The context models belonging to the same slice type are stored in the 2-read 2-write memory unit, and the other context model information is stored in single-port on-chip memory. When the slice type changes, the context models in the 2-read 2-write memory unit must be updated. This design relies on the facts that a change of slice type generally occurs only at the start of coding a frame/field, and that 237 contexts are shared among the different slice types. During the update, 2 contexts can be updated per cycle, and the whole process takes no more than 69 cycles. This method effectively reduces the chip area and the power cost of reading and updating the context model memory unit.

The block diagram of the 2-bit parallel normalization circuit is shown in Fig. 4. It contains the following functional components: the OB[7:0] register ("OB" in the figure), which stores the current value of the variable OB; the R[8:0] coding interval register ("R" in the figure), which stores the current value of the coding interval variable; and the L[9:0] coding lower limit register ("L" in the figure), which stores the current value of the coding lower limit variable.

The update engines for the coding interval and the coding lower limit comprise "L&R update engine 0" and "L&R update engine 1". "L&R update engine 0" updates the coding interval and the coding lower limit for the first bit (bit 0); its outputs after the update are R'₀[8:0] and L'₀[10:0]. This engine works in two modes: standard (mode0 ≡ 0, where "≡" means "identically equal to") and bypass (mode0 ≡ 1). In standard mode, its outputs R'₀[8:0] and L'₀[10:0] are defined as:

Here, R_LPS is obtained by table lookup from the values of R[8:0] and pStateIdx[5:0] according to the definition in Reference 1, and R_MPS = R - R_LPS.

In bypass mode, R'₀[8:0] equals its input R[8:0], and L'₀[10:0] is defined as:

"L&R update engine 1" updates the coding interval and the coding lower limit for the second bit (bit 1). Its coding interval and coding lower limit inputs are the normalized results of R'₀[8:0] and L'₀[10:0], namely the outputs R''₀[8:0] and L''₀[10:0] of "normalization engine 0". Its other inputs are the signals related to bit 1: binVal1, mode1, pStateIdx1[5:0] and valMPS1. The update algorithm of "L&R update engine 1" is identical to that of "L&R update engine 0".

The normalization engines for the coding interval and the coding lower limit comprise "normalization engine 0" and "normalization engine 1". A normalization engine has two operating modes: standard mode and bypass mode. In standard mode, its circuit block diagram is shown in Fig. 5. In bypass mode, if L'[10] ≡ 1, then L''[9:0] ≡ L'[9:0]; otherwise L''[9] = 0 and L''[8:0] ≡ L'[8:0].
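The update and bypass-mode normalization steps can be sketched in Python. The defining equations of the L&R update engine are not reproduced in this text, so the sketch assumes that the engine follows the standard H.264/AVC CABAC arithmetic encoding update; the function names are hypothetical, and R_LPS is passed in as an argument instead of reproducing the lookup table of Reference 1. Standard-mode normalization (Fig. 5) is not modeled.

```python
def lr_update(r, l, bin_val, val_mps, r_lps, bypass):
    """Return (R', L') for one bit, before normalization.

    Assumes the standard CABAC update rules; r_lps is the table value
    selected by R and pStateIdx (ignored in bypass mode).
    """
    if bypass:
        # R is unchanged; L is doubled and R is added when the bit is 1.
        return r, (l << 1) + (r if bin_val else 0)
    r_mps = r - r_lps
    if bin_val != val_mps:   # least probable symbol
        return r_lps, l + r_mps
    return r_mps, l          # most probable symbol

def normalize_bypass(l_prime):
    """Bypass-mode normalization of the 11-bit L' to the 10-bit L''."""
    if (l_prime >> 10) & 1:
        return l_prime & 0x3FF  # L''[9:0] = L'[9:0]
    return l_prime & 0x1FF      # L''[9] = 0, L''[8:0] = L'[8:0]
```

For the second bit of a cycle, the same `lr_update` would be applied to the normalized outputs of the first bit, mirroring the cascade of engine 0 and engine 1.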

The OB update engines comprise "OB update engine 0" and "OB update engine 1". An OB[7:0] update engine likewise has two operating modes: standard mode and bypass mode. In standard mode, the update circuit for the variables OB[7:0] and β[2:0] is shown in Fig. 6, where the variable n[2:0] is the output of Fig. 5 and the variable σ is the output of the leading-1 counter. In bypass mode, β[2:0] is identically 0, and OB[7:0] is updated in one of two ways: if the input L'[10:9] ≡ 01, the output is OB'[7:0] = OB[7:0] + 1; otherwise OB'[7:0] = 0. For "OB update engine 0", the input L'[10:0] comes from L'₀[10:0], the input n[2:0] comes from n₀[2:0], and the input OB[7:0] comes from the output of the OB[7:0] register; its outputs are OB'₀[7:0] and β0[2:0]. For "OB update engine 1", the input L'[10:0] comes from L'₁[10:0], the input n[2:0] comes from n₁[2:0], and the input OB[7:0] comes from OB'₀[7:0]; its outputs are OB'₁[7:0] and β1[2:0].
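The bypass-mode behaviour of the two cascaded OB update engines can be sketched in Python. The function names are hypothetical, the 8-bit width of the OB register is not modeled, and the standard-mode circuit of Fig. 6 is omitted.

```python
def ob_update_bypass(l_prime, ob):
    """Return (OB', beta) for one bypass-coded bit."""
    if ((l_prime >> 9) & 0b11) == 0b01:  # L'[10:9] == 01
        return ob + 1, 0
    return 0, 0                          # beta is always 0 in bypass mode

def ob_cascade_bypass(l0_prime, l1_prime, ob):
    """Engine 0 reads the OB register; engine 1 reads engine 0's output."""
    ob0, beta0 = ob_update_bypass(l0_prime, ob)
    ob1, beta1 = ob_update_bypass(l1_prime, ob0)
    return (ob0, beta0), (ob1, beta1)
```

Feeding OB'₀ straight into the second engine is what allows both bits to be handled within one cycle without waiting for the OB register to be written back.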

The output write-enable signal we0 equals 0 in the following cases: (1) the corresponding input signal valid0 ≡ 0, i.e. input bit 0 is invalid; (2) the input bit is coded in standard mode, L'₀[9] ≡ 0, and L'₀[8:9-n₀[2:0]] contains no 0 bit; (3) the input bit is coded in bypass mode and L'₀[10:9] ≡ 01. In all other cases we0 equals 1. When we0 equals 1, the outputs β0[2:0], OB0[7:0] and L0[6:0] are written to the storage entry indicated by the tail pointer of the downstream 2-write 1-read FIFO queue.

The output write-enable signal we1 equals 0 in the following cases: (1) the corresponding input signal valid1 ≡ 0, i.e. input bit 1 is invalid; (2) the input bit is coded in standard mode, L'₁[9] ≡ 0, and L'₁[8:9-n₁[2:0]] contains no 0 bit; (3) the input bit is coded in bypass mode and L'₁[10:9] ≡ 01. In all cases other than these three, we1 equals 1. When we1 equals 1, the outputs β1[2:0], OB1[7:0] and L1[6:0] are written to the downstream 2-write 1-read FIFO queue, and the position of the storage entry written depends on the value of we0: if we0 equals 0, the entry indicated by the tail pointer is written; otherwise the entry indicated by the tail pointer plus one is written.
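The slot selection for the two write ports can be sketched in Python. The function name is hypothetical, the queue is assumed to be a circular buffer whose tail pointer wraps at the queue depth, and full/empty handling is not modeled; we0 and we1 are taken as already computed inputs.

```python
def fifo_write(fifo, tail, we0, entry0, we1, entry1):
    """Write up to two entries per cycle and return the new tail pointer."""
    depth = len(fifo)
    if we0:
        fifo[tail] = entry0                      # port 0 always uses the tail slot
    if we1:
        slot = (tail + 1) % depth if we0 else tail
        fifo[slot] = entry1                      # port 1 uses tail or tail + 1
    return (tail + int(bool(we0)) + int(bool(we1))) % depth
```

For example, with a depth of 10 and the tail at 9, writing on both ports places the first entry in slot 9 and the second in slot 0 under the wrap-around assumption.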

The output signal L0[6:0] depends on the coding mode of bit 0: in standard coding mode, L0[6:0] equals L'₀[9:3]; otherwise, L0[6:0] equals L'₀[10]. The output signal L1[6:0] depends on the coding mode of bit 1: in standard coding mode, L1[6:0] equals L'₁[9:3]; otherwise, L1[6:0] equals L'₁[10].

When valid0 or valid1 equals 1, the values in the registers OB[7:0], R[8:0] and L[9:0] must be updated. When valid1 equals 1, the values of OB'₁[7:0], R''₁[8:0] and L''₁[9:0] are used to update these registers; otherwise, the values of OB'₀[7:0], R''₀[8:0] and L''₀[9:0] are used.

The 2-write 1-read FIFO queue has a depth of 10, and each entry is 18 bits wide. Bit field [17:11] holds the variable L[6:0]: in standard mode it stores the upper 7 bits of the updated coding lower limit; in bypass mode, bit [17] stores the most significant bit of the updated coding lower limit and bit field [16:11] is meaningless. Bit field [10:3] holds the variable OB[7:0]. Bit field [2:0] holds the variable β[2:0].
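The 18-bit entry layout can be sketched in Python. The function names are hypothetical, and filling the meaningless bits [16:11] with zeros in bypass mode is an arbitrary choice made for this illustration.

```python
def pack_entry(l_prime, ob, beta, bypass):
    """Pack L[6:0], OB[7:0] and beta[2:0] into one 18-bit queue entry."""
    if bypass:
        l_field = ((l_prime >> 10) & 1) << 6  # only bit 17 carries L'[10]
    else:
        l_field = (l_prime >> 3) & 0x7F       # L'[9:3]
    return (l_field << 11) | ((ob & 0xFF) << 3) | (beta & 0x7)

def unpack_entry(entry):
    """Return (L[6:0], OB[7:0], beta[2:0]) from an 18-bit queue entry."""
    return (entry >> 11) & 0x7F, (entry >> 3) & 0xFF, entry & 0x7
```

Packing L'[10] into the top bit of the L field keeps the reader side uniform: in both modes the first bit to emit is always bit [17] of the entry.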

The RBSP code stream generation engine takes the variables L[6:0], β[2:0] and OB[7:0] from the storage entry indicated by the head pointer of the 2-write 1-read FIFO queue and produces the RBSP code stream in byte-aligned form. The data path of the bit generation engine consists mainly of an 8-bit buffer buf[7:0], a prefix bit output engine and a suffix bit output engine. The inputs of the prefix bit output engine are L[6] and OB[7:0]; its function is to generate the bit L[6] followed by OB[7:0] bits, each equal to the inverse of L[6].

The prefix bit output engine produces at most 8 output bits in a single cycle. These output bits are concatenated after the bits already held in the buffer buf[7:0]; the first 8 bits are written to the RBSP code stream, and the remaining bits are stored back into buf[7:0]. The inputs of the suffix bit output engine are L[5:0] and β[2:0]. When the value of β[2:0] is not 0, the suffix bit output engine produces the bit string L[5:6-β[2:0]]. This output is concatenated after the bits held in the buffer; if the total number of bits is not less than 8, the first 8 bits are written to the RBSP code stream and the remaining bits are stored back into the buffer; otherwise, the concatenated bit string is stored directly into the buffer.
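The prefix/suffix bit generation and the byte-aligned buffering can be sketched in Python. The class and method names are hypothetical, β is assumed not to exceed 6 (the width of L[5:0]), and the limit of 8 prefix bits per cycle is not modeled: the sketch shows only which bits are emitted and in what order.

```python
class RbspBitWriter:
    """Behavioural sketch of the bit generation engine (not cycle-accurate)."""

    def __init__(self):
        self.buf = []            # pending bits, always fewer than 8
        self.out = bytearray()   # completed RBSP bytes

    def _push(self, bits):
        self.buf.extend(bits)
        while len(self.buf) >= 8:          # emit every complete byte
            byte = 0
            for b in self.buf[:8]:
                byte = (byte << 1) | b
            self.out.append(byte)
            del self.buf[:8]

    def consume(self, l_field, ob, beta):
        """Process one queue entry: L[6:0], OB[7:0], beta[2:0]."""
        assert 0 <= beta <= 6
        msb = (l_field >> 6) & 1
        # Prefix: L[6] followed by OB bits equal to the inverse of L[6].
        self._push([msb] + [msb ^ 1] * ob)
        if beta:
            # Suffix: L[5 : 6-beta], most significant bit first.
            self._push([(l_field >> k) & 1 for k in range(5, 5 - beta, -1)])

w = RbspBitWriter()
w.consume(0b1000000, 2, 0)   # emits 1, 0, 0
w.consume(0b0101100, 0, 3)   # emits 0, then L[5:3] = 1, 0, 1
w.consume(0b1000000, 0, 0)   # emits 1, completing the byte 0x8B
```

Keeping fewer than 8 bits in the buffer after every step is what allows a single 8-bit register buf[7:0] to carry the leftover bits between queue entries.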

The above is only an embodiment of the present invention. It should be noted that those of ordinary skill in the art can make improvements and modifications without departing from the technical principle of the invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A parallel encoding circuit based on CABAC in H.264/AVC, characterized by comprising: a first pipeline stage, being a binarization engine for performing a parallel normalization operation; a second pipeline stage, being a context model engine for performing context read and update operations for two bits per cycle; a third pipeline stage, being a parallel normalization engine for performing the normalization operation for two bits per cycle; and a fourth pipeline stage, being an RBSP code stream generation engine for producing the raw byte sequence payload (RBSP) output code stream; wherein the binarization engine and the context model engine are connected by a 3-write 2-read FIFO queue, and the parallel normalization engine and the RBSP code stream generation engine are connected by a 2-write 1-read FIFO queue.

2. The parallel encoding circuit based on CABAC in H.264/AVC as claimed in claim 1, characterized in that the binarization engine is a discrete cosine transform/quantization (DCT/Q) coefficient binarization engine based on a ping-pong storage structure and is used to perform coefficient scanning and binarization coding in parallel.

3. The parallel encoding circuit based on CABAC in H.264/AVC as claimed in claim 1, characterized in that the input signals of the binarization engine comprise the value of the syntax element currently being processed, Cur.SE; the values of the neighbouring syntax elements related to that syntax element, Neighbor SEs; the number of free storage entries in the 3-write 2-read FIFO queue, hole_num[2:0]; and the related binarization engine control information, Ctrl.Info; the output signals of the binarization engine comprise the 3-bit binarization output values {binVali | i ∈ {0,1,2}}, the context index value corresponding to each output bit {ctxIdxi[7:0] | i ∈ {0,1,2}}, and the total number w_num[1:0] of binarization output bits and associated context index values written to the 3-write 2-read FIFO queue; when w_num[1:0] is not equal to 0, the binarization engine writes {binVali, ctxIdxi[7:0] | i < w_num[1:0]} to the downstream FIFO queue, where i is 0, 1 or 2; a variable of the form a[b:c] denotes a signal a with a bit width of b+1, where a is the name of the signal, b is the index of the most significant bit and c is the index of the least significant bit.

4. The parallel encoding circuit based on CABAC in H.264/AVC as claimed in claim 2, characterized in that the circuit structure of the binarization engine has the following features: 1) during the coefficient scanning, the circuit reads the DCT/Q coefficients of a 4x4 block and writes them into the ping-pong storage structure in linearly increasing address order; 2) during the coefficient scanning, a 15-bit indicator vector and the index value of the last nonzero coefficient are recorded at the same time; 3) the indicator vector and the index value of the last nonzero coefficient are used for the binarization coding of the significance map significant_map, which is generated dynamically during coding from the indicator vector register and the index value of the last nonzero coefficient; 4) during the binarization coding of the DCT/Q coefficients of a 4x4 block, the circuit generates the read address of each nonzero coefficient in a single step from the indicator vector and the index value of the last nonzero coefficient.

5. The parallel encoding circuit based on CABAC in H.264/AVC as claimed in claim 1, characterized in that the context read and update operations of the context model engine are implemented with registers from the standard cell library and adopt the following design: the context models of the context model engine are classified according to the slice type they belong to; the contexts belonging to the same slice type are stored in a 2-read 2-write memory unit, and the other context model information is stored in single-port on-chip memory; when the slice type changes, the contents of the 2-read 2-write memory unit are updated, and during the update 2 contexts are updated per cycle.

6. The parallel encoding circuit based on CABAC in H.264/AVC as claimed in claim 1, characterized in that the parallel normalization engine consists of two cascaded single-cycle normalization engines and performs the normalization of two bits in each cycle; its input signals comprise binVal0, valMPS0, pStateIdx0[5:0], valid0, mode0, binVal1, valMPS1, pStateIdx1[5:0], valid1 and mode1; wherein binVal0 and binVal1 represent the values of the processed bits; valMPS0 and valMPS1 represent the most probable symbol values; pStateIdx0[5:0] and pStateIdx1[5:0] represent the probability state index values; valid0 and valid1 indicate whether the processed bits are valid; mode0 and mode1 represent the coding mode of the processed bits, 0 being the standard coding mode and 1 the bypass coding mode; the suffixes 0 and 1 of the input signals distinguish the order of the processed bits; the output signals of the parallel normalization engine are OB0[7:0], β0[2:0], L0[6:0], we0, OB1[7:0], β1[2:0], L1[6:0] and we1; when {wei | i ∈ {0,1}} is 1, the corresponding outputs OBi[7:0], βi[2:0] and Li[6:0], i = 0 or 1, are written to the downstream 2-write 1-read FIFO queue for use by the downstream engine in generating the RBSP code stream.

7. The parallel encoding circuit based on CABAC in H.264/AVC as claimed in claim 6, characterized in that the parallel normalization engine comprises: an OB[7:0] register for storing the current value of the variable OB; an R[8:0] coding interval register for storing the current value of the coding interval variable; and an L[9:0] coding lower limit register for storing the current value of the coding lower limit variable; the 2-write 1-read FIFO queue has a depth of 10 and each entry is 18 bits wide; in standard mode, bit field [17:11] stores the upper 7 bits of the updated coding lower limit; in bypass mode, bit [17] stores the most significant bit of the updated coding lower limit and bit field [16:11] is meaningless; bit field [10:3] stores OB[7:0]; bit field [2:0] stores the variable β[2:0]; wherein the contents stored as OB[7:0] and β[2:0] are the output signals OBi[7:0] and βi[2:0] of the parallel normalization engine, i = 0 or 1; in each cycle, the output results of the parallel normalization engine for at most two normalized bits are written at the tail of the queue, and at the same time, when the queue is not empty, the RBSP code stream generation engine reads the storage entry pointed to by the head pointer; wherein bit field [b:c] denotes the bits from position c to position b, b and c are integers, b is the index of the most significant bit and c is the index of the least significant bit.

8. The parallel encoding circuit based on CABAC in H.264/AVC as claimed in claim 7, characterized in that the RBSP code stream generation engine is an output code stream generation engine capable of producing multiple output bits per cycle and comprises a prefix bit output engine and a suffix bit output engine; when the 2-write 1-read FIFO queue is not empty, the output code stream generation engine generates the RBSP code stream from the information in the head entry of the 2-write 1-read FIFO queue; the prefix bit output engine is used to generate, from the most significant bit of the head entry of the 2-write 1-read FIFO queue and the stored value of the variable OB[7:0], a bit string consisting of one bit equal to that most significant bit followed by OB[7:0] bits equal to its inverse; the suffix bit output engine is used to generate the bit string formed by the input bit field [16:16-β[2:0]+1] and to feed it into the RBSP code stream; wherein the data written to the RBSP code stream are output aligned to byte boundaries.

9. A parallel encoding method based on CABAC in H.264/AVC, implemented with the circuit of any one of claims 1 to 8, characterized by comprising the following steps: