WO2007027402A2

WO2007027402A2 - Multi-stage CABAC decoding pipeline

Info

Publication number: WO2007027402A2
Application number: PCT/US2006/031353
Authority: WO
Inventors: Ankur Shah; Liang Peng
Original assignee: Micronas Usa, Inc.
Priority date: 2005-08-31
Filing date: 2006-08-11
Publication date: 2007-03-08
Also published as: WO2007027402A3

Abstract

An architecture capable of Context-Based Adaptive Binary Arithmetic Coding (CABAC) decoding at the syntax element level is disclosed. The architecture employs a multi-stage pipeline to implement the functions of CABAC bit parsing and decoding processes based on the H.264 CABAC algorithm. Each stage can be carried out in one clock cycle, and not all stages are executed for every bit (e.g., an average of 4 cycles per bit, which supports 30 frames per second). The architecture can be implemented, for example, using gate-level logic state machines as part of a system-on-chip (SOC) solution for a video/audio decoder for use in high definition television broadcasting (HDTV) applications. Other such video/audio decoder applications are enabled as well.

Description

MULTI-STAGE CABAC DECODING PIPELINE

RELATED APPLICATIONS

[0001] This application is related to U.S. Application No. 11/181,204, filed July 13, 2005, titled "Two Pass Architecture For H.264 CABAC Decoding Process," and to U.S. Application No. (not yet known), filed August 31, 2005, titled "Macroblock Neighborhood Address Calculation" <attorney docket number 22682-10708>. Each of these applications is herein incorporated in its entirety by reference.

FIELD OF THE INVENTION

[0002] The invention relates to video compression, and more particularly, to H.264 Context-Based Adaptive Binary Arithmetic Coding (CABAC) decoding at the syntax element level.

BACKGROUND OF THE INVENTION

[0003] The H.264 specification, also known as the Advanced Video Coding (AVC) standard, is a high compression digital video codec standard produced by the Joint Video Team (JVT), and is identical to ISO MPEG-4 part 10. The H.264 standard is herein incorporated by reference in its entirety.

[0004] H.264 CODECs can encode video with approximately three times fewer bits than comparable MPEG-2 encoders at the same visual quality. This significant increase in coding efficiency means that more quality video data can be sent over the available channel bandwidth. In addition, many video services can now be offered in environments where they previously were not possible. H.264 CODECs would be particularly useful, for instance, in high definition television (HDTV) applications, bandwidth limited networks (e.g., streaming mobile television), personal video recorder (PVR) and storage applications for home use, and other such video delivery applications (e.g., digital terrestrial TV, cable TV, satellite TV, video over xDSL, DVD, and digital and wireless cinema).

[0005] In general, all standard video processing (e.g., MPEG-2 and H.264) encodes video as a series of pictures. For video in the interlaced format, the two fields of a frame can be encoded together as a frame picture, or encoded separately as two field pictures. Both types of encoding can be used in a single interlaced sequence. The output of the decoding process for an interlaced sequence is a series of reconstructed fields. For video in the progressive format, all encoded pictures are frame pictures. The output of the decoding process is a series of reconstructed frames.

[0006] Encoded pictures are classified into three types: I, P, and B. I-type pictures represent intra coded pictures, and are used as a prediction starting point (e.g., after error recovery or a channel change). Here, all macroblocks are coded with the prediction only from the macroblocks in the same picture. P-type pictures represent predicted pictures. Here, macroblocks can be coded with forward prediction with reference to macroblocks in previous I-type or P-type pictures, or they can be intra coded within the same pictures. B-type pictures represent bi-directionally predicted pictures. Here, macroblocks can be coded with forward prediction (with reference to the macroblocks in previous I-type and P-type pictures), or with backward prediction (with reference to the macroblocks in next I-type and P-type pictures), or with interpolated prediction (with reference to the macroblocks in previous and next I-type and P-type pictures), or intra coded within the same picture. In both P-type and B-type pictures, macroblocks may be skipped and not sent at all. In such cases, the decoder uses the anchor reference pictures for prediction with no error.

[0007] The advanced coding techniques of the H.264 specification operate within a similar scheme as used by previous MPEG standards. The higher coding efficiency and video quality are enabled by a number of features, including improved motion estimation and inter prediction, spatial intra prediction and transform, and context-adaptive binary arithmetic coding (CABAC) and context-adaptive variable length coding (CAVLC) algorithms.

[0008] As is known, motion estimation is used to support inter picture prediction for eliminating temporal redundancies. Spatial correlation of data is used to provide intra picture prediction (prior to the transform). Residuals are constructed as the difference between predicted images and the source images. Discrete spatial transform and filtering is used to eliminate spatial redundancies in the residuals. H.264 also supports entropy coding of the transformed residual coefficients and of the supporting data such as motion vectors.

[0009] Entropy is a measure of the average information content per source output unit, and is typically expressed in bits/pixel. Entropy is maximized when all possible values of the source output unit are equally probable (e.g., an image of 8-bit pixels with an average information content of 8 bits/pixel). Coding the source output unit with fewer bits, on average, generally results in information loss. Note, however, that the entropy can be reduced so that the image can be coded with fewer than 8 bits/pixel on average without information loss.
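As a small illustration of the entropy measure described above, the following Python sketch computes the Shannon entropy of the empirical distribution of a sample sequence. The function name `entropy_bits_per_symbol` and the two sample sequences are illustrative assumptions and are not part of the H.264 specification.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(samples):
    """Shannon entropy, in bits per symbol, of the empirical distribution.

    Uses H = sum(p * log2(1 / p)) over the distinct values in `samples`.
    """
    counts = Counter(samples)
    total = len(samples)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Every 8-bit value occurs equally often: the entropy is the full 8 bits/pixel.
uniform_pixels = list(range(256))

# One value dominates: far fewer bits per pixel are needed on average.
skewed_pixels = [0] * 255 + [1]

uniform_entropy = entropy_bits_per_symbol(uniform_pixels)
skewed_entropy = entropy_bits_per_symbol(skewed_pixels)
```

The skewed case is the situation an adaptive arithmetic coder exploits: when one symbol is much more likely than the others, the average cost per symbol can drop well below one bit.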

[0010] The H.264 specification provides two alternative processes of entropy coding - CABAC and CAVLC. CABAC provides a highly efficient encoding scheme when it is known that certain symbols are much more likely than others. Such dominant symbols may be encoded with extremely small bit/symbol ratios. CABAC continually updates the frequency statistics of the incoming data, and adaptively adjusts the arithmetic and context model of the coding algorithm in real-time. CAVLC uses multiple variable length codeword tables to encode transform coefficients. The best codeword table is selected adaptively based on a priori statistics of already processed data. A single table is used for non-coefficient data.

[0011] The H.264 specification provides for seven profiles each targeted to particular applications, including a Baseline Profile, a Main Profile, an Extended Profile, and four High Profiles. The Baseline Profile supports progressive video, uses I and P slices, CAVLC for entropy coding, and is targeted towards real-time encoding and decoding for applications. The Main Profile supports both interlaced and progressive video with macroblock or picture level field/frame mode selection, and uses I, P, B slices, weighted prediction, as well as both CABAC and CAVLC for entropy coding. The Extended Profile supports both interlaced and progressive video, CAVLC, and uses I, P, B, SP, SI slices.

[0012] The High Profile extends functionality of the Main Profile for effective coding. The High Profile uses adaptive 8x8 or 4x4 transform, and enables perceptual quantization matrices. The High 10 Profile is an extension of the High Profile for 10-bit component resolution. The High 4:2:2 Profile supports 4:2:2 chroma format and up to 10-bit component resolution (e.g., for video production and editing). The High 4:4:4 Profile supports 4:4:4 chroma format and up to 12-bit component resolution. It also enables lossless mode of operation and direct coding of the RGB signal (e.g., for professional production and graphics).

[0013] Prior to CABAC, the arithmetic coding technique typically used in image compression was the QM-coder adopted in the JPEG, JPEG2000 and JBIG standards. However, this technique uses an approximation to avoid expensive hardware multipliers, which makes the interval range updating and the probability prediction rules used in the QM-coder implementation imprecise. This has greatly limited the efficiency of the arithmetic coding. Another limitation of the QM-coder is that it does not supply a good way for the context adaptation in the bit coding process. The context based adaptive binary arithmetic coding (i.e., CABAC) proposed by the JVT committee uses an improved version of arithmetic coder, known as an M-coder. The M-coder has not only overcome the precision issue, but also simplified the operation used to update the interval range. It replaces the use of multipliers with a modulation table, which supplies sufficient information to keep track of the probability state transition and the interval change. In addition to the use of the M-coder, CABAC also incorporates a bit level content adaptive scheme that fine-tunes the probability model for each bit in its decoding process based on the accumulative statistics of the same bit of the same syntax element previously decoded.

[0014] However, the JVT-proposed H.264 CABAC algorithm and its various software implementations are intrinsically serialized operations. Such a software solution is very slow in performance because there is a strong dependency between consecutive bits, due to (a) the nature of the statistical modeling in the arithmetic coding, and (b) the bit level dependency in the context modeling of the H.264 CABAC decoding process. Thus, there is no known software implementation that can meet, for instance, the real-time 30 frames per second performance requirement for the High Definition 1920x1080 interlace (1080i) or 1280x720 progressive (720P) formats used in the broadcast standard. In addition, an H.264 CABAC bit stream has a huge bit rate fluctuation, which makes it very difficult for any implementation to build an ASIC hardware component in a SOC system to meet the real-time performance requirement for demanding applications, such as high definition video broadcasting.

[0015] What is needed, therefore, are architectures that are H.264 CABAC enabled.

SUMMARY OF THE INVENTION

[0016] One embodiment of the present invention provides a multi-stage context-adaptive binary arithmetic coding (CABAC) architecture device for decoding a video bit stream. The device includes one or more stages adapted to operate on Bin 0 only, for detecting macroblocks neighboring a current macroblock, and for calculating a context index increment for each syntax element. The device further includes one or more stages adapted to operate on all bins, for writing updated probability state index and most probable symbol values back to a previous stage table corresponding to a current context index, wherein if the current context index and a subsequent context index are the same, the updated probability state index and most probable symbol values are forwarded to a decode stage of a subsequent bit corresponding to the subsequent context index. In one such embodiment, bits of the video bit stream are decoded at a rate that enables processing of 30 frames per second. Each stage can be implemented, for example, with at least one of a gate-level logic state machine and a memory. Information provided by the one or more stages adapted to operate on Bin 0 only can be reused for subsequent bins.

[0017] Another embodiment of the present invention provides a multi-stage context-adaptive binary arithmetic coding (CABAC) architecture device for decoding a video bit stream. In this embodiment, the device includes a history table lookup stage adapted to operate on Bin 0 only, for detecting macroblocks neighboring a current macroblock, and for retrieving previously stored syntax elements information for the neighboring macroblocks. The device further includes a neighborhood information context increment stage adapted to operate on Bin 0 only, for parsing the retrieved syntax elements information for the neighboring macroblocks, and calculating a context index increment for each syntax element. The device also includes a decode stage adapted to operate on all bins, for carrying out decode processes. The device further includes a state MPS write-back stage adapted to operate on all bins, for writing updated probability state index and most probable symbol values back to a stateMPS value table corresponding to a current context index, wherein if the current context index and a subsequent context index are the same, the updated probability state index and most probable symbol values are forwarded to the decode stage of a subsequent bit. In one such embodiment, each stage is processed in one clock cycle. In another such embodiment, bits of the video bit stream are decoded in an average of four clock cycles. The device may include a context final stage adapted to operate on all bins, for calculating a final context for a particular syntax element. The device may include a state MPS read stage adapted to operate on all bins, for storing and retrieving a probability state index and a most probable symbol value using a stateMPS value table and a context index as an offset into that table. The device may include a binarisation stage adapted to operate on all bins, for performing a bin match process for each syntax element. In one particular embodiment, the history table lookup stage includes a macroblock history lookup table, and is further configured to calculate a table address for each of the neighboring macroblocks so that the previously stored syntax elements information can be retrieved. The retrieved syntax elements information for the neighboring macroblocks may include, for example, macroblock attributes, residual attributes, motion vector attributes, and sub-macroblock attributes. In another particular embodiment, the decode stage includes a range table, a least probable symbol (LPS) transition table, and a most probable symbol (MPS) transition table, with the probability state index used as an offset to each table. In another particular embodiment, the decode stage executes a DecodeDecision process, a DecodeBypass process, or a DecodeTerminate process for each bin. Information provided for Bin 0 by the history table lookup stage and the neighborhood information context increment stage can be reused for subsequent bins. The device can be implemented, for instance, as a system-on-chip (SOC).

[0018] The features and advantages described herein are not all-inclusive and, in particular, many additional features and advantages will be apparent to one of ordinary skill in the art in view of the figures and description. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and not to limit the scope of the inventive subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

[0019] Figure 1a illustrates a multi-stage H.264 CABAC decoding architecture configured in accordance with an embodiment of the present invention.

[0020] Figure 1b illustrates the CABAC pipeline control for the multi-stage H.264 CABAC decoding architecture of Figure 1a, in accordance with an embodiment of the present invention.

[0021] Figure 2 illustrates a bin decode flow carried out by the multi-stage H.264 CABAC decoding architecture of Figures 1a-b, in accordance with an embodiment of the present invention.

[0022] Figure 3 illustrates a decode decision flow carried out by the multi-stage H.264 CABAC decoding architecture of Figures 1a-b, in accordance with an embodiment of the present invention.

[0023] Figure 4 illustrates a bypass decoding process carried out by the multi-stage H.264 CABAC decoding architecture of Figures 1a-b, in accordance with an embodiment of the present invention.

[0024] Figure 5 illustrates a decode decision before termination flow carried out by the multi-stage H.264 CABAC decoding architecture of Figures 1a-b, in accordance with an embodiment of the present invention.

[0025] Figure 6 illustrates a renormalization flow carried out by the multi-stage H.264 CABAC decoding architecture of Figure 1a, in accordance with an embodiment of the present invention.

[0026] Figure 7 illustrates an example CABAC decoding sequence for four syntax elements, in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

[0027] An architecture capable of Context-Based Adaptive Binary Arithmetic Coding (CABAC) decoding at the syntax element level is disclosed. The architecture employs a multi-stage pipeline to implement the functions of CABAC bit parsing and decoding processes based on the H.264 CABAC algorithm. Each stage can be carried out in one clock cycle, and not all stages are executed for every bit. The architecture can be implemented, for example, using gate-level logic state machines as part of a system-on-chip (SOC) solution for a video/audio decoder for use in high definition television broadcasting (HDTV) applications. Other such video/audio decoder applications are enabled as well.

Multi-Stage Architecture

[0028] Figure 1a illustrates a multi-stage H.264 CABAC decoding architecture configured in accordance with an embodiment of the present invention. In this example configuration, the H.264 CABAC decoding architecture operates on the lowermost level of the input stream, where each syntax element is executed based on a seven-stage pipeline.

[0029] As can be seen, the seven-stage pipeline includes macroblock A and B address calculation, a macroblock history table lookup, macroblock A and B information parsing, context calculation, state MPS value table lookup, LPS and MPS transition tables and a range table, coding offset and range update, and a probability state index and value MPS update. As is known, MPS stands for most probable symbol and LPS for least probable symbol. Each of these functional blocks can be implemented, for example, with conventional technology, such as gate-level logic state machines. The lookup tables can be implemented, for example, with RAM or ROM. Other suitable processing environments (e.g., a microcontroller having I/O ports for each block, and embedded routines for carrying out the functionality described herein) and memory facilities will be apparent in light of this disclosure.

[0030] In one embodiment, the CABAC pipeline control is based on state machines executing the flow chart illustrated in Figure 1b. The seven states (designated S0 through S6) of the flow include history table lookup (HIST_TBL_LKUP), neighborhood information context increment (NBHINFO_CTXINC), context final (CTX_FINAL), state MPS read (STMPS_READ), decode (DEC), binarisation (BIN), and state MPS write-back (STMPS_WRBACK).

[0031] State S0 (HIST_TBL_LKUP) corresponds to the macroblock A and B address calculation and macroblock history table lookup blocks of the architecture shown in Figure 1a. State S1 (NBHINFO_CTXINC) corresponds to the macroblock A and B information parsing block of the architecture shown in Figure 1a. State S2 (CTX_FINAL) corresponds to the context calculation block of the architecture shown in Figure 1a. State S3 (STMPS_READ) corresponds to the state MPS value table lookup block of the architecture shown in Figure 1a. State S4 (DEC) corresponds to the LPS and MPS transition tables and range table blocks of the architecture shown in Figure 1a. State S5 (BIN) corresponds to the coding offset and range update block of the architecture shown in Figure 1a. State S6 (STMPS_WRBACK) corresponds to the probability state index and value MPS update block of the architecture shown in Figure 1a.

[0032] Note that states S0 to S6 for Bin 0 are processed during clock cycles C1 through C7, respectively. As will be explained in turn, if the current context index and subsequent context index are the same, the updated variables provided by the state MPS write-back (STMPS_WRBACK) are forwarded to the decode (DEC) stage of the subsequent bit, which in this example flow is Bin 1. This forwarding eliminates redundant processing, and eliminates a processing cycle for that bit. Furthermore, note that states S0 and S1 can be performed for Bin 0 only, and the outputs therefrom reused for subsequent bins. Thus, only states S2 to S6 are performed for Bin 1 (processed during clock cycles C5 through C9, respectively), and only states S2 to S6 are performed for Bin 2 (processed during clock cycles C5 through C9, respectively). This eliminates two additional clock cycles from the decoding of all bins after Bin 0. Given the conditional forwarding from the state S6 (MPS write-back stage) to the state S4 (decode stage), and the reuse of Bin 0 results from states S0 and S1 for subsequent bins, it can be seen that 1 bit of the input video processing stream can be processed in an average of four clock cycles. Such a bit processing rate enables the 30 frames per second performance requirement for High Definition 1920x1080 interlace (1080i) or 1280x720 progressive (720P) formats used in the broadcast standard. In one particular embodiment, the bits of the video bit stream are decoded in an average of 1 bit every four 200 MHz clock cycles. Other clock speeds can be used as well.
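The throughput figures quoted above can be checked with simple arithmetic. The following Python sketch takes the 200 MHz clock, the average of four cycles per bit, and the 30 frames per second target at face value. The function names are illustrative, and the per-frame budget is a derived number rather than one stated in this disclosure.

```python
def bits_per_second(clock_hz, cycles_per_bit):
    """Average decoded bits per second for a given clock and cycle cost."""
    return clock_hz // cycles_per_bit


def bit_budget_per_frame(clock_hz, cycles_per_bit, frames_per_second):
    """Average number of bits that can be decoded within one frame period."""
    return bits_per_second(clock_hz, cycles_per_bit) / frames_per_second


CLOCK_HZ = 200_000_000   # 200 MHz clock from the embodiment above
CYCLES_PER_BIT = 4       # average of four clock cycles per bit
FRAME_RATE = 30          # frames per second

rate = bits_per_second(CLOCK_HZ, CYCLES_PER_BIT)
budget = bit_budget_per_frame(CLOCK_HZ, CYCLES_PER_BIT, FRAME_RATE)
```

Under these assumptions the pipeline sustains 50 million bits per second, or roughly 1.67 million bits per frame period at 30 frames per second.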

[0033] Each of the seven states of the CABAC decoding process will now be discussed in more detail.

[0034] History Table Lookup Stage: CABAC context for each syntax element for a current macroblock is based on the syntax element information of its neighboring macroblocks A and B (MB_A and MB_B, respectively). After processing each macroblock, the value of each syntax element is stored in a macroblock history table. In one particular embodiment, the value of each syntax element is stored in two 512 x 160b blocks of the macroblock history table (one block is used for frame or field only mode, and both blocks are used for MBAFF mode). Other macroblock storage facilities and sizes will be apparent in light of this disclosure.

[0035] In this stage of the decoding process, based on the macroblock index, the left (MB_A) neighbor and top (MB_B) neighbor of the current macroblock are detected (e.g., by detecting the address offset into the macroblock history table for MB_A and MB_B). The macroblock history table address for these neighbors is calculated (using the MB_A and MB_B address calculation). These table addresses can be calculated, for example, as described in the H.264 specification. Alternatively, these table addresses can be calculated as described in the previously incorporated U.S. Application No. (not yet known), filed August 31, 2005, titled "Macroblock Neighborhood Address Calculation" <attorney docket number 22682-10708>. In any case, the calculated addresses are used to read out previously stored syntax elements information for neighbors macroblock A and macroblock B from the macroblock history table. For a multiple bit syntax element, this state occurs for bin 0 only.
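A minimal sketch of the neighbor detection for the simplest case is given below in Python. It assumes a non-MBAFF frame picture with macroblocks numbered in raster-scan order, ignores slice boundaries, and assumes the history table is addressed directly by macroblock index. The function name `neighbor_addresses` is hypothetical, and this is not the address calculation of the incorporated application.

```python
def neighbor_addresses(mb_index, pic_width_in_mbs):
    """Return (mb_a, mb_b) for the current macroblock.

    mb_a is the index of the left neighbor (MB_A) and mb_b the index of the
    top neighbor (MB_B). Each is None when the neighbor lies outside the
    picture. Assumes raster-scan numbering in a non-MBAFF frame.
    """
    in_first_column = (mb_index % pic_width_in_mbs) == 0
    in_first_row = mb_index < pic_width_in_mbs

    mb_a = None if in_first_column else mb_index - 1
    mb_b = None if in_first_row else mb_index - pic_width_in_mbs
    return mb_a, mb_b


# A 1920-pixel-wide picture is 120 macroblocks (16 pixels each) across.
WIDTH_IN_MBS = 1920 // 16
left, top = neighbor_addresses(121, WIDTH_IN_MBS)
```

In MBAFF mode the derivation works on macroblock pairs and depends on the field/frame mode of each pair, which is why the disclosure defers to the H.264 specification or the incorporated application for the full calculation.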

[0036] Neighborhood Information Context Increment Stage: In this state, the syntax elements information for macroblock A and macroblock B is parsed. In one particular such embodiment, the following syntax elements information (macroblock attributes, residual attributes, motion vector attributes, and sub-macroblock attributes) for macroblock A and macroblock B is parsed using the MB_A and MB_B information parsing block.

// MB ATTRIBUTES
// 0 - MB_SKIP
// 1 - MB_I4x4
// 2 - MB_I16x16
// 3 - MB_IPCM
// 4 - P_16x16
// 5 - P_16x8
// 6 - P_8x16
// 7 - P_8x8
// 8 - B_Direct_16x16
// 9 - B_L0_16x16
// 10 - B_L1_16x16
// 11 - B_Bi_16x16
// 12 - B_L0_L0_16x8
// 13 - B_L0_L0_8x16
// 14 - B_L1_L1_16x8
// 15 - B_L1_L1_8x16
// 16 - B_L0_L1_16x8
// 17 - B_L0_L1_8x16
// 18 - B_L1_L0_16x8
// 19 - B_L1_L0_8x16
// 20 - B_L0_Bi_16x8
// 21 - B_L0_Bi_8x16
// 22 - B_L1_Bi_16x8
// 23 - B_L1_Bi_8x16
// 24 - B_Bi_L0_16x8
// 25 - B_Bi_L0_8x16
// 26 - B_Bi_L1_16x8
// 27 - B_Bi_L1_8x16
// 28 - B_Bi_Bi_16x8
// 29 - B_Bi_Bi_8x16
// 30 - B_8x8
// 31 - FIELD_FLAG

// RESIDUAL ATTRIBUTES
// 35:32 - CBPLUMA
// 37:36 - CBPCHROMA
// 39:38 - INTRACHROMAPREDMODE
// 42:40 - QPDELTA
// 43 - Y_DC_CODEDBLOCKFLAG
// 44 - Y_AC_CODEDBLOCKFLAG
// 45 - U_DC_CODEDBLOCKFLAG
// 46 - V_DC_CODEDBLOCKFLAG
// 47 - U_AC_CODEDBLOCKFLAG
// 48 - V_AC_CODEDBLOCKFLAG
// 51:49 - REF_IDX_L0
// 54:52 - REF_IDX_L1
// 59:55 - RSVD
// 60 - B_Direct_8x8_0
// 61 - B_Direct_8x8_1
// 62 - B_Direct_8x8_2
// 63 - B_Direct_8x8_3

// MOTION VECTOR ATTRIBUTES
// 71:64 - MVDX_L0
// 79:72 - MVDY_L0
// 87:80 - MVDX_L1
// 95:88 - MVDY_L1

// SUB MB ATTRIBUTES
// 96 - P_L0_8x8_0
// 97 - P_L0_8x4_0
// 98 - P_L0_4x8_0
// 99 - P_L0_4x4_0
// 100 - B_L0_8x8_0
// 101 - B_L1_8x8_0
// 102 - B_Bi_8x8_0
// 103 - B_L0_8x4_0
// 104 - B_L0_4x8_0
// 105 - B_L1_8x4_0
// 106 - B_L1_4x8_0
// 107 - B_Bi_8x4_0
// 108 - B_Bi_4x8_0
// 109 - B_L0_4x4_0
// 110 - B_L1_4x4_0
// 111 - B_Bi_4x4_0
// 112 - P_L0_8x8_1
// 113 - P_L0_8x4_1
// 114 - P_L0_4x8_1
// 115 - P_L0_4x4_1
// 116 - B_L0_8x8_1
// 117 - B_L1_8x8_1
// 118 - B_Bi_8x8_1
// 119 - B_L0_8x4_1
// 120 - B_L0_4x8_1
// 121 - B_L1_8x4_1
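The bit layout listed above can be parsed with ordinary shift-and-mask operations. The Python sketch below models one history table entry as an integer and assumes that bit 0 is the least significant bit, which the listing does not state. The helper names `get_field` and `set_field` and the `FIELDS` dictionary are hypothetical, and only a few of the listed fields are included.

```python
# (high bit, low bit) positions for a few of the fields listed above.
FIELDS = {
    "FIELD_FLAG": (31, 31),
    "CBPLUMA": (35, 32),
    "CBPCHROMA": (37, 36),
    "QPDELTA": (42, 40),
    "MVDX_L0": (71, 64),
}


def get_field(word, hi, lo):
    """Extract bits hi..lo (inclusive) from a history table word."""
    width = hi - lo + 1
    return (word >> lo) & ((1 << width) - 1)


def set_field(word, hi, lo, value):
    """Return a copy of word with bits hi..lo replaced by value."""
    width = hi - lo + 1
    mask = ((1 << width) - 1) << lo
    return (word & ~mask) | ((value << lo) & mask)


# Store a luma coded block pattern and a field flag, then read them back.
entry = 0
entry = set_field(entry, *FIELDS["CBPLUMA"], 0b1010)
entry = set_field(entry, *FIELDS["FIELD_FLAG"], 1)
cbp_luma = get_field(entry, *FIELDS["CBPLUMA"])
```

In a gate-level implementation these extractions are plain wire selections on the 160-bit table output, so parsing both neighbors fits comfortably in a single stage.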

[0037] Calculating the context index increment (ctxIdxInc) for each syntax element can be performed by the MB_A and MB_B information parsing block, for example, as per the rules specified in Section 9.3.3.1 "Derivation process for ctxIdx" of the previously incorporated H.264 specification. This neighborhood information context increment state occurs for bin 0 only.

[0038] Context Calculation Stage: Calculating the final context for a particular syntax element is carried out by the context calculation block. In one embodiment, this block is configured to perform the context calculation in accordance with the rules specified in Section 9.3.3.1 "Derivation process for ctxIdx" of the previously incorporated H.264 specification. This state occurs for all bins.
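As one concrete instance of the two steps described above, the Python sketch below derives ctxIdxInc and the final ctxIdx for the mb_skip_flag syntax element, following the neighbor-based rule of the H.264 specification (an unavailable or skipped neighbor contributes 0, otherwise 1). The function names and the dictionary representation of a neighbor are hypothetical. The ctxIdxOffset values 11 and 24 are quoted from the specification's ctxIdxOffset table as recalled here and should be checked against it.

```python
# ctxIdxOffset for mb_skip_flag by slice type (assumed from the H.264 tables).
MB_SKIP_CTX_OFFSET = {"P": 11, "SP": 11, "B": 24}


def cond_term_flag(neighbor):
    """0 if the neighbor is unavailable or was skipped, else 1.

    `neighbor` is None when unavailable, else a dict with an 'mb_skip' flag.
    """
    if neighbor is None or neighbor["mb_skip"]:
        return 0
    return 1


def mb_skip_ctx_idx_inc(mb_a, mb_b):
    """Neighborhood information context increment stage (Bin 0 only)."""
    return cond_term_flag(mb_a) + cond_term_flag(mb_b)


def mb_skip_ctx_idx(slice_type, ctx_idx_inc):
    """Context final stage: ctxIdx = ctxIdxOffset + ctxIdxInc."""
    return MB_SKIP_CTX_OFFSET[slice_type] + ctx_idx_inc


inc = mb_skip_ctx_idx_inc({"mb_skip": False}, {"mb_skip": True})
ctx_idx = mb_skip_ctx_idx("P", inc)
```

Other syntax elements use different neighbor conditions and offsets, but the same two-step shape: an increment derived from MB_A and MB_B, then an offset added to form the final context index.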

[0039] State MPS Read Stage: During the initialization process, two variables are initialized for each context variable. These variables are pStateIdx and valMPS. The pStateIdx variable corresponds to a probability state index, and the valMPS variable corresponds to the value of the most probable symbol. In one such particular embodiment, these variables are initialized in accordance with section 9.3.1 "Initialisation Process" of the previously incorporated H.264 specification.

[0040] The initialized values are stored into the stateMPS value table. In this state the previously stored (after initialisation or decode operation) pStateIdx and valMPS are read out of the stateMPS value table with the context index (ctxIdx) as the offset into the table. This state occurs for all bins.
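The initialisation and table read described above can be sketched in Python as follows. The derivation of pStateIdx and valMPS from the per-context parameters m and n and the slice quantiser follows the initialisation rule of the H.264 specification. The function names, the list-based table, and the three (m, n) pairs used in the example are illustrative placeholders, not a reproduction of the specification's tables.

```python
def clip3(lo, hi, x):
    """Clamp x to the inclusive range [lo, hi]."""
    return max(lo, min(hi, x))


def init_context(m, n, slice_qp_y):
    """Return (pStateIdx, valMPS) for one context variable."""
    pre_ctx_state = clip3(1, 126, ((m * clip3(0, 51, slice_qp_y)) >> 4) + n)
    if pre_ctx_state <= 63:
        return 63 - pre_ctx_state, 0
    return pre_ctx_state - 64, 1


def build_state_mps_table(init_params, slice_qp_y):
    """stateMPS value table: one (pStateIdx, valMPS) entry per ctxIdx."""
    return [init_context(m, n, slice_qp_y) for (m, n) in init_params]


# Placeholder (m, n) pairs for three consecutive context indices.
example_params = [(20, -15), (2, 54), (3, 74)]
state_mps_table = build_state_mps_table(example_params, slice_qp_y=26)

# State MPS read stage: ctxIdx is the offset into the table.
p_state_idx, val_mps = state_mps_table[1]
```

A real table holds one entry per context index defined by the specification, and it is rebuilt at the start of each slice because the initial states depend on the slice quantiser.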

[0041] Decode Stage: In this state, rangeTabLPS, transIdxLPS, and transIdxMPS values are read out of the range table, LPS transition table, and MPS transition table, respectively, with pStateIdx as the offset to each table. These tables can be implemented, for example, with ROM based on Table 9-33 - "Specification of rangeTabLPS depending on pStateIdx and qCodIRangeIdx" and Table 9-34 - "State transition table" of the previously incorporated H.264 specification. This state occurs for all bins.

[0042] The DecodeDecision or DecodeBypass or DecodeTerminate process of the bin decode state can be executed based on the flow chart illustrated in Figure 2, which illustrates the overall bin decode flow in accordance with H.264. As explained in the H.264 specification, inputs to this process include the bypassFlag, ctxIdx, and the state variables codIRange and codIOffset of the arithmetic decoding engine. The output of this process is the value of the bin. In operation for decoding the value of a bin, the context index (ctxIdx) is passed to the arithmetic decoding process DecodeBin(ctxIdx). If bypassFlag is equal to 1, then DecodeBypass is executed (discussed with reference to Figure 4). If bypassFlag is equal to 0 and ctxIdx is equal to 276, then DecodeTerminate is executed (discussed with reference to Figure 5). Otherwise, DecodeDecision is executed (discussed with reference to Figure 3).

[0043] Figure 3 illustrates a decode decision flow carried out by the multi-stage H.264 CABAC decoding architecture of Figures 1a-b, in accordance with an embodiment of the present invention. As explained in the H.264 specification, inputs to this process include ctxIdx, codIRange, and codIOffset. Outputs of this process are the decoded value binVal, and the updated state variables codIRange and codIOffset. Initially, the value of the variable codIRangeLPS is derived. In particular, given the current value of codIRange, codIRange is mapped to the index of a quantised value of codIRange, which is denoted by the variable qCodIRangeIdx: qCodIRangeIdx = ( codIRange >> 6 ) & 0x03. Given qCodIRangeIdx and pStateIdx associated with ctxIdx, the value of the variable rangeTabLPS as specified in Table 9-33 of H.264 is assigned to codIRangeLPS: codIRangeLPS = rangeTabLPS[ pStateIdx ][ qCodIRangeIdx ]. Next, the value of codIRange - codIRangeLPS is assigned to codIRange, to which the current value of codIOffset is compared. If codIOffset is larger than or equal to codIRange, the value 1 - valMPS is assigned to the decoded value binVal, codIOffset is decremented by codIRange, and codIRange is set equal to codIRangeLPS. Otherwise valMPS is assigned to binVal. Given the decoded value binVal, the state transition process is performed. Also, depending on the current value of codIRange, renormalization is performed. Inputs to the state transition process include the current pStateIdx, the decoded value binVal and valMPS values of the context variable associated with ctxIdx. Outputs of this process include the updated pStateIdx and valMPS of the context variable associated with ctxIdx. Depending on the decoded value binVal, the update of the two state variables pStateIdx and valMPS associated with ctxIdx is derived as follows:

if( binVal = = valMPS )
    pStateIdx = transIdxMPS( pStateIdx )
else {
    if( pStateIdx = = 0 )
        valMPS = 1 - valMPS
    pStateIdx = transIdxLPS( pStateIdx )
}
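The decode decision flow and state transition described above can be sketched in Python as follows. The class `DecoderState`, the function `decode_decision`, and the dictionary used for the context variable are hypothetical names. The three `*_TOY` tables are two-entry stand-ins supplied so the example runs on its own, and a real decoder would use the full range and transition tables of the H.264 specification. The initial values (codIRange of 0x01FE and a 9-bit codIOffset) follow the specification's decoder initialisation.

```python
class DecoderState:
    """Arithmetic decoding engine state plus a trivial bit source."""

    def __init__(self, bits):
        self.bits = list(bits)
        self.pos = 0
        self.cod_i_range = 0x01FE
        self.cod_i_offset = self.read_bits(9)

    def read_bits(self, n):
        """Read n bits MSB first; reads past the end return zeros."""
        value = 0
        for _ in range(n):
            bit = self.bits[self.pos] if self.pos < len(self.bits) else 0
            self.pos += 1
            value = (value << 1) | bit
        return value


# Toy stand-in tables with two probability states (NOT the full spec tables).
RANGE_TAB_LPS_TOY = [[128, 176, 208, 240], [128, 167, 197, 227]]
TRANS_IDX_LPS_TOY = [0, 0]
TRANS_IDX_MPS_TOY = [1, 1]


def decode_decision(dec, ctx, range_tab_lps, trans_idx_lps, trans_idx_mps):
    """Decode one bin and update ctx = {'pStateIdx': ..., 'valMPS': ...}."""
    q_cod_i_range_idx = (dec.cod_i_range >> 6) & 0x03
    cod_i_range_lps = range_tab_lps[ctx["pStateIdx"]][q_cod_i_range_idx]
    dec.cod_i_range -= cod_i_range_lps

    if dec.cod_i_offset >= dec.cod_i_range:
        # Least probable symbol path.
        bin_val = 1 - ctx["valMPS"]
        dec.cod_i_offset -= dec.cod_i_range
        dec.cod_i_range = cod_i_range_lps
        if ctx["pStateIdx"] == 0:
            ctx["valMPS"] = 1 - ctx["valMPS"]
        ctx["pStateIdx"] = trans_idx_lps[ctx["pStateIdx"]]
    else:
        # Most probable symbol path.
        bin_val = ctx["valMPS"]
        ctx["pStateIdx"] = trans_idx_mps[ctx["pStateIdx"]]

    # Renormalization: keep codIRange at or above 0x0100.
    while dec.cod_i_range < 0x0100:
        dec.cod_i_range <<= 1
        dec.cod_i_offset = (dec.cod_i_offset << 1) | dec.read_bits(1)
    return bin_val


decoder = DecoderState([0] * 16)
context = {"pStateIdx": 0, "valMPS": 0}
first_bin = decode_decision(decoder, context, RANGE_TAB_LPS_TOY,
                            TRANS_IDX_LPS_TOY, TRANS_IDX_MPS_TOY)
```

Because both the range update and the state transition depend on the bin just decoded, consecutive bins that share a context cannot be decoded independently, which is the serial dependency the pipeline has to work around.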

[0044] Figure 4 illustrates a bypass decoding process carried out by the multi-stage H.264 CABAC decoding architecture of Figures 1a-b, in accordance with an embodiment of the present invention. As previously explained, the bypass decoding process is invoked when bypassFlag is equal to 1. Inputs to this process include bits from slice data and the state variables codIRange and codIOffset. Outputs of this process include the updated variables codIRange and codIOffset, and the decoded value binVal. Initially, the value of codIOffset is doubled (i.e., left-shifted by 1) and a single bit is shifted into codIOffset by using read_bits( 1 ). Next, the value of codIOffset is compared to the value of codIRange. If codIOffset is larger than or equal to codIRange, a value of 1 is assigned to binVal and codIOffset is decremented by codIRange. If, however, codIOffset is less than codIRange, a value of 0 is assigned to binVal.
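The bypass flow above reduces to a shift, a compare, and a conditional subtract. A minimal Python sketch follows, in which the function name `decode_bypass` is hypothetical and the bit obtained from read_bits( 1 ) is passed in as the argument `next_bit`.

```python
def decode_bypass(cod_i_range, cod_i_offset, next_bit):
    """Decode one bypass bin.

    Returns (bin_val, updated_cod_i_offset). codIRange is not modified by
    the bypass process, so it is not returned.
    """
    # Double the offset and shift in one new bit from the slice data.
    cod_i_offset = (cod_i_offset << 1) | next_bit
    if cod_i_offset >= cod_i_range:
        return 1, cod_i_offset - cod_i_range
    return 0, cod_i_offset


bin_val, new_offset = decode_bypass(cod_i_range=300, cod_i_offset=200, next_bit=1)
```

No context variable is read or written here, so bypass bins do not need the state MPS table or the transition tables.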

[0045] Figure 5 illustrates a decode decision before termination flow carried out by the multi-stage H.264 CABAC decoding architecture of Figures 1a-b, in accordance with an embodiment of the present invention. Inputs to this process include bits from slice data and the variables codIRange and codIOffset. Outputs of this process include the updated variables codIRange and codIOffset, and the decoded value binVal. This decode decision before termination flow applies to the decoding of end_of_slice_flag and of the bin indicating the I_PCM mode corresponding to ctxIdx equal to 276. Initially, the value of codIRange is decremented by 2. Next, the value of codIOffset is compared to the value of codIRange. If codIOffset is larger than or equal to codIRange, a value of 1 is assigned to binVal, no renormalization is carried out, and CABAC decoding is terminated. In such a case, the last bit inserted in register codIOffset is rbsp_stop_one_bit. If codIOffset is less than codIRange, a value of 0 is assigned to binVal and renormalization is performed (as will be discussed with reference to Figure 6). As noted in the H.264 specification, this procedure may also be implemented using the DecodeDecision(ctxIdx) flow with ctxIdx = 276. In the case where the decoded value is equal to 1, seven more bits would be read by the DecodeDecision(ctxIdx) flow, and a decoding process would have to adjust its bit stream pointer accordingly to properly decode following syntax elements.

[0046] Figure 6 illustrates a renormalization flow carried out by the multi-stage H.264 CABAC decoding architecture of Figure 1a, in accordance with an embodiment of the present invention. Inputs to this process include bits from slice data and the variables codIRange and codIOffset. Outputs of this process include the updated variables codIRange and codIOffset. As illustrated in Figure 6 and explained in the H.264 specification, the current value of codIRange is first compared to 0x0100. If codIRange is larger than or equal to 0x0100, then no renormalization is needed and the RenormD process is finished. If, however, codIRange is less than 0x0100, then the renormalization loop is entered. With each pass of this loop, the values of codIRange and codIOffset are each doubled (i.e., left-shifted by 1), and a single bit is shifted into codIOffset by using read_bits( 1 ). The process is then repeated as shown.
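The termination and renormalization flows of Figures 5 and 6 can be sketched together, since the former calls the latter when binVal is 0. This is an illustrative software model under stated assumptions, not the hardware of the embodiment; the names renorm_d, decode_terminate, and read_bit (a stand-in for read_bits( 1 )) are hypothetical.

```python
from typing import Callable, Tuple


def renorm_d(cod_i_range: int, cod_i_offset: int,
             read_bit: Callable[[], int]) -> Tuple[int, int]:
    """Renormalization: double both variables until codIRange >= 0x0100."""
    while cod_i_range < 0x0100:
        cod_i_range <<= 1
        # Doubling codIOffset makes room for one new bit of slice data.
        cod_i_offset = (cod_i_offset << 1) | read_bit()
    return cod_i_range, cod_i_offset


def decode_terminate(cod_i_range: int, cod_i_offset: int,
                     read_bit: Callable[[], int]) -> Tuple[int, int, int]:
    """Decode decision before termination; returns (codIRange, codIOffset, binVal)."""
    cod_i_range -= 2
    if cod_i_offset >= cod_i_range:
        # binVal = 1: no renormalization, CABAC decoding terminates.
        return cod_i_range, cod_i_offset, 1
    # binVal = 0: renormalize before decoding continues.
    cod_i_range, cod_i_offset = renorm_d(cod_i_range, cod_i_offset, read_bit)
    return cod_i_range, cod_i_offset, 0


# Usage with made-up values: the range drops to 0xFF, so one renormalization
# pass reads one bit and restores codIRange to at least 0x0100.
bits = iter([1])
result = decode_terminate(0x0101, 0x10, lambda: next(bits))
```

Because codIRange doubles on every pass, the loop always ends after a bounded number of passes for any non-zero starting range.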

[0047] Binarisation Stage: In this state, a bin match process is performed for each syntax element. In one particular embodiment, this bin match process for each syntax element is executed in accordance with section 9.3.2 "Binarisation Process" of the H.264 specification. The binary index is updated, and the renormalization process of Figure 6 is executed. The updated codIRange and codIOffset values are saved. This binarisation state is carried out for all bins.

[0048] State MPS Write-back Stage: In this state, the updated pstateIdx and valMPS values are written back to the stateMps value table corresponding to the current context index. If the current context index and the subsequent context index are the same, the updated pstateIdx and valMPS (results of S6, Figure 1b) can be forwarded to the decode stage (S4, Figure 1b) of the subsequent bit. This state occurs for all bins.
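The forwarding behavior of the write-back stage can be modeled in software as a table with one pending, not yet committed update. The sketch below is an assumption-laden illustration, not the disclosed circuit: the class StateMpsTable, its methods, and the single-entry pending slot are all hypothetical names and simplifications.

```python
from typing import List, Optional, Tuple


class StateMpsTable:
    """stateMps value table with one in-flight write-back (hypothetical model)."""

    def __init__(self, size: int) -> None:
        # Each entry holds (pstateIdx, valMPS), indexed by context index.
        self.entries: List[Tuple[int, int]] = [(0, 0)] * size
        # Update produced by the write-back stage but not yet stored.
        self.pending: Optional[Tuple[int, int, int]] = None

    def queue_write_back(self, ctx_idx: int, p_state_idx: int, val_mps: int) -> None:
        """Commit any older update, then hold the new one as in flight."""
        self.commit()
        self.pending = (ctx_idx, p_state_idx, val_mps)

    def commit(self) -> None:
        """Store the in-flight update into the table."""
        if self.pending is not None:
            ctx_idx, p_state_idx, val_mps = self.pending
            self.entries[ctx_idx] = (p_state_idx, val_mps)
            self.pending = None

    def read_for_decode(self, ctx_idx: int) -> Tuple[int, int, bool]:
        """Return (pstateIdx, valMPS, forwarded) for the next bin's decode stage."""
        if self.pending is not None and self.pending[0] == ctx_idx:
            # Same context index as the previous bin: forward the fresh values
            # rather than reading the stale table entry.
            return self.pending[1], self.pending[2], True
        p_state_idx, val_mps = self.entries[ctx_idx]
        return p_state_idx, val_mps, False


# Usage: context 2 is updated and immediately reused by the next bin.
table = StateMpsTable(4)
table.entries[2] = (10, 1)
table.queue_write_back(2, 11, 1)
same_ctx = table.read_for_decode(2)   # forwarded values
other_ctx = table.read_for_decode(3)  # ordinary table read
```

The point of the model is that the table entry for context 2 is still stale when the next bin needs it, so the decode stage must take the forwarded values.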

[0049] One embodiment of the present invention is configured to implement the following syntax in accordance with the H.264 specification.

[0050] Slice Data Syntax:
slice_data( ) {
  if( entropy_coding_mode_flag )
    while( !byte_aligned( ) )
      cabac_alignment_one_bit
  CurrMbAddr = first_mb_in_slice * ( 1 + MbaffFrameFlag )
  moreDataFlag = 1
  prevMbSkipped = 0
  do {
    if( slice_type != I && slice_type != SI )
      if( !entropy_coding_mode_flag ) {
        mb_skip_run
        prevMbSkipped = ( mb_skip_run > 0 )
        for( i = 0; i < mb_skip_run; i++ )
          CurrMbAddr = NextMbAddress( CurrMbAddr )
        moreDataFlag = more_rbsp_data( )
      } else {
        mb_skip_flag
        moreDataFlag = !mb_skip_flag
      }
    if( moreDataFlag ) {
      if( MbaffFrameFlag && ( CurrMbAddr % 2 == 0 ||
          ( CurrMbAddr % 2 == 1 && prevMbSkipped ) ) )
        mb_field_decoding_flag
      macroblock_layer( )
    }
    if( !entropy_coding_mode_flag )
      moreDataFlag = more_rbsp_data( )
    else {
      if( slice_type != I && slice_type != SI )
        prevMbSkipped = mb_skip_flag
      if( MbaffFrameFlag && CurrMbAddr % 2 == 0 )
        moreDataFlag = 1
      else {
        end_of_slice_flag
        moreDataFlag = !end_of_slice_flag
      }
    }
    CurrMbAddr = NextMbAddress( CurrMbAddr )
  } while( moreDataFlag )
}
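The CABAC branch of the slice data loop can be illustrated with a short sketch. It is a simplification under loudly stated assumptions: MbaffFrameFlag is taken to be 0, NextMbAddress( ) is modeled as a plain increment, and the three decode callbacks are hypothetical stand-ins for the CABAC decoding of mb_skip_flag, macroblock_layer( ), and end_of_slice_flag.

```python
from typing import Callable, List, Tuple


def parse_slice_data_cabac(
    first_mb_addr: int,
    slice_is_i_or_si: bool,
    decode_mb_skip_flag: Callable[[int], int],
    decode_macroblock_layer: Callable[[int], None],
    decode_end_of_slice_flag: Callable[[], int],
) -> List[Tuple[int, bool]]:
    """Walk macroblocks until end_of_slice_flag is 1 (MbaffFrameFlag assumed 0)."""
    curr_mb_addr = first_mb_addr
    visited: List[Tuple[int, bool]] = []
    while True:  # models do { ... } while( moreDataFlag )
        skipped = False
        if not slice_is_i_or_si:
            skipped = bool(decode_mb_skip_flag(curr_mb_addr))
        if not skipped:  # moreDataFlag = !mb_skip_flag
            decode_macroblock_layer(curr_mb_addr)
        visited.append((curr_mb_addr, skipped))
        # Without MBAFF, end_of_slice_flag follows every macroblock.
        end_of_slice = decode_end_of_slice_flag()
        curr_mb_addr += 1  # assumed stand-in for NextMbAddress( CurrMbAddr )
        if end_of_slice:  # moreDataFlag = !end_of_slice_flag
            return visited


# Usage with made-up flag sequences: macroblock 1 is skipped, and the slice
# ends after macroblock 2.
skip_flags = iter([0, 1, 0])
end_flags = iter([0, 0, 1])
decoded: List[int] = []
visited = parse_slice_data_cabac(
    0, False, lambda addr: next(skip_flags), decoded.append,
    lambda: next(end_flags))
```

In this model the end_of_slice_flag bin is the one decoded by the decode decision before termination flow of Figure 5.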

[0051] Macroblock Layer Syntax:
macroblock_layer( ) {
  mb_type
  if( mb_type == I_PCM ) {
    while( !byte_aligned( ) )
      pcm_alignment_zero_bit
    for( i = 0; i < 256 * ChromaFormatFactor; i++ )
      pcm_byte[ i ]
  } else {
    if( MbPartPredMode( mb_type, 0 ) != Intra_4x4 &&
        MbPartPredMode( mb_type, 0 ) != Intra_16x16 &&
        NumMbPart( mb_type ) == 4 )
      sub_mb_pred( mb_type )
    else
      mb_pred( mb_type )
    if( MbPartPredMode( mb_type, 0 ) != Intra_16x16 )
      coded_block_pattern
    if( CodedBlockPatternLuma > 0 || CodedBlockPatternChroma > 0 ||
        MbPartPredMode( mb_type, 0 ) == Intra_16x16 ) {
      mb_qp_delta
      residual( )
    }
  }
}

[0052] Macroblock Prediction Syntax:
mb_pred( mb_type ) {
  if( MbPartPredMode( mb_type, 0 ) == Intra_4x4 ||
      MbPartPredMode( mb_type, 0 ) == Intra_16x16 ) {
    if( MbPartPredMode( mb_type, 0 ) == Intra_4x4 )
      for( luma4x4BlkIdx = 0; luma4x4BlkIdx < 16; luma4x4BlkIdx++ ) {
        prev_intra4x4_pred_mode_flag[ luma4x4BlkIdx ]
        if( !prev_intra4x4_pred_mode_flag[ luma4x4BlkIdx ] )
          rem_intra4x4_pred_mode[ luma4x4BlkIdx ]
      }
    intra_chroma_pred_mode
  } else if( MbPartPredMode( mb_type, 0 ) != Direct ) {
    for( mbPartIdx = 0; mbPartIdx < NumMbPart( mb_type ); mbPartIdx++ )
      if( ( num_ref_idx_l0_active_minus1 > 0 || mb_field_decoding_flag ) &&
          MbPartPredMode( mb_type, mbPartIdx ) != Pred_L1 )
        ref_idx_l0[ mbPartIdx ]
    for( mbPartIdx = 0; mbPartIdx < NumMbPart( mb_type ); mbPartIdx++ )
      if( ( num_ref_idx_l1_active_minus1 > 0 || mb_field_decoding_flag ) &&
          MbPartPredMode( mb_type, mbPartIdx ) != Pred_L0 )
        ref_idx_l1[ mbPartIdx ]
    for( mbPartIdx = 0; mbPartIdx < NumMbPart( mb_type ); mbPartIdx++ )
      if( MbPartPredMode( mb_type, mbPartIdx ) != Pred_L1 )
        for( compIdx = 0; compIdx < 2; compIdx++ )
          mvd_l0[ mbPartIdx ][ 0 ][ compIdx ]
    for( mbPartIdx = 0; mbPartIdx < NumMbPart( mb_type ); mbPartIdx++ )
      if( MbPartPredMode( mb_type, mbPartIdx ) != Pred_L0 )
        for( compIdx = 0; compIdx < 2; compIdx++ )
          mvd_l1[ mbPartIdx ][ 0 ][ compIdx ]
  }
}

[0053] Sub-Macroblock Prediction Syntax:
sub_mb_pred( mb_type ) {
  for( mbPartIdx = 0; mbPartIdx < 4; mbPartIdx++ )
    sub_mb_type[ mbPartIdx ]
  for( mbPartIdx = 0; mbPartIdx < 4; mbPartIdx++ )
    if( ( num_ref_idx_l0_active_minus1 > 0 || mb_field_decoding_flag ) &&
        mb_type != P_8x8ref0 &&
        sub_mb_type[ mbPartIdx ] != B_Direct_8x8 &&
        SubMbPredMode( sub_mb_type[ mbPartIdx ] ) != Pred_L1 )
      ref_idx_l0[ mbPartIdx ]
  for( mbPartIdx = 0; mbPartIdx < 4; mbPartIdx++ )
    if( ( num_ref_idx_l1_active_minus1 > 0 || mb_field_decoding_flag ) &&
        sub_mb_type[ mbPartIdx ] != B_Direct_8x8 &&
        SubMbPredMode( sub_mb_type[ mbPartIdx ] ) != Pred_L0 )
      ref_idx_l1[ mbPartIdx ]
  for( mbPartIdx = 0; mbPartIdx < 4; mbPartIdx++ )
    if( sub_mb_type[ mbPartIdx ] != B_Direct_8x8 &&
        SubMbPredMode( sub_mb_type[ mbPartIdx ] ) != Pred_L1 )
      for( subMbPartIdx = 0;
           subMbPartIdx < NumSubMbPart( sub_mb_type[ mbPartIdx ] );
           subMbPartIdx++ )
        for( compIdx = 0; compIdx < 2; compIdx++ )
          mvd_l0[ mbPartIdx ][ subMbPartIdx ][ compIdx ]
  for( mbPartIdx = 0; mbPartIdx < 4; mbPartIdx++ )
    if( sub_mb_type[ mbPartIdx ] != B_Direct_8x8 &&
        SubMbPredMode( sub_mb_type[ mbPartIdx ] ) != Pred_L0 )
      for( subMbPartIdx = 0;
           subMbPartIdx < NumSubMbPart( sub_mb_type[ mbPartIdx ] );
           subMbPartIdx++ )
        for( compIdx = 0; compIdx < 2; compIdx++ )
          mvd_l1[ mbPartIdx ][ subMbPartIdx ][ compIdx ]
}

[0054] Residual Data Syntax:
residual( ) {
  if( !entropy_coding_mode_flag )
    residual_block = residual_block_cavlc
  else
  ...
          residual_block( LumaLevel[ i8x8 * 4 + i4x4 ], 16 )
      } else {
        if( MbPartPredMode( mb_type, 0 ) == Intra_16x16 )
          for( i = 0; i < 15; i++ )
            Intra16x16ACLevel[ i8x8 * 4 + i4x4 ][ i ] = 0
        else
          for( i = 0; i < 16; i++ )
            LumaLevel[ i8x8 * 4 + i4x4 ][ i ] = 0
      }
  for( iCbCr = 0; iCbCr < 2; iCbCr++ )
    if( CodedBlockPatternChroma & 3 ) /* chroma DC residual present */
      residual_block( ChromaDCLevel[ iCbCr ], 4 )
    else
      for( i = 0; i < 4; i++ )
        ChromaDCLevel[ iCbCr ][ i ] = 0
  for( iCbCr = 0; iCbCr < 2; iCbCr++ )
    for( i4x4 = 0; i4x4 < 4; i4x4++ )
      if( CodedBlockPatternChroma & 2 ) /* chroma AC residual present */
        residual_block( ChromaACLevel[ iCbCr ][ i4x4 ], 15 )
      else
        for( i = 0; i < 15; i++ )
          ChromaACLevel[ iCbCr ][ i4x4 ][ i ] = 0
}

Example CABAC Decoding Scenario

[0055] Figure 7 illustrates an example CABAC decoding sequence for four syntax elements, in accordance with an embodiment of the present invention. The four syntax elements of this example include: coded block flag (CODED_BLOCK_FLAG), significant coefficient flag (SIG_COEFF_FLAG), last significant coefficient flag (LAST_SIG_COEFF_FLAG), and coefficient absolute level minus 1 (COEFF_ABSLVL_MIN1).

[0056] As can be seen, the decoding engine for the first bin for a particular syntax element has to execute the history table lookup (HIST_TBL_LKUP) and neighborhood macroblock A and macroblock B information parsing (NBINFO_CTXINC) to derive the context index increment (ctxIdxInc), whereas for subsequent bins ctxIdxInc is a function of the previous ctxIdx. In addition, and as previously explained, if the current context index and subsequent context index are the same, the updated variables provided by the state MPS write-back (STMPS_WRBACK) are forwarded to the decode (DEC) stage of the subsequent bit. Such a bit processing rate enables the 30 frame per second performance requirement for High Definition 1920x1080 interlace (1080I) or 1280x720 progressive (720P) formats used in the broadcast standard.
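The distinction between Bin 0 and subsequent bins can be made concrete by listing which stages each bin passes through. The sketch below is illustrative only. The labels DEC, NBINFO_CTXINC, and STMPS_WRBACK come from the description; HIST_TBL_LKUP is an assumed spelling of the history table lookup label; CTX_FINAL, STMPS_READ, and BINARISATION are hypothetical labels for the remaining stages. The sketch counts stage executions, not clock cycles, since overlap between pipelined bins is not modeled.

```python
from typing import List

# Stages that run for the first bin of a syntax element only.
BIN0_ONLY_STAGES = ["HIST_TBL_LKUP", "NBINFO_CTXINC"]
# Stages that run for every bin (three of these labels are hypothetical).
ALL_BIN_STAGES = ["CTX_FINAL", "STMPS_READ", "DEC", "BINARISATION", "STMPS_WRBACK"]


def stages_for_bin(bin_idx: int) -> List[str]:
    """Return the ordered stages executed for the bin with this index."""
    if bin_idx == 0:
        return BIN0_ONLY_STAGES + ALL_BIN_STAGES
    return list(ALL_BIN_STAGES)


def total_stage_executions(num_bins: int) -> int:
    """Count stage executions for one syntax element with num_bins bins."""
    return sum(len(stages_for_bin(i)) for i in range(num_bins))


# Usage: a three-bin syntax element runs 7 + 5 + 5 stage executions.
executions = total_stage_executions(3)
```

Skipping the two Bin 0 stages on later bins is one reason not all stages are executed for every bit.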

[0057] The foregoing description of the embodiments of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of this disclosure. It is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto.

Claims

What is claimed is:

1. A multi-stage content-adaptive binary arithmetic coding (CABAC) architecture device for decoding a video bit stream, comprising:
  a history table lookup stage adapted to operate on Bin 0 only, for detecting macroblocks neighboring a current macroblock, and for retrieving previously stored syntax elements information for the neighboring macroblocks;
  a neighborhood information context increment stage adapted to operate on Bin 0 only, for parsing the retrieved syntax elements information for the neighboring macroblocks, and calculating a context index increment for each syntax element;
  a context final stage adapted to operate on all bins, for calculating a final context for a particular syntax element;
  a state MPS read stage adapted to operate on all bins, for storing and retrieving a probability state index and a most probable symbol value using a stateMps value table and a context index as an offset into that table;
  a decode stage adapted to operate on all bins, for carrying out decode processes;
  a binarisation stage adapted to operate on all bins, for performing a bin match process for each syntax element; and
  a state MPS write-back stage adapted to operate on all bins, for writing updated probability state index and most probable symbol values back to the stateMPS value table corresponding to a current context index, wherein if the current context index and a subsequent context index are the same, the updated probability state index and most probable symbol values are forwarded to the decode stage of a subsequent bit.

2. The device of claim 1 wherein each stage is processed in one clock cycle.

3. The device of claim 1 wherein bits of the video bit stream are decoded in an average of four clock cycles.

4. The device of claim 1 wherein bits of the video bit stream are decoded at a rate that enables processing of 30 frames per second.

5. The device of claim 1 wherein each stage is implemented with at least one of a gate-level logic state machine and a memory.

6. The device of claim 1 wherein the history table lookup stage includes a macroblock history lookup table, and is further configured to calculate a table address for each of the neighboring macroblocks so that the previously stored syntax elements information can be retrieved.

7. The device of claim 1 wherein the retrieved syntax elements information for the neighboring macroblocks includes macroblock attributes, residual attributes, motion vector attributes, and sub-macroblock attributes.

8. The device of claim 1 wherein the decode stage includes a range table, a least probable symbol (LPS) transition table, and a most probable symbol (MPS) transition table, with the probability state index used as an offset to each table.

9. The device of claim 1 wherein the decode stage executes a DecodeDecision process, a DecodeBypass process, or a DecodeTerminate process for each bin.

10. The device of claim 1 wherein information provided for Bin 0 by the history table lookup stage and the neighborhood information context increment stage is reused for subsequent bins.

11. The device of claim 1 wherein the device is implemented as a system-on-chip (SOC).

12. A multi-stage context-adaptive binary arithmetic coding (CABAC) architecture device for decoding a video bit stream, comprising:
  a history table lookup stage adapted to operate on Bin 0 only, for detecting macroblocks neighboring a current macroblock, and for retrieving previously stored syntax elements information for the neighboring macroblocks;
  a neighborhood information context increment stage adapted to operate on Bin 0 only, for parsing the retrieved syntax elements information for the neighboring macroblocks, and calculating a context index increment for each syntax element;
  a decode stage adapted to operate on all bins, for carrying out decode processes; and
  a state MPS write-back stage adapted to operate on all bins, for writing updated probability state index and most probable symbol values back to a stateMPS value table corresponding to a current context index, wherein if the current context index and a subsequent context index are the same, the updated probability state index and most probable symbol values are forwarded to the decode stage of a subsequent bit.

13. The device of claim 12 wherein bits of the video bit stream are decoded in an average of four clock cycles.

14. The device of claim 12 wherein bits of the video bit stream are decoded at a rate that enables processing of 30 frames per second.

15. The device of claim 12 wherein each stage is implemented with at least one of a gate-level logic state machine and a memory.

16. The device of claim 12 wherein information provided for Bin 0 by the history table lookup stage and the neighborhood information context increment stage is reused for subsequent bins.

17. A multi-stage context-adaptive binary arithmetic coding (CABAC) architecture device for decoding a video bit stream, comprising:
  one or more stages adapted to operate on Bin 0 only, for detecting macroblocks neighboring a current macroblock, and for calculating a context index increment for each syntax element; and
  one or more stages adapted to operate on all bins, for writing updated probability state index and most probable symbol values back to a previous stage table corresponding to a current context index, wherein if the current context index and a subsequent context index are the same, the updated probability state index and most probable symbol values are forwarded to a decode stage of a subsequent bit corresponding to the subsequent context index.

18. The device of claim 17 wherein bits of the video bit stream are decoded at a rate that enables processing of 30 frames per second.

19. The device of claim 17 wherein each stage is implemented with at least one of a gate-level logic state machine and a memory.

20. The device of claim 17 wherein information provided by the one or more stages adapted to operate on Bin 0 only is reused for subsequent bins.