WO2008037113A1 - Apparatus and method for processing video data - Google Patents
- Publication number
- WO2008037113A1 (PCT/CN2006/002518)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- unit
- data
- module
- instruction
- filter
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44004—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving video buffer management, e.g. video decoder buffer or video display buffer
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/42—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
- H04N19/423—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation characterised by memory arrangements
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/44—Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/60—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
- H04N19/61—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/438—Interfacing the downstream path of the transmission network originating from a server, e.g. retrieving encoded video stream packets from an IP network
- H04N21/4382—Demodulation or channel decoding, e.g. QPSK demodulation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N9/00—Details of colour television systems
- H04N9/79—Processing of colour television signals in connection with recording
- H04N9/80—Transformation of the television signal for recording, e.g. modulation, frequency changing; Inverse transformation for playback
- H04N9/804—Transformation of the television signal for recording, e.g. modulation, frequency changing; Inverse transformation for playback involving pulse code modulation of the colour picture signal components
Definitions
- This invention relates to an apparatus and a method for processing video data.
- the processing can be performed in the context of decoding video data.
- the decoding procedure mainly includes four stages: entropy or bit-stream decoding, inverse transformation and inverse quantization, motion compensation, and de-blocking filter (except for MPEG2) .
- For supporting high-resolution HD video, a high-performance decoding process is required.
- All current video standards use macroblocks (MBs), particularly MBs of 16x16 pixels, as the luma processing unit.
- the MB can be divided into sixteen sub-blocks of 4x4 pixels.
- the corresponding colour or chroma data unit (Cb and Cr) is the 8x8 pixel block, which can be divided into sixteen 2x2 pixel blocks.
- the MBs are processed one by one, i.e. processing of a new MB begins after the previous MB is finished, and each processing block handles one MB at a time.
- Entropy decoding E for a MB comprises decoding the non-residual 10a and decoding the residual syntax element 10b.
- inverse transformation and inverse quantization ITIQ are performed 10c.
- motion compensation MC the prediction data are computed 10d and the picture data are reconstructed 10e.
- the single blocks work simultaneously, but all on the same MB. Each block starts working when it has enough input data from the previous block.
- the duration of the process per MB is the cycle number c10 from decoding the first MB level syntax to getting the reconstructed data for the last sub-block.
- the same steps 11a-11e are performed for the next MB, wherein the first step of decoding 11a is executed after the last step of reconstructing the current MB 10e is finished.
- the present invention provides a universal, modular and decentralized processing flow that enables high performance processing of video data according to a plurality of encoding standards. Moreover, the single function blocks can be used for a plurality of coding formats and standards.
- each of the different video standards has its special features.
- the proposed architecture uses a combination of hardware and firmware (i.e. software that is not modified during normal operation and that is adapted to interact with particular hardware) to meet the requirements of different applications.
- the firmware implements the different video standard algorithms, while the hardware provides a modular platform that is adapted for the implementation. That means firmware code can be added to support a particular video standard, or removed to drop support for it. Thus, it is possible to adapt the decoder later to new standards.
- the interface between hardware and firmware is the instruction set.
- the hardware architecture comprises elements of a conventional RISC processor and re-programmable video processing function blocks, which are embedded into the structure of the RISC processor. That means e.g. that the video processing function blocks use the same channels for inter-block communication as the conventional RISC processing blocks, such as arithmetic-logic unit (ALU) , fetch unit, queue unit etc.
- the video decoding function blocks are sub-units within a specialized RISC processor.
- RISC is a processor design philosophy that uses a set of simple instructions, each of which takes roughly the same, short amount of time to execute, in contrast to the more complex, variable-latency instructions of a complex instruction set computer (CISC).
- the single function blocks of the architecture can be re-programmed to comply with new formats and standards.
- the multi-standard decoder adaptable for all current video standards uses 4X4 pixel blocks for luma and 2X2 pixel blocks for chroma (Cb and Cr) as the minimum processing unit. Although blocks of this size are not employed in some video standards, it is possible to support the minimum processing unit also for those video standards, including MPEG2.
- the function blocks are controlled in a decentralized manner.
- a device for decoding video data comprises at least means for providing decoded instructions, a queuing unit for receiving the decoded instructions and receiving result data, and for providing instructions on an instruction bus, an arithmetic-logic unit (ALU) and a data cache unit receiving instructions through the instruction bus and providing data to the queuing unit, a motion compensation unit, an ITIQ unit for performing inverse transformation (namely inverse DCT) and inverse quantization, an entropy decoding unit, and a filter unit, wherein the motion compensation unit, the ITIQ unit, the entropy decoding unit and the filter unit receive instructions through the instruction bus and provide data to the queuing unit.
- Fig.1 a conventional video data processing flow
- Fig.2 a pipelined video data processing flow
- Fig.4 the position of macroblocks within a picture
- Fig.5 an architecture comprising video processing modules embedded in a RISC processor
- Fig.6 details of the motion compensation module.
- the present invention uses a dedicated architecture and a corresponding instruction set.
- the instruction set can be divided into two parts, namely the general instructions similar to the conventional RISC (reduced instruction set computer) instructions, and the specialized instructions dedicated to video decoding.
- the general instructions are mainly used for controlling the decoding procedure, and the specialized instructions are mainly used for processing the computation during the decoding procedure.
- the instructions are 32 bit wide.
- the video data to be processed and the instructions are stored in SDRAMs.
- the architecture according to the invention uses a pipeline for instruction processing. As shown in Fig.3, any instruction execution can be divided into the following five stages:
- Fetch: fetch the instruction from the SDRAM
- Decode: translate the instruction's format into the internal format
- Issue: send the instruction from the queue to the appropriate function module
- Execute: perform the operation in the function module
- Write back: return the result to the registers
- a first instruction i1 starts with being fetched.
- in the next phase c2 it is translated into the internal format, while the next instruction i2 is being fetched.
- the fetched first instruction i1 is stored in a pipeline.
- in the next phase c3, while the two previous instructions i1,i2 are in the pipeline, a new instruction i3 starts.
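The phase-by-phase overlap described above can be sketched with a small, purely illustrative timing model. The one-cycle-per-stage assumption and the absence of stalls are simplifications not stated in the text:

```python
# Illustrative sketch: a cycle-by-cycle trace of the five-stage instruction
# pipeline. Stage names follow the document; the ideal timing (one cycle per
# stage, no stalls) is an assumption made for the example.

STAGES = ["Fetch", "Decode", "Issue", "Execute", "WriteBack"]

def pipeline_trace(n_instructions):
    """Return {cycle: [(instr, stage), ...]} for an ideal pipeline."""
    trace = {}
    for i in range(n_instructions):
        for s, stage in enumerate(STAGES):
            cycle = i + s          # instruction i enters stage s at cycle i+s
            trace.setdefault(cycle, []).append((f"i{i+1}", stage))
    return trace

trace = pipeline_trace(3)
# In the third cycle, i1 is in Issue, i2 in Decode, i3 in Fetch:
assert trace[2] == [("i1", "Issue"), ("i2", "Decode"), ("i3", "Fetch")]
# Total cycles for 3 instructions = 3 + 5 - 1 = 7, instead of 3 * 5 = 15:
assert max(trace) + 1 == 3 + len(STAGES) - 1
```

The last assertion shows the point of pipelining: after the pipeline fills, one instruction completes per cycle.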
- Fig.2 shows a generalized pipelined video data processing flow according to one aspect of the invention.
- the currently processed pixel data are copied into a pixel buffer for faster access.
- Input data are processed in an entropy decoding stage E by first decoding the non-residual data 20a and then decoding the residual data 20b, for which the decoded non-residual data are required. While decoded data are output of the residual data decoding procedure 20b, they are successively passed (through the queuing unit, not shown here) to the next step 20c of inverse transformation and inverse quantization ITIQ.
- the entropy decoding stage E waits for a certain time after it has processed its data 20b and before it starts processing new data 21a, to prevent buffer overflow due to slower units, e.g. motion compensation MC.
- At least the specialized video function modules can hold two or more MBs to be processed in parallel. If only two MBs in parallel are supported, the buffer for storing MVs and residual data in the related modules stores the MVs and residual data for the two MBs. Simultaneous processing of three or more MBs can be supported if additional buffer space is available within the modules.
- the hardware architecture can include five parts: an instruction fetch part, an instruction decoding part, an instruction issuing part, an instruction execution part and a result return part.
- the architecture is shown in Fig.5.
- the instruction fetch part includes an instruction cache interface module 51, an instruction cache module 52 and the actual fetch module 53 including a program counter PC.
- the instruction decoding part includes the decoding module 54, and the instruction issuing part includes a queue module 55.
- the instruction execution part includes a data cache module 57, a data cache interface module 58, an ALU module 59, a motion compensation module 510, a motion compensation interface module 511, an Inverse Transform/Inverse Quantization (ITIQ) module 512, an entropy decoder module 513, an entropy decoder interface module 514, a de-blocking filter module 515, a filter interface module 516 and a result arbiter module 56.
- the result arbiter module 56 sends intermediate results, i.e. results that require further processing, back to the queue module 55.
- the input data come from an SDRAM via the "visiting SDRAM bus", and the final results are returned to the same SDRAM using the same bus. Alternatively, a separate bus could be used for the returned data.
- the result return part includes a visiting bus arbiter module 517.
- the instruction cache module 52 is mainly responsible for providing the instructions in this architecture. Through it, the instructions can be accessed faster than directly through the external SDRAM, since it stores instructions in an internal SRAM. The next instruction is determined by a program counter PC within the fetch module 53. If the access hits, i.e. if the determined instruction is cached in the SRAM of the instruction cache 52, the instruction cache module 52 sends the instruction data back. If the access misses, which means that the desired instruction does not exist in the SRAM of the instruction cache, then a command for getting the corresponding instruction from the SDRAM is issued to the instruction cache interface module 51. After the instruction is acquired from the instruction cache interface module 51, the instruction data are provided to the instruction cache module 52.
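The hit/miss behaviour of the instruction cache can be sketched as follows. The dict standing in for the SDRAM and the instruction strings are invented for the example:

```python
# A minimal, hypothetical model of the hit/miss behaviour described for the
# instruction cache module 52: on a hit the cached word is returned; on a
# miss it is fetched through the interface module (modelled here as a plain
# dict standing in for the SDRAM) and cached before being returned.

class InstructionCache:
    def __init__(self, sdram):
        self.sdram = sdram      # backing store (the external SDRAM)
        self.sram = {}          # internal SRAM: address -> instruction
        self.hits = 0
        self.misses = 0

    def fetch(self, pc):
        if pc in self.sram:             # access hits
            self.hits += 1
        else:                           # access misses: go through interface
            self.misses += 1
            self.sram[pc] = self.sdram[pc]
        return self.sram[pc]

sdram = {0: "ld r1", 4: "add r2", 8: "jmp 0"}
icache = InstructionCache(sdram)
icache.fetch(0); icache.fetch(4); icache.fetch(0)
assert (icache.hits, icache.misses) == (1, 2)  # two cold misses, one hit
```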
- the fetch module 53 is responsible for determining the PC value according to the procedure of the program execution.
- the PC value is sent to the instruction cache module 52. If a jump or branch instruction is encountered, the PC value in the fetch module 53 is changed accordingly; otherwise, it is automatically increased by a defined increment.
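The PC-update rule can be sketched as below. The tuple encoding of instructions and the 4-byte increment (matching the 32-bit instruction width mentioned earlier) are assumptions made for the illustration:

```python
# Sketch of the PC update rule: on a taken jump or branch the PC is replaced
# by the target, otherwise it advances by a fixed increment. The instruction
# encoding (op, target, taken) is invented for the example.

INCREMENT = 4  # 32-bit instructions, byte-addressed (assumed)

def next_pc(pc, instr):
    op, target, taken = instr
    if op in ("jmp", "branch") and taken:
        return target           # jump/branch met: PC changed accordingly
    return pc + INCREMENT       # otherwise: automatic increment

assert next_pc(100, ("add", None, False)) == 104
assert next_pc(100, ("branch", 40, True)) == 40
assert next_pc(100, ("branch", 40, False)) == 104  # not-taken branch falls through
```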
- the decode module 54 decodes the instruction, i.e. it transfers the external format into an internal instruction format.
- the external format depends on the firmware, while the internal format is used by the function module that will receive the instruction.
- after being decoded into the internal format by the decode module 54, the instructions are sent to the queue module 55, where they are stored, in principle in a FIFO (first-in-first-out) manner, in an operation queue 550, waiting to be issued to the function modules.
- the queue module 55 further comprises general registers 551 and specialized registers 552. When, for the instruction that is first in the queue, the corresponding function module is not busy and all of the related source registers' values for this instruction are prepared, then the instruction is put on the issue bus IB, along with the data read from the general registers 551 and the specialized registers 552. Some instructions on the issue bus IB however may require no further data to be provided.
- the general registers 551 provide data on a general data bus GDB, which is e.g. 32 bit wide, and the specialized data registers provide data on a special data bus SDB, which is e.g. 128 bit wide.
- every function module monitors the common issue bus IB and accepts instructions that are directed to it. Instructions can be conventional RISC processor instructions and can be addressed as in conventional RISC processors, e.g. by an address portion within the instruction. After execution in the respective functional module, the result is sent back via an intermediate result bus IRB to the queue module 55, and the queue module updates its destination registers.
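The issue condition (target function module idle, all source registers ready) can be sketched as a small scoreboard. The module and register names are made up for the example:

```python
# Illustrative scoreboard for the issue rule: the head of the operation
# queue is issued only when its target function module is idle and every
# source register it reads has a valid value.

from collections import deque

def try_issue(queue, busy, ready_regs):
    """Issue the head instruction if the rule allows it; return it or None."""
    if not queue:
        return None
    module, sources = queue[0]
    if busy.get(module) or not all(r in ready_regs for r in sources):
        return None                     # head must wait; nothing issues
    return queue.popleft()

q = deque([("MC", ["r1", "r2"]), ("ITIQ", [])])
# r2 not ready yet -> the MC instruction (and everything behind it) waits:
assert try_issue(q, busy={"MC": False}, ready_regs={"r1"}) is None
# once r2 is ready, the head issues:
assert try_issue(q, busy={"MC": False}, ready_regs={"r1", "r2"}) == ("MC", ["r1", "r2"])
```

Note that in this in-order model an instruction behind a stalled head also waits, which matches the FIFO behaviour described for the operation queue 550.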
- the queue module 55 can in a way be regarded as the control centre of the architecture. Though the processing is more decentralized than in conventional video decoding systems, the queue module controls the instruction flow.
- the RISC processor elements that control the decoding process e.g. the queue, are directly involved in the decoding process, so that only little communication between modules is necessary for the assignment of new data and instructions to the function modules.
- the data cache module 57 contains an SRAM to enable faster access to the picture data than directly through the external SDRAM. This module is mainly responsible for performing data load and store operations. When it captures from the issue bus IB an instruction for accessing the data cache, it calculates the access address according to the data of the instruction. For each data access, it first checks if the data exist in its SRAM. If the access of a store operation hits, the data in the SRAM of the data cache module 57 are updated. If the access of a load operation hits, the data are read and sent to the intermediate result bus IRB.
- The entropy module 513 is the start point of the decoding procedure, obtaining all the elements for reconstructing the pictures from the encoded bit-stream. It decodes from the bit-stream the syntax elements according to the utilized video standard, including e.g. differential motion vector (mvd), reference index, residual data etc. This module performs various computations, incl. motion vector computation according to the mvd, computing the intra mode according to pred_mode_flag and intra_luma_pred_mode, and computing the neighbour information for decoding the syntax elements.
- mvd differential motion vector
- the entropy module may automatically read the bit-stream to be decoded from an external SDRAM according to an address, which the programmer can set in the instruction.
- the entropy module 513 works together with the entropy interface module 514 to obtain the bit-stream from the SDRAM. If the entropy module is idle because it has currently no bit-stream data to process, it may send a request for data to the entropy interface module 514.
- the entropy interface module either sends back the required data to the entropy module 513, or if it has no data to provide then it may send a request for data to the SDRAM.
- the motion compensation (MC) module 510 includes two parts or sub-modules (not shown in Fig.5): intra MC for intra prediction and inter MC for inter prediction.
- for intra prediction, the prediction mode and residual data, which the entropy module previously decoded from the compressed bit-stream, are sent to the intra MC sub-module.
- the intra MC sub-module is invoked by an instruction, calculates the prediction of a current 4x4 block, adds the prediction and residual data and thus gets the motion compensated (i.e. reconstructed) data for the block.
- the inter MC sub-module performs the inter motion compensation.
- this part needs to find appropriate integer samples based on the motion vectors and reference index (refidx) of a sub-block (4x4 block for luma, 2x2 block for Cb and Cr chroma). Then the fractional prediction samples are derived through interpolation.
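The interpolation step can be sketched as follows. The actual filter is standard-dependent (H.264, for instance, uses a 6-tap filter for luma half-pel positions); the bilinear filter below is a deliberately simple stand-in to show how a fractional-position sample is derived from the integer samples selected by the motion vector:

```python
# Hedged sketch: bilinear fractional-sample prediction at quarter-pel
# precision. The filter choice is an illustrative simplification, not the
# filter any particular standard mandates.

def predict_sample(ref, x, y, mvx_q4, mvy_q4):
    """Bilinear prediction at position (x, y) with MV in 1/4-pel units."""
    ix, fx = (x * 4 + mvx_q4) // 4, (x * 4 + mvx_q4) % 4  # integer + frac parts
    iy, fy = (y * 4 + mvy_q4) // 4, (y * 4 + mvy_q4) % 4
    a = ref[iy][ix]
    b = ref[iy][ix + 1] if fx else a
    c = ref[iy + 1][ix] if fy else a
    d = ref[iy + 1][ix + 1] if fx and fy else (b if fx else c)
    # weight the four surrounding integer samples by the fractional offsets
    return ((4 - fx) * (4 - fy) * a + fx * (4 - fy) * b
            + (4 - fx) * fy * c + fx * fy * d + 8) // 16

ref = [[0, 16], [16, 32]]
assert predict_sample(ref, 0, 0, 0, 0) == 0    # integer position: no filtering
assert predict_sample(ref, 0, 0, 2, 0) == 8    # half-pel between 0 and 16
assert predict_sample(ref, 0, 0, 2, 2) == 16   # centre of the 2x2 patch
```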
- the MC interface module 511 provides the reference data for inter prediction in the inter MC sub-module. If currently required reference data for the inter MC sub-module are not available in the buffers of the MC interface module 511, the MC interface module 511 sends a request to the SDRAM to obtain those data. After the required data were returned to the MC interface module, they are stored in buffers and sent to the MC module 510.
- the Inverse Transform and Inverse Quantization (ITIQ) module 512 is responsible for inverse scanning, inverse transformations and inverse quantization operations on 4*4 pixel sub-blocks of the residual data. It returns its result via the intermediate result bus IRB into the queue module 55.
- the data that are required by the ITIQ module are provided by the respective instruction.
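The inverse-scan part of the ITIQ step can be sketched as below for one 4x4 residual sub-block. The flat scaling rule is an illustrative simplification; real standards use position-dependent scaling matrices and apply the inverse transform after this step:

```python
# Hedged sketch of inverse scan + inverse quantization for a 4x4 sub-block.
# ZIGZAG_4x4 is the common 4x4 zig-zag scan order; the flat qstep scaling
# is a simplifying assumption.

ZIGZAG_4x4 = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]

def inverse_scan_and_quant(levels, qstep):
    """Map 16 scanned levels back to a 4x4 block and rescale by qstep."""
    block = [[0] * 4 for _ in range(4)]
    for scan_pos, raster_pos in enumerate(ZIGZAG_4x4):
        block[raster_pos // 4][raster_pos % 4] = levels[scan_pos] * qstep
    return block

levels = [5, -1, 2] + [0] * 13            # typical sparse residual
block = inverse_scan_and_quant(levels, qstep=4)
assert block[0][0] == 20                  # DC coefficient rescaled
assert block[0][1] == -4                  # scan position 1 -> raster position 1
assert block[1][0] == 8                   # scan position 2 -> raster position 4
```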
- the filter module 515 is applied to every decoded macro- block (MB) for reducing blocking distortion.
- the filter smoothes block edges, thus improving the appearance of the decoded frames.
- the filter module 515 can deal with the filter process of a MB (not mbaff) or a MB pair (mbaff). It receives the required data of a current MB for filtering, such as MVs, "non-zero" information, frame or field flag, the pixel data etc. through the instructions. For the mbaff mode case, it reads those data of the other MB from a filter interface module 516.
- the filter interface module 516 is for storing and providing the neighbour MVs and the pixel data of the neighbour 4x4 sub-block, and for storing the loop-filtered and finally processed data into the SDRAM. If the neighbour information and filtered data were stored into the SDRAM directly, the process would be very slow. Therefore these data are stored into a buffer within the filter interface module 516, and then stored in the SDRAM using a burst write function, e.g. when the buffer is full. Thus the SDRAM efficiency is improved significantly. A burst read operation from the SDRAM to the interface module can in principle also be used.
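The buffering-plus-burst-write idea can be sketched as follows; the burst length and the list standing in for the SDRAM are invented for the example:

```python
# Sketch: writes accumulate in an internal buffer and go to the (simulated)
# SDRAM in one burst when the buffer fills, so the number of SDRAM
# transactions drops from one per word to one per burst.

class BurstWriter:
    def __init__(self, sdram_bursts, burst_len=8):
        self.sdram_bursts = sdram_bursts  # list of bursts written to SDRAM
        self.burst_len = burst_len
        self.buffer = []

    def write(self, word):
        self.buffer.append(word)
        if len(self.buffer) == self.burst_len:   # buffer full -> burst write
            self.flush()

    def flush(self):
        if self.buffer:
            self.sdram_bursts.append(list(self.buffer))
            self.buffer.clear()

bursts = []
w = BurstWriter(bursts, burst_len=4)
for word in range(10):
    w.write(word)
w.flush()                                  # drain the partial final burst
assert bursts == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
assert len(bursts) == 3                    # 3 SDRAM transactions instead of 10
```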
- Several function modules return intermediate results, which require further processing and which are sent back to the queue module.
- these modules are an arithmetic-logic unit (ALU) 59, the data cache 57, the entropy module 513, the ITIQ 512, and the MC block 510.
- ALU arithmetic-logic unit
- since the queue module can accept only one result at a time, a result bus arbiter module 56 selects one result at a time and transfers it via the intermediate result bus IRB to the queue module 55.
- the result bus arbiter module may have internal buffers to store the results received from the function blocks while waiting for the intermediate result bus IRB.
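The result-arbitration behaviour can be sketched as below: several function modules may finish in the same cycle, but the intermediate result bus carries one result at a time, so the arbiter buffers pending results and releases them one by one. The fixed priority order is an assumption; the text does not specify a selection policy:

```python
# Illustrative result arbiter: buffers results from the function modules and
# grants the IRB to one of them per call. The priority order is invented.

from collections import deque

class ResultArbiter:
    PRIORITY = ["entropy", "MC", "ITIQ", "ALU", "dcache"]

    def __init__(self):
        self.pending = deque()          # internal buffers for waiting results

    def submit(self, module, result):
        self.pending.append((module, result))

    def grant(self):
        """Select one buffered result for the IRB (highest priority first)."""
        if not self.pending:
            return None
        chosen = min(self.pending, key=lambda r: self.PRIORITY.index(r[0]))
        self.pending.remove(chosen)
        return chosen

arb = ResultArbiter()
arb.submit("ALU", 7)
arb.submit("MC", "block0")
assert arb.grant() == ("MC", "block0")   # MC outranks ALU in this example
assert arb.grant() == ("ALU", 7)
assert arb.grant() is None               # nothing left to transfer
```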
- the visiting bus arbiter module 517 selects one bus request to be active at a time, according to predefined priorities for the different interface modules.
- the fetch module 53 fetches an instruction from the instruction cache module 52 according to the program counter in the fetch module.
- the instruction is sent via the instruction decoder module 54 to the instruction queue module 55.
- the instructions in the instruction queue module 55 are issued to the related function module according to the respectively required operation.
- the function module performs its processing according to the instruction.
- the operation result is returned via the intermediate result bus IRB to the register file.
- the function modules may send request signals to their respectively related interface module if required data are missing.
- the video decoding specific function modules such as entropy decoder 513, ITIQ 512, motion compensation 510 and de-blocking filter 515, can be configured depending on a particular application to perform the actual operation required for decoding the respective coding format.
- the configuration can be based on firmware or software.
- the motion compensation block can perform certain operations for decoding according to MPEG-4 Video standard, and other operations according to the AVC standard.
- the decoding procedure is controlled by the program, which always uses the same, defined instruction set. Further, the SDRAM storage space is shared by the program, the input bit-stream, the output decoding result and temporary data created during program execution. Before the decoding, some parts of the bit-stream are automatically put into the SDRAM by the related hardware.
- new parts of the bit-stream are successively stored in the SDRAM little by little automatically. During the decoding procedure, the decoder uses the bit-stream little by little. At the same time, the reconstructed data, being the picture data computed by the decoding architecture, are stored into the SDRAM. The different stages of the processing however use separate areas of the SDRAM.
- the entropy module 513 and the entropy interface module 514 automatically read the bit-stream from a fixed SDRAM space according to the corresponding address in the entropy interface module 514.
- the address is increased by hardware, wherein the address will continue at the minimum address after the maximum address of the bit-stream address space in the SDRAM is reached.
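The wrap-around rule amounts to a circular buffer over the bit-stream address window in the SDRAM. The window bounds below are illustrative values, not figures from the text:

```python
# Sketch of the hardware address-increment rule: the address advances
# linearly and wraps back to the minimum address once the maximum address
# of the bit-stream space is passed.

def advance_bitstream_addr(addr, step, min_addr, max_addr):
    """Advance addr by step, wrapping to min_addr past max_addr."""
    size = max_addr - min_addr + 1
    return min_addr + (addr - min_addr + step) % size

MIN_ADDR, MAX_ADDR = 0x1000, 0x1FFF    # 4 KiB bit-stream window (assumed)
assert advance_bitstream_addr(0x1FF0, 0x10, MIN_ADDR, MAX_ADDR) == 0x1000  # wraps
assert advance_bitstream_addr(0x1000, 0x20, MIN_ADDR, MAX_ADDR) == 0x1020  # linear
```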
- the de-blocking filter module 515 and the filter interface module 516 store the decoded result into a fixed SDRAM space automatically according to a corresponding address that is provided by the program.
- the decoding procedure controlled by the firmware can be divided into three steps. The first step is to decode the parameters on picture or slice level: if those parameters of the picture or slice level (such as QP, weighted prediction parameters, picture size, slice type etc.) are useful for decoding the other syntax elements, they will be stored in so-called global registers. The global registers are connected with the function modules, and control the instruction execution. The second step is to decode the syntax elements on MB level: these elements are decoded one by one. Like the picture or slice level parameters, these elements (such as macro-block type, frame or field flag) are stored into the global registers if they will control the other function modules.
- This architecture allows that the entropy module, ITIQ module, MC module and filter module are working in parallel on different MBs.
- the third step is post decoding: after decoding all the elements of a macro-block, the firmware computes the next MB position in the whole picture. For the last MB of the picture, the firmware will do the DPB (decoded picture buffer) management, and then continue with decoding the next picture.
- the basic processing unit is the MB.
- the position of a MB in the whole picture is defined as shown in Fig.4, assuming the size of the picture is M*N.
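The firmware's next-MB computation can be sketched as below, assuming the raster-scan MB numbering suggested by Fig.4 for an M*N picture (M MB columns, N MB rows); the coordinate convention is an assumption:

```python
# Sketch: compute the position of the next MB in raster order for a picture
# of M x N macroblocks; returning None marks the last MB, where the firmware
# would perform DPB management before the next picture.

def next_mb(x, y, M, N):
    """Return the MB position after (x, y), or None after the last MB."""
    if x + 1 < M:
        return (x + 1, y)       # next MB in the same row
    if y + 1 < N:
        return (0, y + 1)       # wrap to the start of the next row
    return None                 # last MB of the picture

M, N = 120, 68                  # e.g. 1920x1080: 120x68 MBs of 16x16 pixels
assert next_mb(0, 0, M, N) == (1, 0)
assert next_mb(119, 0, M, N) == (0, 1)
assert next_mb(119, 67, M, N) is None
```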
- the execution within each of the functional modules is similar. Taking the MC module 510 as an example, the execution can be divided into the following steps, cf. Fig.5.
- the instruction (in internal format, i.e. decoded) which was received in the queue module 55 from the decode module, and if necessary queued in the operation queue 550, is sent into the motion compensation module 510.
- the instruction brings some data along, e.g. motion vectors and/or the residual data. These data may be stored in an internal buffer MCBUF.
- the MC module 510 begins to execute the instruction. During execution, if the required data are available within the internal buffer MCBUF of the MC module 510 (e.g. from the previous MB) , those data will be used immediately. If the reference data are missing, the MC module sends a request signal to the MC interface module 511. If the MC interface module 511 finds those data in its internal buffer, then it returns these data to the MC module 510. Otherwise, the MC interface module 511 sends a request to the visiting bus arbiter module 517 which connects to the external SDRAM. The visiting bus arbiter 517 gets requests from all the interface modules, and selects one to visit the SDRAM and get the data.
- the motion compensation result is sent to the result arbiter module 56, which gets all the results from the function modules and selects one after the other for returning to the queue module 55.
- the result data after execution are written back to the registers 551,552 in the queue module 55, and the value in the registers 551,552 of the queue module 55 is updated.
- An advantage of the invention is that the idle time of processing blocks is reduced. This leads to an improved efficiency, namely either less power consumption with a similar performance, or increased performance with comparable power consumption.
- An improved device for decoding video data comprises common elements of a RISC processor, including instruction providing unit, queuing unit and ALU, and special video processing modules, wherein the video processing modules are embedded in the RISC processor, so that they also receive instructions through the instruction bus and provide data to the queuing unit like the common RISC processor elements.
- the special video processing modules include a MC unit, means for performing IDCT and inverse quantization, an entropy decoding unit and a filter unit .
- the invention is advantageous for video decoding products, particularly for HD resolution decoders implemented in a modular fashion in hardware or software, such as e.g. multi-standard decoders for H.264/AVC, VC-1, MPEG-2 etc.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
Video decoding includes very similar processing steps for different standards. The processing can work independently and in parallel in separate modules. Known multi-standard video decoders suffer from bottlenecks resulting from centrally organized processing. An improved apparatus for decoding video data comprises common elements of a RISC processor, including instruction providing unit (51,52,53, 54), queuing unit (55) and ALU (59), and special video processing modules, wherein the video processing modules are embedded in the RISC processor, so that they also receive instructions through the instruction bus (IB) and provide (IRB) data to the queuing unit (55), like the common RISC processor elements. The special video processing modules include a motion compensation unit (510), means (512) for performing IDCT and inverse quantization, an entropy decoding unit (513) and a filter unit (515).
Description
Apparatus and method for processing video data
Field of the invention
This invention relates to an apparatus and a method for processing video data. In particular, the processing can be performed in the context of decoding video data.
Background
For today's video standards, e.g. MPEG2, AVS, VC-1 and H.264, the decoding procedure mainly includes four stages: entropy or bit-stream decoding, inverse transformation and inverse quantization, motion compensation, and de-blocking filter (except for MPEG2). For supporting high-resolution HD video, a high-performance decoding process is required. All current video standards use macroblocks (MBs), particularly MBs of 16x16 pixels, as the luma processing unit. The MB can be divided into sixteen sub-blocks of 4x4 pixels. The corresponding colour or chroma data unit (Cb and Cr) is the 8x8 pixel block, which can be divided into sixteen 2x2 pixel blocks.
It is desirable to have a decoder chip that can process all current standards. The traditional approach is to put the individual decoding cores into one chip. However, the gate count of that chip will be high: though function blocks for different standards are similar, the processing details differ. Therefore function blocks for different standards are usually implemented in parallel. Further, programmable architectures exist in which the actual video processing is performed by software programs or in which the function blocks for video processing are controlled by separate
processing cores. This requires a high amount of control information between the function blocks and the processing cores, usually on shared data buses.
Conventionally, the MBs are processed one by one, i.e. processing of a new MB begins after the previous MB is finished, and each processing block handles one MB at a time. This is depicted in Fig.1. Entropy decoding E for a MB comprises decoding the non-residual syntax elements 10a and decoding the residual syntax elements 10b. Then, inverse transformation and inverse quantization ITIQ are performed 10c. In the next step, motion compensation MC, the prediction data are computed 10d and the picture data are reconstructed 10e. The single blocks work simultaneously, but all on the same MB. Each block starts working when it has enough input data from the previous block. The duration of the process per MB is the cycle number c10 from decoding the first MB level syntax to getting the reconstructed data for the last sub-block. The same steps 11a-11e are performed for the next MB, wherein the first step of decoding 11a is executed only after the last step 10e of reconstructing the current MB is finished.
Summary of the Invention
In order to reduce the gate count of a multi-standard video decoding chip, a uniform architecture is desirable that can support the decoding of several video standards. Further, known video processing systems suffer from the bottlenecks that result from centrally organized processing stages with shared data busses, shared memories and centralized control units that reduce the processing performance. The present invention provides a universal, modular and decentralized
processing flow that enables high performance processing of video data according to a plurality of encoding standards. Moreover, the single function blocks can be used for a plurality of coding formats and standards.
Each of the different video standards has its special features. In order to support all of the video standards, the proposed architecture uses a combination of hardware and firmware (i.e. software that is not modified during normal operation and that is adapted to interact with particular hardware) to meet the requirements of different applications. The firmware implements the different video standard algorithms, while the hardware provides a modular platform that is adapted for the implementation. That means firmware code can be added to support a particular video standard, or removed to drop support for a particular video standard. Thus, it is possible to adapt the decoder later to new standards. The interface between hardware and firmware is the instruction set.
According to one aspect of the invention, the hardware architecture comprises elements of a conventional RISC processor and re-programmable video processing function blocks, which are embedded into the structure of the RISC processor. That means e.g. that the video processing function blocks use the same channels for inter-block communication as the conventional RISC processing blocks, such as the arithmetic-logic unit (ALU), fetch unit, queue unit etc. In principle, the video decoding function blocks are sub-units within a specialized RISC processor. RISC is a processor design philosophy that uses a simple set of instructions that however take about the same amount of time to execute as a corresponding more complex set of instructions on a complex instruction set computer (CISC).
In one embodiment of the invention, the single function blocks of the architecture can be re-programmed to comply with new formats and standards.
According to one aspect of the invention the multi-standard decoder adaptable for all current video standards uses 4X4 pixel blocks for luma and 2X2 pixel blocks for chroma (Cb and Cr) as the minimum processing unit. Although blocks of this size are not employed in some video standards, it is possible to support the minimum processing unit also for those video standards, including MPEG2.
According to one aspect of the invention, the function blocks are controlled in a decentralized manner.
According to one aspect of the invention, a device for decoding video data comprises at least means for providing decoded instructions, a queuing unit for receiving the decoded instructions and receiving result data, and for providing instructions on an instruction bus, an arithmetic-logic unit (ALU) and a data cache unit receiving instructions through the instruction bus and providing data to the queuing unit, a motion compensation unit, an ITIQ unit for performing inverse transformation (namely inverse DCT) and inverse quantization, an entropy decoding unit, and a filter unit, wherein the motion compensation unit, the ITIQ unit, the entropy decoding unit and the filter unit receive instructions through the instruction bus and provide data to the queuing unit.
Advantageous embodiments of the invention are disclosed in the dependent claims, the following description and the figures.
Brief description of the drawings
Exemplary embodiments of the invention are described with reference to the accompanying drawings, which show in
Fig.l a conventional video data processing flow;
Fig.2 a pipelined video data processing flow;
Fig.3 pipeline stages of instruction execution;
Fig.4 the position of macroblocks within a picture;
Fig.5 an architecture comprising video processing modules embedded in a RISC processor; and
Fig.6 details of the motion compensation module.
Detailed description of the invention
The present invention uses a dedicated architecture and a corresponding instruction set. The instruction set can be divided into two parts, namely the general instructions similar to the conventional RISC (reduced instruction set computer) instructions, and the specialized instructions dedicated to video decoding. The general instructions are mainly used for controlling the decoding procedure, and the specialized instructions are mainly used for processing the computation during the decoding procedure. Exemplarily, the instructions are 32 bit wide.
The video data to be processed and the instructions are stored in SDRAMs. The architecture according to the invention uses a pipeline for instruction processing. As
shown in Fig.3, any instruction execution can be divided into the following five stages:
Fetch: fetch the instruction from the SDRAM;
Decode: translate the instruction's format into the internal format;
Issue: issue the instruction to the function modules;
Execute: execute the instruction in the function modules;
Return: return the execution result.
E.g. in one phase c1 a first instruction i1 starts by being fetched. In the next phase c2 it is translated into the internal format, while the next instruction i2 is being fetched. In this phase, the fetched first instruction i1 is stored in the pipeline. In the next phase c3, while the two previous instructions i1, i2 are in the pipeline, a new instruction i3 starts.
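The overlapping of instructions across the five stages can be sketched with a small timing model (a minimal sketch, assuming one stage per cycle and no stalls; the function and names are illustrative, not part of the described hardware):

```python
# Hypothetical model: cycle-by-cycle occupancy of the five pipeline stages
# for a stream of instructions, one stage per cycle, no stalls.
STAGES = ["Fetch", "Decode", "Issue", "Execute", "Return"]

def pipeline_schedule(num_instructions):
    """Return {cycle: [(instruction, stage), ...]} for an ideal pipeline."""
    schedule = {}
    for i in range(num_instructions):
        for s, stage in enumerate(STAGES):
            cycle = i + s + 1  # instruction i enters Fetch in cycle i+1
            schedule.setdefault(cycle, []).append((f"i{i + 1}", stage))
    return schedule

sched = pipeline_schedule(3)
# In cycle 2, i1 is decoded while i2 is fetched, matching the description.
```

Under this model, n instructions complete in n + 4 cycles instead of 5n, which is the point of the pipelined execution.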
Fig.2 shows a generalized pipelined video data processing flow according to one aspect of the invention. The currently processed pixel data are copied into a pixel buffer for faster access. Input data are processed in an entropy decoding stage E by first decoding the non-residual data 20a and then decoding the residual data 20b, for which the decoded non-residual data are required. As decoded data are output from the residual data decoding procedure 20b, they are successively passed (through the queuing unit, not shown here) to the next step 20c of inverse transformation and inverse quantization ITIQ. In this example, the entropy decoding stage E waits for a certain time after it has processed its data 20b and before it starts processing new data 21a, to prevent buffer overflow due to slower units, e.g. motion compensation MC.
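The scheduling difference between the conventional flow of Fig.1 and this pipelined flow can be sketched with a small timing model (the per-stage cycle counts are invented for illustration and are not taken from this description):

```python
# Assumed per-MB cycle counts for the entropy (E), ITIQ and motion
# compensation (MC) stages; the numbers are illustrative only.
STAGE_CYCLES = {"E": 4, "ITIQ": 3, "MC": 5}

def sequential_cycles(num_mbs):
    # Fig.1 style: a new MB starts only after the previous MB fully finishes.
    return num_mbs * sum(STAGE_CYCLES.values())

def pipelined_cycles(num_mbs):
    # Fig.2 style: each stage starts MB k once it has finished MB k-1 and
    # the previous stage has delivered MB k.
    stages = list(STAGE_CYCLES)
    finish = {s: 0 for s in stages}
    for _ in range(num_mbs):
        upstream = 0
        for s in stages:
            start = max(finish[s], upstream)
            finish[s] = start + STAGE_CYCLES[s]
            upstream = finish[s]
    return finish[stages[-1]]
```

With these assumed numbers, ten MBs take 120 cycles sequentially but only 57 cycles pipelined; in steady state the throughput is limited by the slowest stage (here MC), which also motivates the wait inserted in the entropy stage above.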
At least the specialized video function modules can hold two or more MBs to be processed in parallel. If only two MBs in parallel are supported, the buffer for storing MVs and residual data in the related modules stores the MVs and residual data for the two MBs. Simultaneous processing of three or more MBs can be supported if additional buffer space is available within the modules.
In the following, the hardware architecture according to the invention is described. Corresponding to the pipeline stages of Fig.3, the architecture can include five parts: an instruction fetch part, an instruction decoding part, an instruction issuing part, an instruction execution part and a result return part. The architecture is shown in Fig.5.
The instruction fetch part includes an instruction cache interface module 51, an instruction cache module 52 and the actual fetch module 53 including a program counter PC. The instruction decoding part includes the decoding module 54, and the instruction issuing part includes a queue module 55. The instruction execution part includes a data cache module 57, a data cache interface module 58, an ALU module 59, a motion compensation module 510, a motion compensation interface module 511, an Inverse Transform/Inverse Quantization (ITIQ) module 512, an entropy decoder module 513, an entropy decoder interface module 514, a de-blocking filter module 515, a filter interface module 516 and a result arbiter module 56. The result arbiter module 56 sends intermediate results, i.e. results from the other blocks of the execution part, to the queuing stage 55 before the next processing step is executed.
The input data come from an SDRAM via the "visiting SDRAM bus", and the final results are returned to the same SDRAM using the same bus. Alternatively, a separate bus could be used for returning the data. The result return part includes a visiting bus arbiter module 517.
In the following, the mentioned functional modules are described.
The instruction cache module 52 is mainly responsible for providing the instructions in this architecture. Through it, the instructions can be accessed faster than directly through the external SDRAM, since it stores instructions in an internal SRAM. The next instruction is determined by a program counter PC within the fetch module 53. If the access hits, i.e. if the determined instruction is cached in the SRAM of the instruction cache 52, the instruction cache module 52 sends the instruction data back. If the access misses, which means that the desired instruction does not exist in the SRAM of the instruction cache, then a command for getting the corresponding instruction from the SDRAM is issued to the instruction cache interface module 51. After the instruction cache interface module 51 has acquired the instruction, the instruction data are provided to the instruction cache module 52.
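The hit/miss behaviour described above can be sketched as follows (an illustrative model, not the described hardware: a plain dict stands in for the cache SRAM, and a callback stands in for the fetch through the instruction cache interface; all names are assumptions):

```python
# Sketch of instruction cache behaviour: serve hits from the internal SRAM,
# fill misses via the (simulated) instruction cache interface module.
class InstructionCache:
    def __init__(self, fetch_from_sdram):
        self.sram = {}                       # address -> instruction word
        self.fetch_from_sdram = fetch_from_sdram
        self.hits = self.misses = 0

    def read(self, pc):
        if pc in self.sram:                  # hit: serve from internal SRAM
            self.hits += 1
        else:                                # miss: fill via the interface
            self.misses += 1                 # module, then serve it
            self.sram[pc] = self.fetch_from_sdram(pc)
        return self.sram[pc]

sdram = {0: "addi", 4: "load", 8: "branch"}
icache = InstructionCache(lambda pc: sdram[pc])
words = [icache.read(pc) for pc in (0, 4, 0, 8, 4)]
```

Re-reading addresses 0 and 4 hits the SRAM, so only three of the five reads reach the SDRAM.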
The fetch module 53 is responsible for determining the PC value according to the progress of the program execution. The PC value is sent to the instruction cache module 52. If a jump or branch instruction is encountered, the PC value in the fetch module 53 is changed accordingly; otherwise, it is automatically increased by a defined increment.
The decode module 54 decodes the instruction, i.e. it translates the external format into an internal instruction format. The external format depends on the firmware, while the internal format is used by the function module that will receive the instruction.
After being decoded into the internal format by the decode module 54, the instructions are sent to the queue module 55, where they are stored, in principle in a FIFO (first-in-first-out) manner, in an operation queue 550, waiting to be issued to the function modules. The queue module 55 further comprises general registers 551 and specialized registers 552. When the function module corresponding to the instruction at the head of the queue is not busy, and all of the related source registers' values for this instruction are prepared, the instruction is put on the issue bus IB, along with the data read from the general registers 551 and the specialized registers 552. Some instructions on the issue bus IB, however, may require no further data to be provided. The general registers 551 provide data on a general data bus GDB, which is e.g. 32 bit wide, and the specialized registers provide data on a special data bus SDB, which is e.g. 128 bit wide. At the same time, every function module monitors the common issue bus IB and accepts instructions that are directed to it. Instructions can be conventional RISC processor instructions and can be addressed as in conventional RISC processors, e.g. by an address portion within the instruction. After execution in the respective function module, the result is sent back via an intermediate result bus IRB to the queue module 55, and the queue module updates its destination registers.
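The issue condition, that the head instruction leaves the operation queue only when its target module is idle and its source values are ready, can be sketched as follows (a simplified model; the instruction fields and module names are assumptions for illustration):

```python
# Sketch of the issue logic: one attempt per call, strictly from the head
# of the FIFO operation queue.
from collections import deque

def try_issue(queue, busy_modules, ready_registers):
    """Issue the head instruction if possible; return it or None."""
    if not queue:
        return None
    instr = queue[0]
    if instr["module"] in busy_modules:          # target module busy: stall
        return None
    if not all(src in ready_registers for src in instr["sources"]):
        return None                              # source values not ready
    return queue.popleft()                       # issue onto the bus

q = deque([{"op": "mc_pred", "module": "MC", "sources": {"r1", "r2"}},
           {"op": "itiq",    "module": "ITIQ", "sources": {"r3"}}])
stalled = try_issue(q, busy_modules={"MC"}, ready_registers={"r1", "r2"})
issued = try_issue(q, busy_modules=set(), ready_registers={"r1", "r2"})
```

Note that while the MC module is busy, the later ITIQ instruction is not issued either, reflecting the in-order FIFO behaviour of the operation queue.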
Thus, the queue module 55 can in a way be regarded as the control centre of the architecture. Though the processing is more decentralized than in conventional video decoding systems, the queue module controls the instruction flow.
Advantageously, the RISC processor elements that control the decoding process, e.g. the queue, are directly involved in the decoding process, so that only little communication between modules is necessary for the assignment of new data and instructions to the function modules.
The data cache module 57 contains an SRAM to enable faster access to the picture data than directly through the external SDRAM. This module is mainly responsible for performing data load and store operations. When it captures from the issue bus IB an instruction for accessing the data cache, it calculates the access address according to the data of the instruction. For each data access, it first checks if the data exist in its SRAM. If the access of a store operation hits, the data in the SRAM of the data cache module 57 are updated. If the access of a load operation hits, the data are read and sent to the intermediate result bus IRB.
If the access misses, which means that the desired data do not exist in the SRAM of the data cache module 57, a command for getting the corresponding data is issued to the data cache interface module 58, which sends a request signal to the SDRAM to get the required data. After the data cache interface module 58 has acquired the data from the SDRAM, the data are written into the data cache SRAM and sent to the intermediate result bus IRB.
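The load/store behaviour can be sketched as follows (an illustrative model only; the dict-based SRAM and the write-through policy are assumptions, since the description does not state the write policy):

```python
# Sketch of data cache load/store: store hits update the cache SRAM,
# load misses are filled from (simulated) SDRAM before being served.
class DataCache:
    def __init__(self, sdram):
        self.sram = {}
        self.sdram = sdram                   # stands in for interface + SDRAM

    def store(self, addr, value):
        self.sram[addr] = value              # update the cache SRAM
        self.sdram[addr] = value             # write-through (an assumption)

    def load(self, addr):
        if addr not in self.sram:            # miss: fill via interface 58
            self.sram[addr] = self.sdram[addr]
        return self.sram[addr]               # sent to the IRB in the device

mem = {100: 7, 104: 3}
dcache = DataCache(mem)
dcache.store(100, 9)
value = dcache.load(100)                     # hit after the store
loaded = dcache.load(104)                    # miss, filled from SDRAM
```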
The entropy module 513 is the starting point of the decoding procedure, obtaining all the elements for reconstructing the pictures from the encoded bit-stream. It decodes from the bit-stream the syntax elements according to the utilized video standard, including e.g. the differential motion vector (mvd), reference index, residual data etc. This module performs various computations, incl. motion vector computation according to the mvd, computing the intra mode according to pred_mode_flag and intra_luma_pred_mode, and computing the neighbour information for decoding the syntax elements.
The entropy module may automatically read the bit-stream to be decoded from an external SDRAM according to an address, which the programmer can set in the instruction. The entropy module 513 works together with the entropy interface module 514 to obtain the bit-stream from the SDRAM. If the entropy module is idle because it has currently no bit-stream data to process, it may send a request for data to the entropy interface module 514. The entropy interface module either sends back the required data to the entropy module 513, or if it has no data to provide then it may send a request for data to the SDRAM.
The motion compensation (MC) module 510 includes two parts or sub-modules (not shown in Fig.5): intra MC for intra prediction and inter MC for inter prediction. For the intra prediction, the prediction mode and residual data, which the entropy module decoded from the compressed bit-stream before, are sent to the intra MC sub-module. The intra MC sub-module is invoked by an instruction, calculates the prediction of a current 4x4 block, adds the prediction and
residual data and thus gets the motion compensated (i.e. reconstructed) data for the block.
The inter MC sub-module performs the inter motion compensation. When decoding, this part needs to find appropriate integer samples based on motion vectors and reference index (refidx) of a sub block (4x4 block for luma, 2X2 block for Cb and Cr chroma) . Then the fractional prediction samples are derived through interpolation.
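The two-step structure, locating the integer reference samples from the motion vector and then interpolating fractional positions, can be sketched as follows. This is illustrative only: a rounded bilinear average is used here, whereas real codecs use longer interpolation filters (e.g. a 6-tap filter for half-samples in H.264), and the half-sample motion vector units are an assumption.

```python
# Sketch: integer sample lookup plus half-sample bilinear interpolation.
def predict_sample(ref, x, y, mv_x2, mv_y2):
    """mv_x2/mv_y2 are motion vector components in half-sample units."""
    ix, fx = divmod(mv_x2, 2)        # integer part and half-sample flag
    iy, fy = divmod(mv_y2, 2)
    x0, y0 = x + ix, y + iy
    a = ref[y0][x0]
    b = ref[y0][x0 + fx]             # right neighbour if fx == 1
    c = ref[y0 + fy][x0]             # lower neighbour if fy == 1
    d = ref[y0 + fy][x0 + fx]
    return (a + b + c + d + 2) // 4  # rounded bilinear average

ref = [[10, 20, 30],
       [40, 50, 60],
       [70, 80, 90]]
full = predict_sample(ref, 0, 0, 2, 0)   # mv = (1, 0): integer position
half = predict_sample(ref, 0, 0, 1, 1)   # mv = (0.5, 0.5): interpolated
```

For the integer motion vector the reference sample is returned unchanged; for the half-sample vector the four surrounding integer samples are averaged.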
The MC interface module 511 provides the reference data for inter prediction in the inter MC sub-module. If currently required reference data for the inter MC sub-module are not available in the buffers of the MC interface module 511, the MC interface module 511 sends a request to the SDRAM to obtain those data. After the required data have been returned to the MC interface module, they are stored in buffers and sent to the MC module 510.
The Inverse Transform and Inverse Quantization (ITIQ) module 512 is responsible for inverse scanning, inverse transformations and inverse quantization operations on 4*4 pixel sub-blocks of the residual data. It returns its result via the intermediate result bus IRB into the queue module 55. The data that are required by the ITIQ module are provided by the respective instruction.
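The first two ITIQ steps on a 4x4 sub-block can be sketched as follows (a simplified illustration: the scan order shown is the common 4x4 zig-zag, a flat quantization step is assumed instead of per-position scaling matrices, and the inverse transform step itself is omitted):

```python
# Sketch: inverse zig-zag scan of 16 coefficient levels into a 4x4 block,
# followed by a flat inverse quantization (level * step size).
ZIGZAG = [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2), (0, 3), (1, 2),
          (2, 1), (3, 0), (3, 1), (2, 2), (1, 3), (2, 3), (3, 2), (3, 3)]

def inverse_scan_quant(coeffs, qstep):
    """coeffs: 16 levels in scan order -> 4x4 dequantized block."""
    block = [[0] * 4 for _ in range(4)]
    for level, (row, col) in zip(coeffs, ZIGZAG):
        block[row][col] = level * qstep
    return block

levels = [5, -2, 1] + [0] * 13          # typical sparse residual levels
block = inverse_scan_quant(levels, qstep=8)
```

The sparse trailing zeros are typical of entropy-decoded residuals: only the first few scan positions carry non-zero levels, which end up in the low-frequency corner of the block.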
The filter module 515 is applied to every decoded macroblock (MB) for reducing blocking distortion. The filter smoothes block edges, thus improving the appearance of the decoded frames. The filter module 515 can deal with the filter process of a single MB (non-mbaff) or a MB pair (mbaff). It receives the required data of a current MB for filtering, such as MVs, "non-zero" information, frame or field flag, the pixel data etc. through the instructions. In the mbaff mode case, it reads those data of the other MB from a filter interface module 516.
The filter interface module 516 is for storing and providing the neighbour MVs and the pixel data of the neighbour 4x4 sub-block, and for storing the loop-filtered and finally processed data into the SDRAM. If the neighbour information and filtered data were stored into the SDRAM directly, the process would be very slow. Therefore these data are stored into a buffer within the filter interface module 516, and then stored in the SDRAM using a burst write function, e.g. when the buffer is full. Thus the SDRAM efficiency is improved significantly. Also a burst read operation from the SDRAM to the interface module can in principle be used.
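The burst-write buffering idea can be sketched as follows (the burst length, the class name and the list-of-bursts stand-in for SDRAM transactions are all assumptions for illustration):

```python
# Sketch: collect single writes in a buffer and flush them to SDRAM as one
# burst when the buffer is full, instead of one SDRAM transaction per word.
class BurstWriteBuffer:
    def __init__(self, sdram_bursts, burst_len=8):
        self.buffer = []
        self.burst_len = burst_len
        self.sdram_bursts = sdram_bursts     # each flush appends one burst

    def write(self, word):
        self.buffer.append(word)
        if len(self.buffer) == self.burst_len:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sdram_bursts.append(list(self.buffer))
            self.buffer.clear()

bursts = []
wb = BurstWriteBuffer(bursts, burst_len=4)
for word in range(10):
    wb.write(word)
wb.flush()                                    # drain the partial tail
```

Ten single writes become three SDRAM bursts, which is the efficiency gain the description attributes to the burst write function.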
Several function modules return intermediate results, which require further processing and which are sent back to the queue module. In this example, these modules are the arithmetic-logic unit (ALU) 59, the data cache 57, the entropy module 513, the ITIQ 512, and the MC block 510. But since the queue module can accept only one result at a time, a result bus arbiter module 56 selects one result and transfers it via the intermediate result bus IRB to the queue module 55. The result bus arbiter module may have internal buffers to store the results received from the function blocks while waiting for the intermediate result bus IRB.
There are several modules that need to access the external SDRAM, such as instruction cache interface 51, data cache
interface 58, MC interface 511, entropy interface 514 and filter interface 516. The requests from all of these blocks to the SDRAM cannot be served at the same time. Therefore the visiting bus arbiter module 517 selects one bus request to be active at a time, according to predefined priorities for the different interface modules.
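A fixed-priority arbiter of this kind can be sketched as follows (the particular priority order and the requester names are assumptions; the description only says that the priorities are predefined):

```python
# Sketch: among pending requests, the interface module with the highest
# predefined priority is granted the SDRAM bus.
PRIORITY = ["entropy_if", "mc_if", "filter_if", "dcache_if", "icache_if"]

def grant(pending):
    """Return the highest-priority pending requester, or None."""
    for requester in PRIORITY:
        if requester in pending:
            return requester
    return None

winner = grant({"dcache_if", "mc_if"})   # mc_if outranks dcache_if
```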
In the following, the decoding procedure according to the previously mentioned phases is described. First, the fetch module 53 fetches an instruction from the instruction cache module 52 according to the program counter in the fetch module. Second, the instruction is sent via the instruction decoder module 54 to the instruction queue module 55. Third, the instructions in the instruction queue module 55 are issued to the related function module according to the respectively required operation. In the fourth phase, the function module performs its processing according to the instruction. Fifth, the operation result is returned via the intermediate result bus IRB to the register file
551,552 in the instruction queue module 55. When executing an instruction, the function modules may send request signals to their respectively related interface modules if required data are missing.
The video decoding specific function modules, such as entropy decoder 513, ITIQ 512, motion compensation 510 and de-blocking filter 515, can be configured depending on a particular application to perform the actual operation required for decoding the respective coding format. The configuration can be based on firmware or software. For example, the motion compensation block can perform certain
operations for decoding according to MPEG-4 Video standard, and other operations according to the AVC standard.
Whatever video standard is supported with this architecture, the decoding procedure is controlled by the program, which always uses the same, defined instruction set. Further, the SDRAM storage space is shared by the program, the input bit-stream, the output decoding result and temporary data created during program execution. Before the decoding, some parts of the bit-stream are automatically put into the SDRAM by the related hardware. New parts of the bit-stream are automatically stored in the SDRAM little by little, and during the decoding procedure the decoder consumes the bit-stream little by little. At the same time, the reconstructed data, i.e. the picture data computed by the decoding architecture, are stored into the SDRAM. The different stages of the processing however use separate areas of the SDRAM.
When the reconstructed data in the SDRAM are needed for displaying or other purposes, those data are output automatically by hardware circuitry. If those data are useful for decoding further pictures, they remain in the SDRAM. Otherwise the related space in the SDRAM is overwritten with new picture data.
During the decoding procedure, the entropy module 513 and the entropy interface module 514 automatically read the bit-stream from a fixed SDRAM space according to the corresponding address in the entropy interface module 514. The address is increased by hardware, wherein the address will continue at the minimum address after the maximum address of the bit-stream address space in the SDRAM is
reached. The de-blocking filter module 515 and the filter interface module 516 store the decoded result into a fixed SDRAM space automatically according to a corresponding address that is provided by the program.
In this architecture, the decoding procedure controlled by the firmware can be divided into three steps. The first step is to decode the parameters on picture or slice level: if those parameters (such as QP, weighted prediction parameters, picture size, slice type etc.) are useful for decoding the other syntax elements, they are stored in so-called global registers. The global registers are connected with the function modules and control the instruction execution. The second step is to decode the syntax elements on MB level: these elements are decoded one by one. Like the picture or slice level parameters, these elements (such as macroblock type, frame or field flag) are stored into the global registers if they control the other function modules. This architecture allows the entropy module, ITIQ module, MC module and filter module to work in parallel on different MBs.
The third step is post decoding: after decoding all the elements of a macroblock, the firmware computes the next MB position in the whole picture. For the last MB of the picture, the firmware performs the DPB (decoded picture buffer) management, and then continues with decoding the next picture.
For all current video standards including MPEG2, H.264, AVS and VC-1, the basic processing unit is the MB. The position of a MB in the whole picture is defined as shown in Fig.4, assuming the size of the picture is M*N.
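The MB-position arithmetic implied here can be sketched as follows (a minimal sketch assuming 16x16 macroblocks and raster-scan MB numbering; the numbering convention is an assumption, since Fig.4 is not reproduced here):

```python
# Sketch: map a raster-scan MB index to its top-left pixel coordinates in
# a picture of M x N pixels (M = picture width, multiple of 16 assumed).
def mb_position(mb_index, picture_width):
    """Return (x, y) of the MB's top-left pixel."""
    mbs_per_row = picture_width // 16
    mb_x = mb_index % mbs_per_row
    mb_y = mb_index // mbs_per_row
    return mb_x * 16, mb_y * 16

pos = mb_position(13, picture_width=176)   # QCIF width: 11 MBs per row
```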
The execution within each of the functional modules is similar. Taking the MC module 510 as an example, the execution can be divided into the following steps, cf. Fig.5.
First, the instruction (in internal format, i.e. decoded) which was received in the queue module 55 from the decode module, and if necessary queued in the operation queue 550, is sent into the motion compensation module 510. The instruction brings some data along, e.g. motion vectors and/or the residual data. These data may be stored in an internal buffer MCBUF.
Second, after getting the instruction and the related data, the MC module 510 begins to execute the instruction. During execution, if the required data are available within the internal buffer MCBUF of the MC module 510 (e.g. from the previous MB) , those data will be used immediately. If the reference data are missing, the MC module sends a request signal to the MC interface module 511. If the MC interface module 511 finds those data in its internal buffer, then it returns these data to the MC module 510. Otherwise, the MC interface module 511 sends a request to the visiting bus arbiter module 517 which connects to the external SDRAM. The visiting bus arbiter 517 gets requests from all the interface modules, and selects one to visit the SDRAM and get the data.
Third, when the requested data are returned from the SDRAM, they are stored in the MC interface module 511 and returned to the MC module 510. Fourth, after its computation, the motion compensation result is sent to the result arbiter module 56, which collects all the results from the function modules and selects them one after the other for returning to the queue module 55.
Fifth, the result data after execution are written back to the registers 551,552 in the queue module 55, and the value in the registers 551,552 of the queue module 55 is updated.
For those modules that have no related interface module, such as ALU 59 or ITIQ 512, the execution has only three steps, namely the first, fourth and fifth of the above description.
An advantage of the invention is that the idle time of processing blocks is reduced. This leads to an improved efficiency, namely either less power consumption with a similar performance, or increased performance with comparable power consumption.
The present invention prevents the bottlenecks resulting from centrally organized processing of known multi-standard video decoders. An improved device for decoding video data comprises common elements of a RISC processor, including an instruction providing unit, a queuing unit and an ALU, and special video processing modules, wherein the video processing modules are embedded in the RISC processor, so that they also receive instructions through the instruction bus and provide data to the queuing unit like the common RISC processor elements. The special video processing modules include a MC unit, means for performing IDCT and inverse quantization, an entropy decoding unit and a filter unit.
The invention is advantageous for video decoding products, particularly for HD resolution decoders implemented in a modular fashion, in hardware or in software, such as e.g. multi-standard decoders for H.264, VC-1, MPEG-2, AVC etc.
Claims
1. Device for decoding video data, comprising means (51,52,53,54) for providing decoded instructions; a queuing unit (55) for receiving the decoded instructions and receiving result data (IRB), and for providing instructions on an instruction bus (IB); an arithmetic-logic unit (59) and a data cache unit (57) receiving instructions through the instruction bus (IB) and providing (IRB) data to the queuing unit (55); a motion compensation unit (510); ITIQ means (512) for performing inverse transformation and inverse quantization; an entropy decoding unit (513); and a filter unit (515), wherein the motion compensation unit (510), the ITIQ means (512), the entropy decoding unit (513) and the filter unit (515) receive instructions through said instruction bus (IB) and provide (IRB) data to said queuing unit (55).
2. Device according to claim 1, wherein the motion compensation unit (510), the ITIQ means (512), the entropy decoding unit (513) and the filter unit (515) are capable of simultaneously processing data of two or more macroblocks.
3. Device according to claim 1 or 2, wherein each of the motion compensation unit (510), the ITIQ means (512), the entropy decoding unit (513) and the filter unit (515) can simultaneously process video data blocks of different size.
4. Device according to one of the claims 1-3, wherein the queuing unit (55) comprises an operation queue (550) for instructions and at least two data queues (551,552), wherein the two data queues (551,552) have different width.
5. Device according to one of the claims 1-4, wherein each of the motion compensation unit (510), the ITIQ means (512), the entropy decoding unit (513) and the filter unit (515) has means for detecting that it has free processing capacity, and upon said detecting requests a new instruction from the queuing unit (55).
6. Device according to one of the claims 1-5, further comprising a result arbiter module (56) for providing said result data (IRB) to the queue module (55), wherein the result arbiter module receives data from the data cache unit (57), the arithmetic-logic unit (59), the motion compensation unit (510), the ITIQ means (512), the entropy decoding unit (513) and the filter unit (515), and wherein the result arbiter module comprises means for selecting one of said results at a time.
7. Device according to one of the claims 1-6, wherein the video processing unit is a 4X4 pixel block for luma data and a 2X2 pixel block for chroma data.
8. Device according to one of the claims 1-7, wherein the filter module (515) is a de-blocking filter that has a first mode for filtering single macroblocks and a second mode for filtering macroblock pairs, the device further comprising a filter interface module (516), wherein for said second mode macroblock data of a second macroblock are read from the filter interface module (516).
9. Device according to one of the claims 1-8, further comprising a bus arbiter module (517) for connecting to an external memory, the bus arbiter module (517) having means for selecting one of a plurality of bus requests from different interface modules according to predefined priorities.
10. Device according to one of the claims 1-9, wherein the entropy decoding unit (513), the ITIQ means (512), the motion compensation unit (510) and the filter unit (515) can be firmware configured to perform their respective operation adapted to different video coding formats.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2006/002518 WO2008037113A1 (en) | 2006-09-25 | 2006-09-25 | Apparatus and method for processing video data |
CN200680055930.0A CN101513067B (en) | 2006-09-25 | 2006-09-25 | Equipment for processing video data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2006/002518 WO2008037113A1 (en) | 2006-09-25 | 2006-09-25 | Apparatus and method for processing video data |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2008037113A1 true WO2008037113A1 (en) | 2008-04-03 |
Family
ID=39229695
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2006/002518 WO2008037113A1 (en) | 2006-09-25 | 2006-09-25 | Apparatus and method for processing video data |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN101513067B (en) |
WO (1) | WO2008037113A1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103379330A (en) * | 2012-04-26 | 2013-10-30 | 展讯通信(上海)有限公司 | Code stream data decoding pretreatment method and decoding method, processor and decoder |
CN114339044B (en) * | 2021-12-29 | 2024-06-18 | 天津天地伟业智能安全防范科技有限公司 | High-throughput snapshot method and device based on message queue |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003179923A (en) * | 2001-12-12 | 2003-06-27 | Nec Corp | Decoding system for dynamic image compression coded signal and method for decoding, and program for decoding |
EP1351512A2 (en) * | 2002-04-01 | 2003-10-08 | Broadcom Corporation | Video decoding system supporting multiple standards |
EP1475972A2 (en) * | 2003-05-08 | 2004-11-10 | Matsushita Electric Industrial Co., Ltd. | Apparatus and method for moving picture decoding device with parallel processing |
2006
- 2006-09-25 CN CN200680055930.0A patent/CN101513067B/en not_active Expired - Fee Related
- 2006-09-25 WO PCT/CN2006/002518 patent/WO2008037113A1/en active Application Filing
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2010034204A1 (en) * | 2008-09-25 | 2010-04-01 | Mediatek Inc. | Adaptive interpolation filter for video coding |
US8548041B2 (en) | 2008-09-25 | 2013-10-01 | Mediatek Inc. | Adaptive filter |
US9762925B2 (en) | 2008-09-25 | 2017-09-12 | Mediatek Inc. | Adaptive interpolation filter for video coding |
TWI586149B (en) * | 2014-08-28 | 2017-06-01 | Apple Inc | Video encoder, method and computing device for processing video frames in a block processing pipeline |
US9762919B2 (en) | 2014-08-28 | 2017-09-12 | Apple Inc. | Chroma cache architecture in block processing pipelines |
US10205957B2 (en) * | 2015-01-30 | 2019-02-12 | Mediatek Inc. | Multi-standard video decoder with novel bin decoding |
Also Published As
Publication number | Publication date |
---|---|
CN101513067B (en) | 2012-02-01 |
CN101513067A (en) | 2009-08-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8369420B2 (en) | Multimode filter for de-blocking and de-ringing | |
Zhou et al. | Implementation of H.264 decoder on general-purpose processors with media instructions | |
US8116379B2 (en) | Method and apparatus for parallel processing of in-loop deblocking filter for H.264 video compression standard | |
US7747088B2 (en) | System and methods for performing deblocking in microprocessor-based video codec applications | |
US8516026B2 (en) | SIMD supporting filtering in a video decoding system | |
US20060115002A1 (en) | Pipelined deblocking filter | |
US20120328000A1 (en) | Video Decoding System Supporting Multiple Standards | |
KR101158345B1 (en) | Method and system for performing deblocking filtering | |
EP1673942A1 (en) | Method and apparatus for processing image data | |
US9161056B2 (en) | Method for low memory footprint compressed video decoding | |
WO2003047265A2 (en) | Multiple channel video transcoding | |
JP2007295423A (en) | Processing apparatus and method of image data, program for processing method of image data, and recording medium with program for processing method of image data recorded thereon | |
US20090010326A1 (en) | Method and apparatus for parallel video decoding | |
Cheng et al. | An in-place architecture for the deblocking filter in H.264/AVC | |
US8036269B2 (en) | Method for accessing memory in apparatus for processing moving pictures | |
US20100321579A1 (en) | Front End Processor with Extendable Data Path | |
Engelhardt et al. | FPGA implementation of a full HD real-time HEVC main profile decoder | |
Shafique et al. | Optimizing the H.264/AVC video encoder application structure for reconfigurable and application-specific platforms | |
JP2006157925A (en) | Pipeline deblocking filter | |
WO2008037113A1 (en) | Apparatus and method for processing video data | |
Koziri et al. | Implementation of the AVS video decoder on a heterogeneous dual-core SIMD processor | |
WO2002087248A2 (en) | Apparatus and method for processing video data | |
Kuo et al. | An H.264/AVC full-mode intra-frame encoder for 1080HD video | |
JP3861607B2 (en) | Image signal decoding apparatus | |
EP1351512A2 (en) | Video decoding system supporting multiple standards |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase |
Ref document number: 200680055930.0 Country of ref document: CN |
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 06791107 Country of ref document: EP Kind code of ref document: A1 |
NENP | Non-entry into the national phase |
Ref country code: DE |
122 | Ep: pct application non-entry in european phase |
Ref document number: 06791107 Country of ref document: EP Kind code of ref document: A1 |