CN101072349B

CN101072349B - Decoding system and method of context adaptive variable length codes

Info

Publication number: CN101072349B
Application number: CN 200710110295
Authority: CN
Inventors: 扎伊尔德·荷圣
Original assignee: Via Technologies Inc
Current assignee: Via Technologies Inc
Priority date: 2006-06-08
Filing date: 2007-06-08
Publication date: 2012-10-10
Anticipated expiration: 2027-06-08
Also published as: CN101072350A; TW200821982A; TWI354239B; TWI428850B; TWI348653B; TWI344795B; CN101087411A; CN101072349A; TW200803526A; TW200813884A; CN101072353A; CN101072353B; TW200809689A; CN101072350B

Abstract

Various embodiments of decoding systems and methods are disclosed. One system embodiment, among others, comprises a software programmable core processing unit having a context-adaptive variable length coding (CAVLC) unit configured to execute a shader, the shader configured to implement CAVLC decoding of a video stream and provide a decoded data output.

Description

Decoding system and method of context adaptive variable length codes

Technical field

The present invention relates to a decoding system and method.

Background technology

Computer graphics is the art and science of generating pictures, images, or other graphical or pictorial information with a computer. Many current graphics systems include several interfaces, such as Microsoft's Direct3D interface and OpenGL, that allow multimedia hardware such as a graphics accelerator or graphics processing unit (GPU) to be controlled on a computer running a particular operating system (such as Microsoft Windows). The generation of pictures or images is commonly called rendering, and the details of such operations are generally carried out by the graphics accelerator. In three-dimensional (3D) computer graphics, the geometry that makes up the surfaces of objects in a scene is converted into pixels (picture elements), stored in a frame buffer, and then displayed on a display device. Each object or group of objects has specific visual properties related to its appearance, such as material, reflectivity, shape, and texture, which may be defined as the context of the object or group of objects.

Computer graphics must satisfy consumers' growing demands for control and features in games and other multimedia products, while also producing more realistic images and improving processing speed and power consumption. Many standards have been developed that deliver better image quality with fewer bits. For example, the H.264 standard (also known as ISO MPEG-4 Part 10) is a current high-compression digital video coding standard. Compared with MPEG-2 compatible encoding, H.264 compatible encoding needs only about one third of the bits to store video of the same quality. The H.264 standard provides two entropy coding methods: context-adaptive binary arithmetic coding (CABAC) and context-adaptive variable length coding (CAVLC). CAVLC is a context-adaptive variant of Huffman coding, in which the probability of each coded symbol can change according to the type of data being coded. CAVLC uses run-level coding to represent strings of zeros compactly and signals the number of high-frequency +/-1 coefficients in a compact way. In CAVLC, coding at or below the slice layer (including the second-level Hadamard transform of the DC coefficients of the 4 × 4 transforms) adapts to the nonzero coefficients of neighboring blocks. Current CAVLC decoding architectures satisfy some consumer demands, but their designs still have limitations.
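To illustrate how run-level coding represents strings of zeros compactly, the following minimal sketch encodes a list of coefficients as (run, level) pairs. It shows only the general run-level idea and is not the actual H.264 CAVLC syntax, which codes levels and runs with separate context-selected tables. The function names and the sample block are hypothetical.

```python
def run_level_encode(coeffs):
    """Encode coefficients as (run, level) pairs.

    run   = number of zeros preceding a nonzero coefficient
    level = the nonzero coefficient itself
    Trailing zeros are not stored; they are implied by the block length.
    """
    pairs = []
    run = 0
    for c in coeffs:
        if c == 0:
            run += 1
        else:
            pairs.append((run, c))
            run = 0
    return pairs


def run_level_decode(pairs, length):
    """Rebuild the coefficient list from (run, level) pairs."""
    out = []
    for run, level in pairs:
        out.extend([0] * run)
        out.append(level)
    out.extend([0] * (length - len(out)))  # restore implied trailing zeros
    return out


# Hypothetical 16-coefficient block in scan order.
block = [7, 0, 0, -2, 1, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0]
pairs = run_level_encode(block)
restored = run_level_decode(pairs, len(block))
```

Here 16 coefficients reduce to four pairs, and decoding with the known block length restores the original list.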

Summary of the invention

The present invention discloses a context-adaptive variable length coding (CAVLC) decoding system and method (hereinafter simply the decoding system) for use in the multithreaded parallel computation core of a graphics processing unit (GPU). Briefly, in one embodiment, the system comprises a software programmable core processing unit having a CAVLC unit configured to execute a shader. The shader includes an extended instruction set for performing CAVLC decoding of a video stream and provides a decoded data output. The CAVLC decoding is accomplished jointly by the shader of the CAVLC unit, the execution unit data path of the programmable core processing unit, and additional hardware for the bitstream buffer in the CAVLC processing environment.

The method embodiment comprises the following steps: loading a shader onto a programmable core processing unit having a CAVLC unit; executing the shader on the CAVLC unit to CAVLC-decode a video stream; and providing a decoded data output. The CAVLC decoding is accomplished jointly by the shader function of the CAVLC unit, the execution unit data path of the programmable core processing unit, and additional hardware for the bitstream buffer in the CAVLC processing environment.

Other systems, methods, features, and advantages will become apparent to those skilled in the art upon examination of the following drawings and detailed description. All such additional systems, methods, features, and advantages fall within the scope of the present invention and are protected by the appended claims.

Description of drawings

Various aspects of the disclosed embodiments can be better understood with reference to the following drawings. The elements in the drawings are not necessarily drawn to scale and serve only to illustrate the principles of the invention clearly. Like reference numerals designate corresponding parts throughout the figures.

Fig. 1: block diagram of an embodiment of a graphics processor system in which various embodiments of the decoding system (and method) can be implemented.

Fig. 2: block diagram of an exemplary processing environment in which various embodiments of the decoding system can be implemented.

Fig. 3: block diagram of selected elements of the exemplary processing environment of Fig. 2.

Fig. 4: block diagram of the computation core of the exemplary processing environment of Figs. 2 and 3, in which various embodiments of the decoding system can be implemented.

Fig. 5A: block diagram of selected elements of an execution unit of the computation core of Fig. 4, in which various embodiments of the decoding system can be implemented.

Fig. 5B: block diagram of an execution unit data path in which various embodiments of the decoding system can be implemented.

Fig. 6A: block diagram of an embodiment of the decoding system shown in Fig. 5.

Fig. 6B: block diagram of an embodiment of the bitstream buffer of the decoding system of Fig. 6A.

Fig. 6C: block diagram of an embodiment of the context memory structure and associated registers of the decoding system of Fig. 6A.

Fig. 6D: block diagram of an embodiment of the window structure used by the decoding system 200 for CAVLC decoding.

Embodiment

The present invention discloses various embodiments of context-adaptive variable length coding (CAVLC) decoding systems and methods (hereinafter collectively the decoding system). In one embodiment, the decoding system is embedded in one or more execution units of the programmable, multithreaded, parallel computation core of a graphics processing unit (GPU) and achieves the decoding function through a combination of software and hardware. That is, video decoding is accomplished in the context of GPU programming, in cooperation with hardware implemented in the GPU data path. For example, the decoding operations or methods are accomplished jointly by a shader (such as a vertex shader) with an extended instruction set, the execution unit data path of the GPU, and additional hardware for automatically managing the bitstream buffer in the CAVLC processing environment. This differs from known systems, in which CAVLC processing is done purely in hardware or purely in software and implementation flexibility is therefore limited. For example, implementations based purely on a digital signal processor (DSP) or a microprocessor have no hardware for symbol decoding and bitstream management.

In addition, automatic bitstream buffer management offers several advantages. For example, once the direct memory access (DMA) engine of the bitstream buffer knows the location (address) of the bitstream, it manages the bitstream automatically without further instructions. This contrasts with conventional microprocessor systems, in which bitstream management represents a large overhead. Moreover, by tracking the number of bits consumed, the bitstream buffer mechanism can detect and handle a corrupted bitstream.
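The idea of detecting a bad bitstream by tracking consumed bits can be sketched in software as follows. This is only a minimal model under stated assumptions: the class name `BitstreamBuffer`, the MSB-first bit order, and the choice to raise an exception on overrun are all hypothetical and are not taken from the patent, which describes a hardware buffer fed by a DMA engine.

```python
class BitstreamBuffer:
    """Software model of a bitstream reader that counts consumed bits."""

    def __init__(self, data: bytes):
        self.data = data
        self.bits_used = 0  # running count of bits consumed so far

    def read(self, n: int) -> int:
        """Read n bits, MSB first (assumed order), and return them as an int."""
        if self.bits_used + n > len(self.data) * 8:
            # More bits requested than the stream holds: treat as corruption.
            raise ValueError("bitstream overrun: corrupt or truncated stream")
        value = 0
        for _ in range(n):
            byte = self.data[self.bits_used // 8]
            bit = (byte >> (7 - self.bits_used % 8)) & 1
            value = (value << 1) | bit
            self.bits_used += 1
        return value


buf = BitstreamBuffer(bytes([0b10110000]))
first = buf.read(3)   # bits 1,0,1
second = buf.read(5)  # bits 1,0,0,0,0
try:
    buf.read(1)
    overrun_detected = False
except ValueError:
    overrun_detected = True
```

Because the reader always knows how many bits have been used, a request that runs past the end of the stream is caught at the point of the read rather than producing silent garbage.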

Another advantage of the decoding system is reduced instruction latency. Because CAVLC decoding is a highly sequential operation that is difficult to multithread, various embodiments use a forwarding mechanism, such as register forwarding, to reduce latency. More specifically, a deeply pipelined, multithreaded processor cannot execute an instruction from the same thread in every cycle. Some systems use general forwarding, which compares the operand address of the current instruction with the result address of the previous instruction and, if they match, uses the previous result as the operand. Such general forwarding requires relatively complex comparison and multiplexing. Some embodiments of the decoding system use a different form of forwarding: whether an operand takes the previous result (for example, held in an internal register) or the data of the source operand is encoded with bits in the instruction (for example, 2 bits in total, 1 per operand). This reduces overall latency and improves the efficiency of the processor pipeline.
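The instruction-encoded forwarding described above can be modeled with a small sketch. Everything here is an assumption for illustration: the class `ForwardingDatapath`, the one-instruction write-back delay, and the bit assignment (bit 1 for the first operand, bit 0 for the second) are hypothetical. The point is only that two bits in the instruction select the previous result without any address comparison.

```python
import operator


class ForwardingDatapath:
    """Toy data path in which results reach the register file one instruction late."""

    def __init__(self, regs):
        self.regs = dict(regs)
        self.last = 0        # internal register holding the previous result
        self.pending = None  # (dst, value) waiting to be written back

    def issue(self, op, dst, src1, src2, fwd=0b00):
        # fwd bit 1 selects the previous result for operand 1,
        # fwd bit 0 selects it for operand 2; no address comparison is needed.
        a = self.last if fwd & 0b10 else self.regs[src1]
        b = self.last if fwd & 0b01 else self.regs[src2]
        if self.pending is not None:
            d, v = self.pending
            self.regs[d] = v  # delayed write-back of the older result
        result = op(a, b)
        self.last = result
        self.pending = (dst, result)
        return result


# With forwarding: the second add sees r2 = 3 through the internal register.
dp = ForwardingDatapath({"r0": 1, "r1": 2, "r2": 0})
dp.issue(operator.add, "r2", "r0", "r1")
forwarded = dp.issue(operator.add, "r3", "r2", "r1", fwd=0b10)

# Without forwarding: the second add reads the stale r2 = 0.
dp2 = ForwardingDatapath({"r0": 1, "r1": 2, "r2": 0})
dp2.issue(operator.add, "r2", "r0", "r1")
stale = dp2.issue(operator.add, "r3", "r2", "r1")
```

In a real pipeline the second instruction would stall rather than read a stale value; the stale read here simply makes the dependency visible.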

The decoding systems described herein may conform to the well-known International Telecommunication Union Telecommunication Standardization Sector (ITU-T) H.264 standard. Various embodiments of the decoding system operate according to one or more instruction sets received from the GPU frame buffer memory or from the memory of a host processor (such as a central processing unit, CPU), for example through well-known mechanisms such as preloading or cache misses.

Fig. 1 is a block diagram of an embodiment of a graphics processor system 100 in which the decoding systems and methods are implemented. In some implementations, the graphics processor system 100 may be a computer system. The graphics processor system 100 may include a display unit 102 driven by a display interface unit (DIU) 104, and a local memory 106 (which may include a display buffer, frame buffer, texture buffer, command buffer, and so on). The local memory 106 may also be referred to as a frame buffer or storage unit. The local memory 106 is connected to a graphics processing unit (GPU) 114 through one or more memory interface units (MIU) 110. In one embodiment, the memory interface unit 110, the GPU 114, and the display interface unit 104 are connected to a bus interface unit (BIU) 118 compatible with Peripheral Component Interconnect Express (PCI-E). In one embodiment, the bus interface unit 118 may use a graphics address remapping table (GART), although other memory mapping mechanisms may also be used. The GPU 114 includes the decoding system 200, which is described further below. Although in some embodiments the decoding system 200 is drawn as a single element within the GPU 114, it may in fact comprise one or more additional elements of the graphics processor system 100, whether shown or not.

The bus interface unit 118 is connected to a chipset 122 (such as a northbridge chipset) or a switch. The chipset 122 includes interface electronics to strengthen signals received from a central processing unit (CPU) 126 (also called the host processor) and to separate signals to and from the system memory 124 from signals to and from input/output (I/O) devices. Although the PCI-E bus protocol is mentioned here, other connection and/or communication methods (such as PCI or a dedicated high-speed bus) may be used between the host processor and the GPU 114. The system memory 124 also contains driver software 128, which uses the CPU 126 to send instruction sets or commands to registers in the GPU 114.

In some embodiments, additional GPUs may be provided and connected to the other elements of Fig. 1 through the chipset 122 using the PCI-E bus protocol or another communication protocol. In one embodiment, the graphics processor system 100 may include all of the elements of Fig. 1, although some elements may be removed, added, or changed; for example, a southbridge chipset connected to the chipset 122 may be added.

Fig. 2 is a block diagram of an exemplary processing environment in which the decoding system 200 is applied. The GPU 114 includes a graphics processor 202, which in turn includes a multiple execution unit (EU) computation core 204 (that is, the software programmable core processing unit). In one embodiment, the computation core 204 includes the decoding system 200 (that is, the CAVLC unit) embedded in an execution unit data path (EUDP), which is distributed to one or more execution units. The graphics processor 202 also includes an execution unit set control and vertex/stream cache unit 206 (hereinafter the EU set control unit 206) and a graphics pipeline 208 with fixed-function logic (for example, a triangle set-up unit (TSU), a span-tile generator (STG), and so on). The computation core 204 comprises a set of multiple unified execution units that meet the computational requirements of the shader tasks of different shader programs, which may include vertex shaders, geometry shaders, and/or pixel shaders, and that process data for the graphics pipeline 208. Because a shader of the computation core 204 performs most of the functions of the decoding system 200, an embodiment of the graphics processor 202 is described first, followed by the details of the decoding system 200.

The decoding system may be implemented in hardware, software, firmware, or a combination thereof. In the preferred embodiment, the decoding system 200 is implemented in hardware and software using any one or a combination of the following well-known technologies: discrete logic circuits having logic gates for implementing logic functions on data signals, an application specific integrated circuit (ASIC) having appropriate combinational logic gates, a programmable gate array (PGA), a field programmable gate array (FPGA), and so on.

Figs. 3 and 4 are block diagrams of selected elements of an embodiment of the graphics processor 202. As noted above, the decoding system 200 may be implemented as a shader in the graphics processor 202 together with an extended instruction set and additional hardware elements, so an embodiment of the graphics processor 202 and its corresponding processing is described below. Although Figs. 3 and 4 do not show every element used for graphics processing, they are sufficient for those skilled in the art to understand the general function and architecture of such a graphics processor.

Referring to Fig. 3, at the center of the programmable processing environment is the computation core 204, which includes the decoding system 200 and processes various instructions. The computation core 204 can execute multiple shader programs, such as vertex, geometry, and pixel shader programs, or video processing. As a multithreaded processor, the computation core 204 can process multiple instructions within a single clock cycle.

In Fig. 3, the relevant elements of the graphics processor 202 include the computation core 204, a texture filtering unit 302, a pixel packer 304, a command stream processor 306, a write-back unit 308, and a texture address generator 310. The EU set control unit 206 in Fig. 3 also includes a vertex cache and/or a stream cache. In addition, the texture filtering unit 302 of Fig. 3 provides texel data to the computation core 204 (inputs A and B). In some embodiments, the texel data is 512-bit data.

The pixel packer 304 provides pixel shader inputs (PS inputs; inputs C and D) to the computation core 204, also in 512-bit data format. In addition, the pixel packer 304 requests pixel shader tasks from the EU set control unit 206, which provides an assigned execution unit number (EU#) and thread number (THREAD#) to the pixel packer 304. Because the pixel packer 304 and the texture filtering unit 302 are known in the art, they are not described further here. Although Fig. 3 shows the pixel and texel packets as 512-bit data packets, the size may vary between embodiments depending on the desired performance of the graphics processor 202.

The command stream processor 306 provides triangle vertex indices to the EU set control unit 206. In the embodiment of Fig. 3, the indices are 256-bit data. The EU set control unit 206 assembles vertex shader inputs received from the stream cache and sends the data to the computation core 204 (input E). The EU set control unit 206 also assembles geometry shader inputs and sends the data to the computation core 204 (input F). The EU set control unit 206 further controls the execution unit input (EU input) 402 and the execution unit output (EU output) 404 (Fig. 4). In other words, the EU set control unit 206 controls each input stream to and each output stream from the computation core 204.

After processing, the computation core 204 provides pixel shader outputs (PS outputs; outputs J1 and J2) to the write-back unit 308. The pixel shader outputs include color information, such as red/green/blue/alpha (RGBA) information. Given the data structure of this embodiment, the pixel shader output may be two 512-bit data streams; other embodiments may use other bit widths.

In addition to the pixel shader outputs, the computation core 204 outputs texture coordinates (TC; outputs K1 and K2), which include UVRQ information, to the texture address generator 310. The texture address generator 310 sends a texture descriptor request (T# request; input X) to the L2 cache 408 of the computation core 204, and the L2 cache 408 then outputs the texture descriptor data (T# data; output W) to the texture address generator 310. Because the texture address generator 310 and the write-back unit 308 are known in the art, they are not described further here. Although UVRQ and RGBA are shown in the drawing as 512-bit data, this parameter may also vary between embodiments. In the embodiment of Fig. 3, the bus is divided into two 512-bit channels that simultaneously carry the 128-bit RGBA color values and 128-bit UVRQ texture coordinates of 4 pixels.
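The channel width follows from simple arithmetic: 4 pixels × 128 bits = 512 bits. The sketch below packs four RGBA values into one 512-bit payload to make this concrete. It assumes each 128-bit value consists of four 32-bit floating-point components in little-endian order; the text gives only the 128-bit total, so the component format, the byte order, and the function names are hypothetical.

```python
import struct

PIXELS_PER_CHANNEL = 4
BITS_PER_VALUE = 128  # one RGBA (or UVRQ) value, as stated in the text


def pack_rgba_channel(pixels):
    """Pack four (r, g, b, a) tuples into one 512-bit channel payload.

    Assumes 4 x 32-bit little-endian floats per pixel (hypothetical layout).
    """
    if len(pixels) != PIXELS_PER_CHANNEL:
        raise ValueError("a channel carries exactly 4 pixels")
    return b"".join(struct.pack("<4f", *p) for p in pixels)


def unpack_rgba_channel(payload):
    """Inverse of pack_rgba_channel."""
    return [struct.unpack_from("<4f", payload, i * 16)
            for i in range(PIXELS_PER_CHANNEL)]


pixels = [(1.0, 0.5, 0.25, 1.0),
          (0.0, 0.0, 0.0, 1.0),
          (0.5, 0.5, 0.5, 0.5),
          (1.0, 1.0, 1.0, 0.0)]
channel = pack_rgba_channel(pixels)
channel_bits = len(channel) * 8
```

The same layout would apply to the UVRQ texture coordinates on the other channel under the same assumption.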

The graphics pipeline 208 comprises fixed-function graphics processing. For example, in response to a draw-triangle command from the driver software, vertex information passes through the vertex shader logic in the computation core 204 to perform vertex transformations, and objects are transformed from object space into world space and screen space as triangles. The triangles pass from the computation core 204 to the triangle set-up unit of the graphics pipeline 208, which assembles primitives and performs known tasks such as bounding box generation, culling, edge function generation, and triangle-level rejection. The triangle set-up unit then passes data to the span and tile generation unit in the graphics pipeline 208, which provides tile generation functionality, so that data objects are divided into tiles (for example 8 × 8 or 16 × 16) and passed to another fixed-function unit that performs depth (z-value) processing, such as high-level rejection of z-values (which uses fewer bits than low-level rejection for the same process). The z-values are then passed back to the pixel shader logic in the computation core 204, which performs pixel shader functions based on the received texture and pipeline data. The computation core 204 outputs the processed values to destination units in the graphics pipeline 208, which perform alpha testing and stencil testing before the values in the various caches are updated.

Note that 512-bit vertex cache spill data is transferred between the L2 cache 408 of the computation core 204 and the EU set control unit 206 (input G). In addition, the computation core 204 outputs two 512-bit vertex cache (VC) writes (outputs M1 and M2) to the EU set control unit 206 for further processing.

Fig. 4 shows additional and related elements of the computation core 204. The computation core 204 includes an execution unit set 412 with one or more execution units 420a to 420h (hereinafter collectively execution units 420). Each execution unit 420 can process multiple instructions in a single clock cycle, so at peak the execution unit set 412 can process multiple threads simultaneously or nearly simultaneously. Although Fig. 4 shows 8 execution units (EU0 to EU7), the number is not limited to 8 and may be larger or smaller in other embodiments. At least one execution unit (for example EU0 420a) contains the decoding system 200, as described below.

The computation core 204 also includes a memory access unit (MXU) 406, which is connected to the L2 cache 408 through a memory interface arbiter 410. The L2 cache 408 receives vertex cache spill data (input G) from the EU set control unit 206 and provides vertex cache spill data (output H) to the EU set control unit 206. In addition, the L2 cache 408 receives texture descriptor requests (T# request; input X) from the texture address generator 310 and, in response, provides texture descriptor data (T# data; output W) to the texture address generator 310.

The memory interface arbiter 410 provides a control interface to local video memory (such as the frame buffer or local memory 106). The bus interface unit 118 provides an interface to the system, which may be a PCI-E bus. The memory interface arbiter 410 and the bus interface unit 118 serve as the interface between memory and the L2 cache 408. In some embodiments, the L2 cache 408 is connected to the memory interface arbiter 410 and the bus interface unit 118 through the memory access unit 406, which translates virtual memory addresses from the L2 cache 408 and other blocks into physical memory addresses.

The memory interface arbiter 410 provides memory access (such as read/write access) for the L2 cache 408, including fetching of instructions, constants, data, and textures; direct memory access (such as load/store); indexing of temporary storage access; register spill; vertex cache content spill; and so on.

The computation core 204 also includes the execution unit input (EU input) 402 and the execution unit output (EU output) 404, which respectively provide inputs to the execution unit set 412 and receive outputs from it. The EU input 402 and the EU output 404 may be crossbars, buses, or other known input and output mechanisms.

The EU input 402 receives the vertex shader input (input E) and the geometry shader input (input F) from the EU set control unit 206 and provides this information to the execution unit set 412 for processing by the execution units 420. In addition, the EU input 402 receives the pixel shader inputs (inputs C and D) and the texel packets (inputs A and B) and conveys these packets to the execution unit set 412 for processing by the execution units 420. The EU input 402 also receives information from the L2 cache 408 (L2 read) and provides it to the execution unit set 412 when needed.

The EU output 404 in the embodiment of Fig. 4 is divided into an even output 404a and an odd output 404b. Like the EU input 402, the EU output 404 may be a crossbar, a bus, or another known architecture. The even EU output 404a handles the output of the even execution units 420a, 420c, 420e, and 420g, and the odd EU output 404b handles the output of the odd execution units 420b, 420d, 420f, and 420h. Together, the two EU outputs 404a and 404b receive the output of the execution unit set 412, such as UVRQ and RGBA data. These outputs may be directed back to the L2 cache 408, output from the computation core 204 to the write-back unit 308 through J1 and J2, or output to the texture address generator 310 through K1 and K2.
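The even/odd split of the EU output amounts to routing by the parity of the execution unit index. The following sketch makes that mapping explicit; the function name and the use of string labels for the reference numerals are hypothetical, and EU0 to EU7 are assumed to correspond to 420a to 420h in order, as Fig. 4 suggests.

```python
EU_LABELS = ["420" + suffix for suffix in "abcdefgh"]  # EU0..EU7


def route_output(eu_index: int) -> str:
    """Return the EU output (404a even, 404b odd) serving a given EU index."""
    return "404a" if eu_index % 2 == 0 else "404b"


# Group execution units by the output that handles them.
routing = {"404a": [], "404b": []}
for index, label in enumerate(EU_LABELS):
    routing[route_output(index)].append(label)
```

With eight execution units, each output serves four of them, which matches the lists given in the text.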

The execution flow of the execution unit set 412 generally comprises several levels: a context level, a thread or task level, and an instruction or execution level. At any given time, each execution unit 420 may admit two contexts, with one flag or other mechanism used to identify each context. Context information is passed from the EU set control unit 206 before tasks belonging to that context begin. Context-level information may include the shader type, the number of input/output registers, the instruction start address, the output bitmap, the vertex identifier, and the constants in each constant buffer. Each execution unit 420 in the execution unit set 412 can store multiple tasks or threads (for example 32 threads) at the same time. In one embodiment, each thread fetches instructions according to a program counter.

The EU set control unit 206 acts as a global task scheduler and uses a data-driven approach (for example, vertex, pixel, or geometry packets in the input) to assign appropriate threads in the execution units 420. For example, the EU set control unit 206 assigns a thread to an idle thread slot in an execution unit 420 of the execution unit set 412. After a thread starts executing, data supplied by the vertex cache or another element or module (depending on the shader type) is placed in the common register buffer.

Typically, the graphics processor 202 uses programmable vertex, geometry, and pixel shaders. Rather than implementing these functions as separate fixed-function units with different designs and instruction sets, the operations are executed by a set of unified execution units 420a, 420b ... 420n with a unified instruction set. Except for the execution unit 420a (which includes the decoding system 200 and therefore has additional functionality), the execution units 420 are identical in design and structure and are configured for programmed operation. In one embodiment, each execution unit 420 is capable of multithreaded operation. As vertex shaders, geometry shaders, pixel shaders, and so on generate different shader tasks, these tasks are delivered to the respective execution units 420 to be carried out. In one embodiment, the decoding system 200 may be implemented with a vertex shader, with some differences from the other execution units 420. For example, the execution unit 420a uses the decoding system 200, which other execution units (such as 420b in Fig. 4) lack. Another difference is that the decoding system 200 obtains its data from the memory access unit 406 through connection 413 and the EU input 402, because the decoding system 200 manages one or more corresponding internal buffers.

As other tasks are generated, the EU set control unit 206 assigns them to available threads in the different execution units 420, and as tasks complete, it manages the release of the associated threads. In this regard, the EU set control unit 206 is responsible for assigning vertex shader, geometry shader, and pixel shader tasks to threads of the execution units 420 and for keeping a record of the associated tasks and threads. Specifically, the EU set control unit 206 maintains a resource table (not described in detail here) of the threads and memory of all execution units 420. From it, the EU set control unit 206 knows which threads have been assigned to tasks and are occupied, which threads have been released after their tasks ended, how many common register file memory registers are occupied, and how much free space each execution unit has.

Therefore, when a task is assigned to an execution unit such as 420a, the EU set control unit 206 marks the thread as busy and subtracts the register file footprint of that thread from the total available common register file memory. The footprint is determined by the state of the vertex shader, geometry shader, or pixel shader, and each shader stage may have a different footprint size. For example, a vertex shader thread may require 10 common register file registers, while a pixel shader thread may require only 5.

When a thread completes its assigned task, the execution unit 420 running the thread sends a signal to the EU set control unit 206, which updates the resource table, marks the thread as free, and adds the thread's common register file footprint back to the available space. When all threads are busy or all common register file memory has been allocated (or the remaining register space is too small to accommodate an additional thread), the execution unit 420 is considered full, and the EU set control unit 206 assigns no new threads to it.
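The bookkeeping described in the preceding paragraphs (mark a thread busy, subtract its register footprint, add it back on completion, and treat the execution unit as full when no slot or not enough register space remains) can be sketched as below. The class `EUResourceTable`, its method names, and the default sizes are hypothetical; only the footprints of 10 and 5 registers come from the example in the text.

```python
class EUResourceTable:
    """Per-execution-unit record of busy threads and free CRF registers."""

    def __init__(self, num_threads=32, crf_registers=256):
        self.busy = [False] * num_threads
        self.footprint = [0] * num_threads
        self.free_registers = crf_registers  # assumed CRF capacity

    def is_full_for(self, footprint):
        """Full if every thread is busy or the footprint does not fit."""
        return all(self.busy) or footprint > self.free_registers

    def assign(self, footprint):
        """Assign a task; return the thread slot, or None if the EU is full."""
        if self.is_full_for(footprint):
            return None
        slot = self.busy.index(False)
        self.busy[slot] = True
        self.footprint[slot] = footprint
        self.free_registers -= footprint
        return slot

    def release(self, slot):
        """Called when the EU signals that the thread finished its task."""
        self.free_registers += self.footprint[slot]
        self.footprint[slot] = 0
        self.busy[slot] = False


VERTEX_FOOTPRINT = 10  # from the example in the text
PIXEL_FOOTPRINT = 5    # from the example in the text

table = EUResourceTable(num_threads=2, crf_registers=12)
vs_slot = table.assign(VERTEX_FOOTPRINT)   # takes 10 of 12 registers
rejected = table.assign(PIXEL_FOOTPRINT)   # only 2 registers left
table.release(vs_slot)                     # footprint is added back
ps_slot_a = table.assign(PIXEL_FOOTPRINT)
ps_slot_b = table.assign(PIXEL_FOOTPRINT)
no_thread_left = table.assign(1)           # both thread slots are busy
```

The small sizes in the usage lines are chosen only so that both "full" conditions, register exhaustion and thread exhaustion, show up in a few steps.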

Each execution unit 420 also contains a thread controller that manages or marks each thread as active (executing) or available. In this regard, in one embodiment, the EU set control unit 206 can prevent geometry shaders and pixel shaders from running while a vertex shader is executing the functions of the decoding system 200.

Fig. 5A illustrates the execution unit 420a, which has the features of the graphics processor 202 and computation core 204 described above and includes the execution unit data path 512 in which the decoding system 200 is embedded. Specifically, Fig. 5A is a block diagram of the execution unit 420a, which in one embodiment comprises an instruction cache controller 504; a thread controller 506 connected to the instruction cache controller 504; a buffer 508 (such as a constant buffer); a common register file (CRF) 510; an execution unit data path (EUDP) 512 connected to the thread controller 506, the buffer 508, and the common register file 510; an execution unit data path first-in first-out (FIFO) buffer 514; a predicate register file (PRF) 516; a scalar register file (SRF) 518; a data output controller 520; and a thread task interface 524. As noted above, the execution unit 420a receives input from the EU input 402 and provides output to the EU output 404.

The thread controller 506 provides control functions for the entire execution unit 420a, including management of each thread and arbitration functions such as deciding how threads are executed. The EUDP 512 includes the decoding system 200 and performs various calculations; it contains logic such as floating-point arithmetic logic units (ALUs) and shift logic.

The data output controller 520 moves completed data to components connected to the execution unit output 404, such as the vertex cache of the EU set control unit 206, the write-back unit 308, and so on. The EUDP 512 sends an "end of task" message to the data output controller 520 to indicate that a task is complete. The data output controller 520 includes storage for completed tasks (e.g., 32 entries) and a plurality of write ports. It selects a task from this storage, reads all output data items from the common register file 510 at the register locations specified by the shader context, and sends the data to the execution unit output 404.

The thread task interface 524 outputs the identifiers of tasks completed in the execution unit 420a to the EU set control unit 206. A task identifier notifies the EU set control unit 206 that thread resources are available in a particular execution unit, so that a new task can be assigned to that execution unit (e.g., 420a).

In one embodiment, the constant buffer 508 is divided into 16 blocks, each with 16 slots for 128-bit horizontal vector constants. A shader accesses a constant buffer slot using an operand and an index, where the index may be a register containing a 32-bit unsigned integer or an immediate 32-bit unsigned integer constant.

The instruction cache controller 504 is an interface block to the thread controller 506. When the thread controller issues a read request (e.g., to fetch executable shader code from instruction memory), the instruction cache controller 504 looks up a tag table (not shown) and performs a hit/miss test. A hit means the requested instruction is in the cache of the instruction cache controller 504; a miss means the requested instruction must be fetched from the L2 cache 408 or the memory 106. On a hit, the instruction cache controller 504 grants the request provided there is no simultaneous request from the execution unit input 402, because the instruction cache has only one read/write port and the execution unit input 402 has the highest priority. On a miss, the instruction cache controller 504 grants the request if a replaceable block is available and the EUDP FIFO 514 has space for the request to the L2 cache 408. In one embodiment, the cache of the instruction cache controller 504 comprises 32 sets of 4 blocks each. Each block has a 2-bit status signal representing one of three states: invalid, loading, or valid. A block is "invalid" before L2 data is loaded into it, "loading" while it waits for the L2 data, and "valid" once the L2 data has been fully loaded.

The predicate register file 516 is read and written through the EUDP 512. The execution unit input 402 serves as the interface for data entering the execution unit 420a. In one embodiment, the execution unit input 402 includes an 8-entry FIFO buffer for incoming data. It also passes data to the instruction cache of the instruction cache controller 504 and to the constant buffer 508, and it maintains the shader context.

The execution unit output 404 serves as the interface for sending output data from the execution unit 420a to the EU set control unit 206, the L2 cache 408, and the write-back unit 308. In one embodiment, the execution unit output 404 includes a 4-entry FIFO buffer that receives arbitrated requests and buffers data destined for the EU set control unit 206. The execution unit output 404 performs several functions, including arbitrating among instruction cache read requests, data output write requests, and EUDP read/write requests.

The common register file 510 stores input, output, and temporary data. In one embodiment, it comprises eight banks of 128 × 128-bit register files, with one-read/one-write ports and one read/write port. The one-read/one-write ports are used by the EUDP 512 for read and write accesses initiated by instruction execution. Even-numbered threads share banks 0, 2, 4, and 6, and odd-numbered threads share banks 1, 3, 5, and 7. The thread controller 506 pairs instructions from different threads and ensures that no read or write bank conflict occurs in the common register file memory.

The read/write port is used by the execution unit input 402 and the data output controller 520 to load initial thread input data and to write final thread output to the EU set control unit data buffers, the L2 cache 408, or other modules. The execution unit input 402 and the execution unit output 404 share this read/write I/O port; in one embodiment, writes have higher priority than reads. The 512 bits of input data go to four different banks to avoid conflicts when data is loaded into the common register file 510. A 2-bit channel index, passed along with the data and a 512-bit-aligned base address, specifies the starting bank of the input data. For example, if the starting channel index is 1, the first 128 bits counting from the least significant bit (LSB) are loaded into bank 1, the next 128 bits into bank 2, and so on, with the last 128 bits loaded into bank 0 (assuming the thread's bank offset is 0). Note that the two least significant bits of the thread ID are used to generate the bank offset, which randomizes the starting bank position of each thread.

The CRF register index and the thread ID can be used to form a unique logical address for tag matching when reading and writing data in the common register file 510. For example, the address can be aligned to 128 bits, the width of a common register file bank. Combining the 8-bit CRF register index with the 5-bit thread ID yields a unique 13-bit address. Each 1024-bit line has a tag, and each line holds two 512-bit entries (words). Each word is stored across four banks, and the two least significant bits of the CRF index, added to the bank offset of the current thread, select the bank.
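The address formation described above can be sketched as follows. This is a minimal illustration only: the function names, the placement of the thread ID in the upper bits, and the modulo-4 wrap of the bank selection are assumptions, not details given in the text.

```python
CRF_INDEX_BITS = 8   # width of the CRF register index (from the text)
THREAD_ID_BITS = 5   # width of the thread ID (from the text)


def logical_address(thread_id: int, crf_index: int) -> int:
    """Combine a 5-bit thread ID and an 8-bit CRF index into 13 bits.

    Assumption: the thread ID occupies the upper bits of the address.
    """
    assert 0 <= thread_id < (1 << THREAD_ID_BITS)
    assert 0 <= crf_index < (1 << CRF_INDEX_BITS)
    return (thread_id << CRF_INDEX_BITS) | crf_index


def bank_select(thread_id: int, crf_index: int) -> int:
    """Pick one of the four banks that hold a 512-bit word.

    The two LSBs of the CRF index are added to the thread's bank offset
    (the two LSBs of the thread ID); wrapping modulo 4 is assumed.
    """
    bank_offset = thread_id & 0b11
    return ((crf_index & 0b11) + bank_offset) & 0b11
```

Because the two fields occupy disjoint bit ranges, every (thread ID, CRF index) pair maps to a distinct 13-bit value, which is the property the tag matching relies on.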

Tag matching lets registers of different threads share the common register file 510, making efficient use of the memory. The EU set control unit 206 tracks the memory usage of the common register file 510 and ensures that enough space is available before a new task is scheduled to the execution unit 420a.

The destination CRF index is checked against the total size of the CRF registers allotted to the current thread. Input data should be present in the common register file 510 before the thread controller 506 commits the thread and shader execution begins. When thread execution ends, the data output controller 520 reads the output data from the common register file 510.

The embodiments of the execution unit 420 described above have an EUDP 512 that includes the decoding system 200. Fig. 5B illustrates an embodiment of the EUDP 512, which comprises a register file 526, a multiplexer 528, a vector floating-point (FP) unit 532, a vector integer arithmetic logic unit (ALU) 534, a special-purpose unit 536, a multiplexer 538, a register file 540, and the decoding system 200. The decoding system 200 comprises one or more CAVLC units 530 and can decode one or more streams. For example, a single CAVLC unit 530 can decode a single stream, two CAVLC units 530 (one shown in dashed lines, with its connections omitted for brevity) can decode two streams simultaneously, and so on. For clarity, the description below covers only the operation of a decoding system 200 with a single CAVLC unit 530; the same principles extend to more than one CAVLC unit.

As shown, the EUDP 512 comprises several parallel data paths corresponding to the CAVLC unit 530, the vector floating-point unit 532, the vector ALU 534, and the special-purpose unit 536, each of which performs its operation according to the instruction received. The register file 526 receives the operands (denoted SRC1 and SRC2). In one embodiment, the register file 526 may be the common register file 510, the predicate register file 516, and/or the scalar register file 518 shown in Fig. 5A. Note that some embodiments may use more operands. An operation (function) signal line 542 provides the means by which each of the units 530 to 536 receives the operation signal. An immediate signal line 544, coupled to the multiplexer 528, carries an immediate value encoded in the instruction for use by the units 530 to 536 in integer operations on small integer values. An instruction decoder (not shown) supplies the operands, the operation (function) signal, and the immediate signal. The multiplexer 538 at the end of the data paths (which may include a write-back stage) selects the output of the correct data path and sends it to the register file 540. The output register file 540 holds the destination and may be the same as the register file 526 or a different register file. Note that in embodiments where the source and destination registers are the same component, bits in the instruction select the source and destination so that the multiplexers route data from and to the appropriate register file.

Thus the execution unit 420a can be viewed as a multi-stage pipeline (e.g., a 4-stage pipeline with four ALUs), and CAVLC decoding takes place within the four execution phases. Stalls are inserted as needed so that the CAVLC decoding thread can proceed. For example, stalls may be added in the execution stage when the bitstream buffer underflows, while waiting for context memory initialization, while waiting for the bitstream to be loaded into the FIFO buffer and the sREG register (explained later), and/or when processing time exceeds a predetermined threshold.

In some embodiments, the decoding system 200 decodes two bitstreams simultaneously using a single execution unit 420a. For example, based on an extended instruction set, the decoding system can use two data paths (e.g., by adding another CAVLC unit 530) to decode two streams at once; more or fewer streams can of course be decoded, using correspondingly more or fewer data paths. Some embodiments of the decoding system 200 that handle multiple streams are not limited to simultaneous decoding. In addition, in certain embodiments, a single CAVLC unit 530 can perform simultaneous decoding of multiple streams.

In one embodiment using two data paths, two threads can run in the decoding system 200 at the same time. For example, in a two-stream decoding embodiment, the number of threads is limited to two: a first thread (e.g., thread 0) is assigned to a first bank of the decoding system 200 (i.e., the CAVLC unit 530), and a second thread (e.g., thread 1) is assigned to a second bank of the decoding system 200 (i.e., the dashed CAVLC unit in Fig. 5B). In some embodiments, two or more threads can run on a single bank. In addition, although the decoding system 200 is shown here embedded in the EUDP 512, it may also include other components, such as logic in the EU set control unit 206.

Having described embodiments of the execution unit 420a, the EUDP 512, and the CAVLC unit 530, the following briefly explains the decoding system in the context of H.264 CAVLC decoding. As is known, the CAVLC process encodes the level (magnitude) of a signal associated with a macroblock or part of a macroblock, together with how often (e.g., for how many cycles) that level repeats (the run), so that each value need not be encoded individually. This information is obtained and parsed from a bitstream buffer, which is refilled as the decode engine of the decoding system 200 consumes it. The decoding system 200 extracts the macroblock information, including level and run coefficients, from the bitstream and reverses the encoding process to reconstruct the signal. That is, the decoding system 200 obtains macroblock information from the bitstream buffer and parses the stream to obtain level and run coefficient values, stores them temporarily in a level array and a run array, reads out these arrays (e.g., for one 4 × 4 block of pixels in the macroblock), and then clears the level array and run array in preparation for the next block. In accordance with the H.264 standard, software can assemble a complete macroblock from the 4 × 4 blocks.

Having described the general operation of decoding macroblock information, the following sets out the various components of the decoding system 200 in the context of the CAVLC decoding process, with the understanding that variations suited to practical applications are contemplated. Those skilled in the art will recognize that many of the terms used below (such as parameter names) come from the H.264 specification; for brevity they are not explained again here except where further explanation helps in understanding the processes and/or components described.

Figs. 6A to 6C are block diagrams illustrating the decoding system 200, drawn with a single CAVLC unit 530 (in Figs. 6A to 6C, the terms CAVLC unit 530 and decoding system 200 are used interchangeably). In this embodiment the decoding system 200 decodes a single bitstream; the same principles apply to a decoding system 200 with multiple CAVLC units that decodes multiple (e.g., two) streams simultaneously. Briefly, Fig. 6A shows selected components of the CAVLC unit 530, Fig. 6B illustrates the stream buffering function of the CAVLC unit, Fig. 6C illustrates the context memory (including registers) function of the CAVLC unit 530, and Fig. 6D illustrates the window structure used in CAVLC decoding. Although the description below is given in the context of macroblock decoding, the principles can be applied to other kinds of block decoding.

Referring to Fig. 6A, the CAVLC unit 530 comprises several hardware modules: a coefficient token (coeff_token) module 610, a level code (CAVLC_LevelCode) module 612, a level (CAVLC_Level) module 614, a level-0 (CAVLC_L0) module 616, a zero-level (CAVLC_ZL) module 618, a run (CAVLC_Run) module 620, a level array (LevelArray) 622, and a run array (RunArray) 624. The decoding system also comprises a shift register (SREG) stream buffer/direct memory access (DMA) engine 602 (also shown in Fig. 6B and hereinafter called the DMA engine module), a global register 606, a general register 608, and the macroblock neighbor context (mbNeighCtx) memory 604 of Fig. 6C (in one embodiment, the mbNeighCtx memory comprises a 96-bit register that the shader can write as three 32-bit registers), along with other registers that are not shown.

The interface between the CAVLC unit 530 and the execution unit 420a comprises one or more destination buses with corresponding registers (e.g., DST registers) and two source buses with corresponding registers (SRC1, SRC2). Data on the destination bus is passed directly or indirectly (e.g., through an intermediate cache, register, buffer, or memory) to a video processing unit inside or outside the GPU 114. Data on the destination bus may be in Microsoft DX API format or another format, and includes coefficients, macroblock parameters, operation information, and/or IPCM samples or other data. The CAVLC unit 530 also comprises a memory interface consisting of an address bus and a data bus: given an address on the address bus, bitstream data is accessed through the data bus. In one embodiment, the data on the data bus comprises an unencrypted video stream containing various signal parameters and other data and formats. In some embodiments, load-store operations can be used to access the bitstream data.

Before each component of the CAVLC unit 530 is described, the overall operation of the execution unit 420a with respect to CAVLC decoding is briefly explained. Generally, depending on the slice type, the driver software 128 (Fig. 1) prepares a CAVLC shader and loads it into the execution unit 420a. To decode the bitstream, this CAVLC shader uses the standard instruction set plus the coeff_token, CAVLC_LevelCode, CAVLC_Level, CAVLC_L0, CAVLC_ZL, and CAVLC_Run instructions; the naming convention is that each module is driven by the instruction of the same name. In addition, the READ_LEVEL_RUN and CLR_LEVEL_RUN instructions perform read and clear operations on the level array 622 and run array 624. In one embodiment, the first instructions executed by the CAVLC shader, before any others are issued, are INIT_CAVLC and INIT_ADE. These two instructions, described later, cause the CAVLC unit 530 to begin CAVLC decoding of the bitstream and to start loading the FIFO buffer from the stream decode starting point. The CAVLC unit 530 thus provides bitstream parsing, initialization of the decoding hardware and register/memory structures, and level-run decoding. These H.264 CAVLC decoding functions are explained below, starting with the operation of the bitstream buffer.

With regard to bitstream parsing, the bitstream is received on the data bus of the memory interface and buffered by the SREG stream buffer/DMA engine 602. Bitstream decoding is provided from the slice data parsing stage onward. The bitstream (e.g., an NAL bitstream) comprises one or more pictures, each divided into a picture header and a number of slices; a slice usually comprises a series of macroblocks. In one embodiment, an external process (i.e., external to the CAVLC unit 530) parses the NAL bitstream, decodes the slice header, and passes a pointer to the slice data (e.g., to where the slice begins). Generally, the driver software 128 processes the bitstream up to the slice data, since this is a function provided by the application and the API. Passing the pointer to the slice data location involves the address of the first byte of the slice data (e.g., RBSPbyteAddress) and a bit offset pointer (e.g., sREGptr, of one or more bits) indicating the start or header position of the bitstream. Initialization of the bitstream is explained later. In some embodiments, the external process may run on a host processor (e.g., the CPU 126 of Fig. 1) to provide picture-level decoding and slice header decoding. In some embodiments, the decoding system 200 parses the H.264 bitstream from the picture level, while the CAVLC decoding operations proceed at the macroblock level based on the slice data; in other embodiments, owing to the programmable nature of the CAVLC unit, decoding can be performed at any level.

Referring to Fig. 6B, a block diagram of selected components of the SREG stream buffer/DMA engine 602 of the CAVLC unit 530 and of other components, the engine includes operand registers 661 and 663, which receive the SRC1 and SRC2 values respectively and pass them on to registers 656 and 667. The CAVLC logic 660 corresponds to the modules and components of Fig. 6A, excluding the SREG stream buffer/DMA engine 602, the mbNeighCtx memory 604, the global register 606, and the general register 608. The SREG stream buffer/DMA engine 602 comprises an internal bitstream buffer 602b, which in one embodiment comprises a 32-bit register and eight 128-bit registers in big-endian format. The SREG stream buffer/DMA engine 602 is set up at the start by an initialization instruction issued by the driver software 128; once started, it manages its internal buffer 602b automatically and keeps track of the bit position to be parsed.

In one embodiment, the SREG stream buffer/DMA engine 602 uses two registers: a fast 32-bit flip-flop register and a slower 512-bit or 1024-bit memory. The bitstream is consumed in bits: the shift register 602a operates on bits, while the bitstream buffer 602b operates on bytes, which saves power. An instruction operating on the shift register 602a typically consumes only a few bits (e.g., 1 to 3). Once more than one byte of data in the shift register 602a has been consumed, data is transferred in byte-sized pieces from the bitstream buffer 602b to the shift register 602a, and the register pointer is decremented by the number of bytes transferred. When the DMA engine of the SREG stream buffer/DMA engine 602 detects that 256 bits or more have been consumed, it fetches 256 bits from memory to refill the bitstream buffer 602b. The CAVLC unit 530 thus implements a simple circular buffer (four 256-bit pieces) to track and refill the bitstream buffer 602b. Some embodiments may use a single buffer, although a circular buffer then generally requires more complex pointer arithmetic to keep up with the memory speed.

Interaction with the internal buffer 602b is accomplished through an initialization instruction, referred to as INIT_BSTR. In one embodiment, the INIT_BSTR instruction (which may be issued by the driver software 128) is issued at about the same time as the INIT_CAVLC (or INIT_ADE) instruction, and a stall is asserted until bitstream data enters the buffer 602b. Once data arrives in the buffer 602b, the stall is released and subsequent processing begins. Thereafter, whenever the fill level of the buffer falls below a predetermined threshold, the DMA engine of the SREG stream buffer/DMA engine 602 continues to fetch bitstream data into the buffer 602b. Given the byte address and bit offset of the bitstream position, the INIT_BSTR instruction loads data into the internal bitstream buffer 602b and starts the management process. An instruction of the following format is issued for each call to process slice data:

INIT_BSTR offset, RBSPbyteAddress

This instruction loads data into the internal buffer 602b of the SREG stream buffer/DMA engine 602. In one embodiment, the SRC2 register 663 provides the byte address (RBSPbyteAddress) and the SRC1 register 661 provides the bit offset, so the following general instruction format can be used:

INIT_BSTR SRC2, SRC1

where SRC1 and SRC2 in this instruction (and in the other instructions below) are values held in the internal registers 661 and 663, although the instruction is not limited to these registers. In one embodiment, the bitstream data is accessed with 256-byte-aligned memory fetches, written into the buffer registers, and passed to the 32-bit shift register 602a of the SREG stream buffer/DMA engine 602. In one embodiment, the data in the bitstream buffer 602b is byte-aligned before these registers or buffers are operated on. The alignment can be performed by an align instruction, referred to as the ABST instruction, which aligns the data in the bitstream buffer 602b; the alignment bits (e.g., stuffing bits) are eventually discarded during decoding.

As the shift register 602a consumes data, the internal buffer 602b is refilled. In other words, the internal buffer 602b of the SREG stream buffer/DMA engine 602 behaves like a modulo-3 circular buffer that feeds the 32-bit register 602a of the SREG stream buffer/DMA engine 602. The CAVLC unit 530 (e.g., the CAVLC logic 660) can read data from the shift register 602a with the READ instruction, which has the following format:

READ DST, SRC1

where DST corresponds to an output or destination register. In one embodiment, the SRC1 register 661 contains an unsigned integer value n, and the READ instruction obtains n bits from the shift register 602a. When 256 bits of data have been consumed from the 32-bit shift register 602a (e.g., in decoding one or more syntax elements), a fetch is started automatically to obtain another 256 bits of data, which are written into the registers of the internal buffer 602b and then fed to the shift register 602a for subsequent consumption.
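The effect of READ on the stream can be illustrated with a small software model. This is only a sketch: the class name is hypothetical, bits are assumed to be taken most significant bit first, and the automatic 256-bit refill from the internal buffer is not modeled (the whole stream is held in one bytes object).

```python
class ShiftRegisterModel:
    """Toy MSB-first bit reader standing in for the shift register 602a."""

    def __init__(self, data: bytes):
        self.data = data
        self.bit_pos = 0  # number of bits consumed so far

    def read(self, n: int) -> int:
        """Return the next n bits as an unsigned integer and consume them."""
        if n < 0 or self.bit_pos + n > 8 * len(self.data):
            raise ValueError("not enough bits left in the stream")
        value = 0
        for _ in range(n):
            byte = self.data[self.bit_pos >> 3]
            bit = (byte >> (7 - (self.bit_pos & 7))) & 1
            value = (value << 1) | bit
            self.bit_pos += 1
        return value
```

A call such as `ShiftRegisterModel(stream).read(3)` plays the role of READ with n = 3 in SRC1; the returned integer corresponds to the value placed in DST.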

In some embodiments, if a predetermined number of bits or bytes of the data in the shift register 602a corresponding to symbol decoding has been consumed and the internal buffer 602b has not received any further data, the CAVLC logic 660 can stall so that another thread (e.g., a thread unrelated to CAVLC decoding, such as a vertex shader operation) can execute.

Using the DMA engine of the SREG stream buffer/DMA engine 602 reduces the number of buffers needed to compensate for memory latency (which in some GPUs can exceed 300 cycles). As the bitstream is consumed, requests are issued to stream in further bitstream data. If the bitstream data runs low enough that the bitstream buffer 602b is at risk of underflow (taking into account, for example, the known number of cycles for a signal to travel from the CAVLC unit 530 to the processor pipeline), a stall signal can be sent to the processor pipeline to pause operation until data arrives in the bitstream buffer 602b.

In addition, the SREG stream buffer/DMA engine 602 has a built-in ability to handle corrupted bitstreams. For example, because of a bitstream error, an end-of-slice marker may go undetected; such an error can cause decoding to go completely wrong and to run on into subsequent pictures or slices. The SREG stream buffer/DMA engine 602 keeps count of the bits consumed; if the count exceeds a predefined threshold (which can vary per slice), processing is terminated and a signal is sent to a processor (e.g., the host processor), which then executes code to attempt recovery from the error.

Two further instructions related to bitstream access are INPSTR and INPTRB. These instructions detect whether a special pattern (e.g., a data start or end pattern) appears in a slice or macroblock, and they allow the bitstream to be inspected without advancing it. In one embodiment, the instruction sequence is INPSTR, INPTRB, and then a READ instruction. The INPSTR instruction has the following format:

INPSTR DST

In one embodiment, this instruction inspects the bitstream and returns the most significant 16 bits of the shift register 602a in the lower 16 bits of the destination (DST) register; the upper 16 bits of the destination register contain the sREGbitptr value. No data is shifted out of the shift register 602a as a result of the operation. The instruction can be implemented according to the following illustrative pseudocode:

MODULE INPSTR(DST)

OUTPUT [31:0] DST

DST = {ZE(sREGbitptr), sREG[msb:msb-15]};

ENDMODULE
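For readers who prefer executable code, the INPSTR pseudocode above can be mirrored as follows. This is a sketch under assumptions: the function name and the representation of the shift register as a Python integer are illustrative, and the 32-bit width is taken from the description of the shift register 602a.

```python
def inpstr(sreg: int, sreg_bit_ptr: int, width: int = 32) -> int:
    """Peek at the top 16 bits of the shift register without shifting.

    Returns a 32-bit value whose upper half is the zero-extended bit
    pointer and whose lower half is sREG[msb:msb-15].
    """
    top16 = (sreg >> (width - 16)) & 0xFFFF
    return ((sreg_bit_ptr & 0xFFFF) << 16) | top16
```

Since the function only reads its arguments, calling it repeatedly returns the same value, which matches the statement that no data is shifted out.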

The other instruction related to the bitstream is INPTRB, which inspects the raw byte sequence payload (RBSP) trailing bits (e.g., of a byte-aligned bitstream). The INPTRB instruction reads the bitstream buffer 602b and has the following format:

INPTRB DST

In the INPTRB operation, no bits are shifted out of the shift register 602a. If the most significant bits of the shift register 602a contain, for example, the pattern 100, they include the RBSP stop bit, and the remaining bits of the byte are all zero alignment bits. The instruction can be implemented according to the following illustrative pseudocode:

MODULE INPTRB(DST)

OUTPUT DST;

REG [7:0] P;

P = sREG[msb:msb-7];

sp = sREGbitptr;

T[7:0] = (P >> sp) << sp;

DST[1] = (T == 0x80) ? 1 : 0;

DST[0] = !(CVLC_BufferBytesRemaining > 0);

ENDMODULE
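The INPTRB pseudocode above translates almost line by line into the following sketch. The function name, the integer model of the shift register, and the explicit 8-bit mask on T are assumptions made for illustration; the bit assignments follow the pseudocode.

```python
def inptrb(sreg: int, sreg_bit_ptr: int, bytes_remaining: int,
           width: int = 32) -> int:
    """Return a 2-bit result: bit 1 = trailing-bits pattern seen,
    bit 0 = no slice bytes remain in the buffer."""
    p = (sreg >> (width - 8)) & 0xFF                   # P = sREG[msb:msb-7]
    t = ((p >> sreg_bit_ptr) << sreg_bit_ptr) & 0xFF   # clear low sp bits
    trailing = 1 if t == 0x80 else 0                   # DST[1]
    empty = 0 if bytes_remaining > 0 else 1            # DST[0]
    return (trailing << 1) | empty
```

With the top byte equal to 0x80 and no bytes remaining, both result bits are set, which is the condition a shader would test to detect the end of the slice data.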

The READ instruction is used to align the data in the bitstream buffer 602b.

Having described the bitstream buffer operations of the CAVLC unit 530, the initialization of CAVLC operations is described next, in particular the initialization of the memory and register structures and of the decode engine (e.g., the CAVLC logic 660). At the start of a slice, before the syntax elements of the first macroblock are decoded, the register structures, the global register 606, the general register 608, and the CAVLC decode engine are initialized. In one embodiment, the driver software 128 issues an INIT_CAVLC instruction to perform this initialization. The INIT_CAVLC instruction can have the following format:

INIT_CAVLC SRC2, SRC1

where SRC2 contains the number of bytes of slice data to be decoded, a value that is written to the internal CVLC_bufferBytesRemaining register, and SRC1 is laid out as follows:

SRC1[15:0] = mbAddrCurr,

SRC1[23:16] = mbPerLine,

SRC1[24] = constrained_intra_pred_flag,

SRC1[27:25] = NAL_unit_type (NUT),

SRC1[29:28] = chroma_format_idc (in one embodiment, a chroma_format_idc value of 1 corresponds to the 4:2:0 format; other embodiments can use other sampling schemes),

SRC1[31:30] = undefined.
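The SRC1 layout listed above can be exercised with a short packing helper. This is a sketch for illustration: the helper names are hypothetical, the flag is spelled constrained_intra_pred_flag as in the H.264 specification, and the bits above chroma_format_idc are simply left at zero.

```python
# Field name -> (lowest bit position, width in bits), per the list above.
FIELDS = {
    "mbAddrCurr": (0, 16),
    "mbPerLine": (16, 8),
    "constrained_intra_pred_flag": (24, 1),
    "NAL_unit_type": (25, 3),
    "chroma_format_idc": (28, 2),
}


def pack_src1(**values: int) -> int:
    """Build the 32-bit SRC1 operand from named field values."""
    word = 0
    for name, value in values.items():
        low, width = FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << low
    return word


def unpack_src1(word: int) -> dict:
    """Split a 32-bit SRC1 operand back into its fields."""
    return {name: (word >> low) & ((1 << width) - 1)
            for name, (low, width) in FIELDS.items()}
```

For example, `pack_src1(mbAddrCurr=0, mbPerLine=120, chroma_format_idc=1)` builds the operand for the first macroblock of a 1920-pixel-wide 4:2:0 picture.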

For the INIT_CAVLC instruction, the SRC1 value is written to the corresponding fields of the global register 606. The instruction also writes the SRC2 value to an internal register (e.g., CVLC_bufferBytesRemaining), which is used to recover from corrupted bitstreams. For example, at the start of decoding, the CAVLC unit 530 (e.g., the SREG stream buffer/DMA engine 602) records the number of buffered bytes of the bitstream belonging to the slice. As the bitstream is consumed, the CAVLC unit 530 counts the consumption and updates the CVLC_bufferBytesRemaining value. If this value falls below 0, the buffer or bitstream is corrupted; processing is then terminated promptly and control returns to the application or the driver software 128 for recovery.

Referring to Fig. 6C, the INIT_CAVLC instruction also initializes the memory structures of the CAVLC unit 530, namely the mbNeighCtx memory 604, the left mbNeighCtx register 684, and the current mbNeighCtx register 686. In one embodiment, the mbNeighCtx memory 604 holds macroblock-based neighbor context arranged as a memory array that stores data for a row of macroblocks. The current mbNeighCtx register 686 stores the macroblock currently being decoded, and the left mbNeighCtx register 684 stores the previously decoded (left) macroblock. In addition, a top pointer 683, a left pointer 685, and a current pointer 687 (shown as arrows in Fig. 6C) point to the mbNeighCtx memory 604, the left mbNeighCtx register 684, and the current mbNeighCtx register 686 respectively. While the current macroblock is being decoded, the decoded data is stored in the current mbNeighCtx register 686. Given the context-adaptive nature of CAVLC decoding, the CAVLC_TOTC instruction decodes the current macroblock using information gathered from previously decoded macroblocks: the left macroblock, stored in the left mbNeighCtx register 684 and referenced by the left pointer 685, and the top macroblock, stored in array element [i] 681 and referenced by the top pointer 683.

The INIT_CAVLC instruction initializes the top and left pointers 683 and 685 associated with the macroblocks adjacent to the current macroblock (e.g., elements of the mbNeighCtx memory array 604). For example, the left pointer 685 can be set to 0 and the top pointer 683 to 1. The INIT_CAVLC instruction also updates the global register 606.

In one embodiment, the mbNeighCtx memory 604 comprises an array of 120 elements, denoted mbNeighCtx[0], mbNeighCtx[1], ..., mbNeighCtx[119], so that up to 120 macroblocks can be stored across the width of a picture (an HDTV picture of 1920 × 1080 pixels is 120 macroblocks wide). Those skilled in the art may use array structures of other sizes.

For example, to determine whether an adjacent macroblock (such as the left macroblock) exists (is valid), the CAVLC_TOTC instruction must perform an operation (e.g., mbCurrAddr % mbPerLine) and check whether the result is 0. In one embodiment, the following is computed:

a = (mbCurrAddr % mbPerLine)

Here mbCurrAddr is the position of the current macroblock corresponding to the binary symbol to be decoded, and mbPerLine is the number of macroblocks per row. The computation above uses a division, a multiplication, and a subtraction.

Consider following formula:

mbCurrAddr∈[0:maxMB-1]

where maxMB is 8192 and mbPerLine = 120. The division can be carried out as a multiplication by (1/mbPerLine), which is looked up in a table stored in on-chip memory (e.g., a 120 × 11-bit table). If mbCurrAddr is 13 bits wide, a 13 × 11-bit multiplier is used. In one embodiment, the result of the multiplication is rounded and its upper 13 bits are stored; a 13 × 7-bit multiplication is then carried out and its lower 13 bits are stored; finally, a 13-bit subtraction yields "a". The whole procedure takes 2 cycles. The result can be stored for use by other operations and is computed once each time mbCurrAddr changes.
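
The division-free computation of "a" described above can be sketched as follows. This is a minimal illustration rather than the circuit of the embodiment: the 20-bit fixed-point scale, the rounded-up reciprocal, and the names RECIP_SHIFT, build_recip_table, and mod_mb_per_line are assumptions chosen so that the result is exact for every 13-bit mbCurrAddr; the 11-bit and 7-bit operand widths quoted in the text are not reproduced.

```python
RECIP_SHIFT = 20  # assumed fixed-point scale; not taken from the text


def build_recip_table(max_mb_per_line=120):
    """Hypothetical lookup table: entry m holds ceil(2**RECIP_SHIFT / m)."""
    return {m: -(-(1 << RECIP_SHIFT) // m) for m in range(1, max_mb_per_line + 1)}


def mod_mb_per_line(mb_curr_addr, mb_per_line, recip_table):
    """Return mb_curr_addr % mb_per_line using multiply, shift, multiply, subtract."""
    # Division step: multiply by the tabulated reciprocal and drop the fraction.
    quotient = (mb_curr_addr * recip_table[mb_per_line]) >> RECIP_SHIFT
    # Multiply back and subtract to obtain the remainder "a".
    return mb_curr_addr - quotient * mb_per_line


recip_table = build_recip_table()
a = mod_mb_per_line(241, 120, recip_table)  # 241 = 2 * 120 + 1
left_neighbor_exists = (a != 0)             # a == 0 means first macroblock of a row
```

With the assumed scale, the rounding error of the reciprocal stays too small to change the quotient for addresses below 8192 and row widths up to 120, which is why a single table lookup per row width is sufficient in this sketch.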

In some embodiments, the modulo operation is not performed. Instead, shader logic in an execution unit (e.g., execution unit 420a, 420b, etc.) supplies the first mbAddrCurr value, which lies in the first row of the first slice. For example, the shader logic may perform the following computation:

mbAddrCurr = absoluteMbAddrCurr - n × mbPerLine

The contents of the mbNeighCtx memory 604 can be "moved" using the CWRITE instruction, whose format may be:

CWRITE SRC1

where SRC1[15:0] = mbAddrCurr. The CWRITE instruction copies the appropriate fields of the current mbNeighCtx buffer 686 to the top mbNeighCtx[i] and the left mbNeighCtx[i-1] of the mbNeighCtx[] structure 604. When (mbAddrCurr % mbPerLine == 0), the left mbNeighCtxLeft 684 is marked as unavailable (e.g., it is initialized to 0). The contents of the mbNeighCtx memory 604, the local register 608, and the total buffer 606 can all be "moved" with the CWRITE instruction. For example, the CWRITE instruction moves the relevant contents of the mbNeighCtx memory 604 to the left and top blocks of the i-th macroblock (e.g., mbNeighCtx[i] or the current macroblock) and clears the mbNeighCtx buffer 686. As described above, two indices are associated with the mbNeighCtx memory 604: the left index 685 and the top index 683. After a CWRITE instruction, the top index is incremented by 1, and the contents of the current macroblock are moved to both the top position and the left position of the array 604. This organization can reduce the number of read/write ports of the memory array to a single read/write port.
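
The bookkeeping performed by CWRITE can be pictured with the toy model below. It is a sketch under stated assumptions, not the hardware organization: the class NeighborContext, its field names, the use of None for "unavailable", and the reading that the address passed to CWRITE is that of the macroblock about to be decoded are all hypothetical choices made for illustration.

```python
class NeighborContext:
    """Toy model of the mbNeighCtx storage: one row of top entries plus left/current."""

    def __init__(self, mb_per_line):
        self.mb_per_line = mb_per_line
        self.row = [None] * mb_per_line  # mbNeighCtx[0..mb_per_line-1]; None = unavailable
        self.left = None                 # left mbNeighCtx (previously decoded macroblock)
        self.current = {}                # current mbNeighCtx (macroblock being decoded)
        self.top_index = 0

    def top(self):
        """Entry written one row earlier at the current column, if any."""
        return self.row[self.top_index]

    def cwrite(self, next_mb_addr):
        """Copy current to the top slot and to left, advance the top index, clear current."""
        snapshot = dict(self.current)
        self.row[self.top_index] = snapshot
        self.left = dict(snapshot)
        self.top_index = (self.top_index + 1) % self.mb_per_line
        if next_mb_addr % self.mb_per_line == 0:
            self.left = None             # first macroblock of a row has no left neighbor
        self.current = {}


ctx = NeighborContext(mb_per_line=3)
for addr in range(3):                    # decode one full row of three macroblocks
    ctx.current["total_coeff"] = addr + 10
    ctx.cwrite(next_mb_addr=addr + 1)
```

After the loop, the model is positioned at the first macroblock of the second row: ctx.left is None, while ctx.top() returns the entry stored for column 0 of the first row. Because each CWRITE touches only one array slot, a single read/write port is enough in this model.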

The contents of the mbNeighCtx memory 604, the local register 608, and the total buffer 606 can be updated using the INSERT instruction, whose format may be:

INSERT DST, #Imm, SRC1

In this INSERT instruction, #Imm is a 10-bit number: the lower 5 bits, #Imm[4:0], give the width of the data, and the upper 5 bits, #Imm[9:5], specify the position at which the data is inserted. The intermediate values are defined as follows:

Mask = NOT(0xFFFFFFFF << #Imm[4:0])

Data = SRC1 & Mask

SDATA = Data << #Imm[9:5]

SMask = Mask << #Imm[9:5]

The output DST is then given by:

DST = (DST & NOT(SMask)) | SDATA

For example, the INSERT instruction (e.g., INSERT $mbNeighCtxCurrent_1, #Imm10, SRC1) can be used to write to the current macroblock. This operation does not affect the left index 685 or the top index 683 (that is, only the current position is written).
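
The bit-field insertion defined by the INSERT formulas can be checked with a short sketch. The function name insert_field and the modelling of registers as 32-bit words are assumptions made for illustration; the arithmetic follows the Mask, Data, SDATA, and SMask definitions given above.

```python
WORD_MASK = 0xFFFFFFFF  # registers are modelled as 32-bit words (assumption)


def insert_field(dst, imm, src1):
    """Insert the low `width` bits of src1 into dst at bit offset `position`."""
    width = imm & 0x1F                        # #Imm[4:0]
    position = (imm >> 5) & 0x1F              # #Imm[9:5]
    mask = ~(WORD_MASK << width) & WORD_MASK  # Mask = NOT(0xFFFFFFFF << width)
    data = src1 & mask                        # Data = SRC1 & Mask
    sdata = (data << position) & WORD_MASK    # SDATA = Data << position
    smask = (mask << position) & WORD_MASK    # SMask = Mask << position
    return (dst & ~smask & WORD_MASK) | sdata


imm = (8 << 5) | 4                            # position 8, width 4
result = insert_field(0xFFFFFFFF, imm, 0x5)   # only bits 11:8 of dst change
```

In this example the four bits at positions 11:8 are replaced by 0x5 and every other bit of the destination is left untouched, which is the property that lets INSERT update a single field of the current macroblock context.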

The INSERT instruction writes to the current mbNeighCtx 686. The array element pointed to by the left index 685 is identical to the adjacent array element (adjacent to the current mbNeighCtx), i.e., mbNeighCtx[i-1]. When a CWRITE instruction is issued, all or part of the current mbNeighCtx structure is copied to the elements pointed to by the left index 685 and the top index 683, and the top index is incremented by 1 (modulo the number of macroblocks per row). During (or after) the copy operation, the current mbNeighCtx array element is cleared to zero.

The data structure held in the mbNeighCtx memory 604 is as follows:

mbNeighCtxCurrent[01:00]: 2'b: mbType

mbNeighCtxCurrent[65:02]: 4'b: TC[16]

mbNeighCtxCurrent[81:66]: 4'b: TCC[cb][4]

mbNeighCtxCurrent[97:82]: 4'b: TCC[cr][4]

When the CWRITE instruction is executed, the mbNeighCtx[] neighbor data is updated and the current mbNeighCtx 686 is initialized.
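
The field layout listed above (a 2-bit mbType followed by sixteen 4-bit TC entries and four 4-bit TCC entries for each chroma component) can be illustrated by packing it into a single integer. The helper names pack_ctx and unpack_ctx are hypothetical, and placing element 0 of each array in the lowest bits of its field is an assumption; the text gives only the bit ranges.

```python
def pack_ctx(mb_type, tc, tcc_cb, tcc_cr):
    """Pack the fields into one integer using the bit ranges listed above."""
    assert len(tc) == 16 and len(tcc_cb) == 4 and len(tcc_cr) == 4
    word = mb_type & 0x3                         # bits 1:0
    for k, value in enumerate(tc):               # bits 65:2, 4 bits per entry
        word |= (value & 0xF) << (2 + 4 * k)
    for k, value in enumerate(tcc_cb):           # bits 81:66
        word |= (value & 0xF) << (66 + 4 * k)
    for k, value in enumerate(tcc_cr):           # bits 97:82
        word |= (value & 0xF) << (82 + 4 * k)
    return word


def unpack_ctx(word):
    """Inverse of pack_ctx: recover (mb_type, tc, tcc_cb, tcc_cr)."""
    mb_type = word & 0x3
    tc = [(word >> (2 + 4 * k)) & 0xF for k in range(16)]
    tcc_cb = [(word >> (66 + 4 * k)) & 0xF for k in range(4)]
    tcc_cr = [(word >> (82 + 4 * k)) & 0xF for k in range(4)]
    return mb_type, tc, tcc_cb, tcc_cr


packed = pack_ctx(2, list(range(16)), [1, 2, 3, 4], [5, 6, 7, 8])
fields = unpack_ctx(packed)
```

The packed word occupies 98 bits in total, matching the highest bit index (97) of the listed layout.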

Having described the context memory structures used by the CAVLC unit 530, the following explains how the CAVLC unit 530, and the CAVLC_TOTC instruction in particular, use the neighboring context information to compute TotalCoeff (TC), which in turn determines which CAVLC table is used to decode a symbol. CAVLC decoding generally uses the variable-length decoding tables of the H.264 specification (hereinafter, CAVLC tables), where the CAVLC table used for decoding is selected according to the context of previously decoded symbols; each symbol may therefore use a different CAVLC table. Fig. 6D shows a basic table structure in the form of a variable-size 2D array. An array of "Table" entries is provided (one table per symbol), and each symbol is Huffman coded. The Huffman codes are stored in the following table structure:

struct Table {
    unsigned head;
    struct table {
        unsigned val;
        unsigned shv;
    } table[];
} Table[];

A matching method (the MatchVLC function) based on prefix coding is described below. CAVLC tables can generally be divided into a variable-length part and a fixed-length part, so matching can be simplified by using fixed-size indexed lookups. Within the MatchVLC function, a READ operation is performed that does not remove bits from the shift register 602a. This READ operation differs from the READ instruction described earlier (used for the bitstream buffer 602b), which advances the bitstream. In the MatchVLC function, a number of bits (fixL) is copied from the bitstream buffer 602b and then looked up in the specified table. Each entry of the specified table contains a doublet (e.g., a value and a number of bits), and this number of bits is used to advance the bitstream.

FUNCTION MatchVLC(Table, maxIdx)
    INPUT Table;
    INPUT maxIdx;
    Idx1 = CLZ(sREG);                        // count number of leading zeros
    Idx1 = (Idx1 > maxIdx) ? maxIdx : Idx1;
    fixL = Table[Idx1].head;
    SHL(sREG, Idx1 + #1);                    // shift buffer Idx1+1 bits left
    Idx2 = (!fixL) ? 0 : READ(fixL);
    (val, shv) = Table[Idx1][Idx2];
    SHL(sREG, shv);
    return val;
ENDFUNCTION

Fig. 6D is a block diagram of an example 2D array of the table structure described above, used to illustrate the MatchVLC function in the context of CAVLC decoding. The example is Table 9-5 (nC == -1) of the H.264 specification:

In pseudo-code, this table can be expressed as follows:

Table9-5[8] = {
    0, {{33, 0}},
    0, {{0, 0}},
    0, {{66, 0}},
    2, {{2, 2}, {99, 2}, {34, 2}, {1, 2}},
    1, {{4, 1}, {3, 1}},
    1, {{67, 1}, {35, 1}},
    1, {{68, 1}, {36, 1}},
    0, {{100, 0}}
};

The pseudo-code above corresponds to the 2D table of Fig. 6D. With this table structure, the MatchVLC function described above can be used for CAVLC decoding. The MatchVLC function counts the number of consecutive zeros starting from the most significant bit of the bitstream (count leading zeros, CLZ) in order to access the table for a given syntax element. In addition, the MatchVLC function implements a parameterized CLZ operation: when the CLZ value is greater than maxIdx, maxIdx is returned instead (this covers the 0000000 case in the table of Fig. 6D). Another benefit of the MatchVLC function and the table structure is that these cases need no additional instructions; they are handled by the following MatchVLC fragment: Idx1 = CLZ(sREG); // count number of leading zeros, and Idx1 = (Idx1 > maxIdx) ? maxIdx : Idx1. The consumed bits are removed by the following MatchVLC fragment: SHL(sREG, Idx1+#1); // shift buffer Idx1+1 bits left. The sub-array header is read by the following MatchVLC fragment: fixL = Table[Idx1].head, and Idx2 = (!fixL) ? 0 : READ(fixL). The number of leading zeros may be the same while the trailing bits differ in size; in one embodiment, CASEX-style case statements can therefore be used (these use more memory but give a simpler code structure).

The actual value is obtained from the table using (val, shv) = Table[Idx1][Idx2] and SHL(sREG, shv), which also gives the number of bits that the syntax element actually uses. These bits are removed from the bitstream, and the value of the syntax element is returned in the destination buffer.
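
The table lookup performed by MatchVLC can be traced with the sketch below, which applies the Table 9-5 entries given above to a bitstream modelled as a string of '0' and '1' characters. The string model, the function name match_vlc, and the handling of the clamped case (only maxIdx bits are consumed when the leading-zero count reaches maxIdx, on the assumption that the all-zero code has no terminating 1) are illustrative assumptions rather than a description of the hardware.

```python
# (head, [(val, shv), ...]) per leading-zero count, copied from Table9-5 above.
TABLE_9_5 = [
    (0, [(33, 0)]),
    (0, [(0, 0)]),
    (0, [(66, 0)]),
    (2, [(2, 2), (99, 2), (34, 2), (1, 2)]),
    (1, [(4, 1), (3, 1)]),
    (1, [(67, 1), (35, 1)]),
    (1, [(68, 1), (36, 1)]),
    (0, [(100, 0)]),
]


def match_vlc(bits, table, max_idx):
    """Decode one symbol; return (val, remaining_bits). Assumes enough bits remain."""
    clz = len(bits) - len(bits.lstrip("0"))      # count leading zeros
    if clz >= max_idx:
        idx1 = max_idx
        bits = bits[max_idx:]                    # assumption: escape code is all zeros
    else:
        idx1 = clz
        bits = bits[idx1 + 1:]                   # drop the zeros and the terminating 1
    fix_l, entries = table[idx1]
    idx2 = int(bits[:fix_l], 2) if fix_l else 0  # peek fix_l bits without consuming them
    val, shv = entries[idx2]
    return val, bits[shv:]                       # consume the bits actually used


val, rest = match_vlc("000101" + "11", TABLE_9_5, max_idx=7)
```

For the input shown, three leading zeros select the fourth row, the two peeked bits 01 select its second entry, and the two trailing bits of the following symbol are left in place.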

Having described bitstream parsing, initialization of the decoding engine and memory structures, and the VLC matching method and table structure, attention returns to Fig. 6A to describe the CAVLC decoding engine (e.g., CAVLC logic 660) and its process. Once the bitstream has been loaded and the decoding engine, memory structures, and buffers have been set up, the driver software 128 issues a CAVLC_TOTC instruction to enable the coeff_token module 610. The format of the CAVLC_TOTC instruction may be:

CAVLC_TOTC DST, S1

where S1 and DST are an input buffer and an internal output buffer, respectively. The input has the following format:

SRC1[3:0] = blkIdx

SRC1[18:16] = blkCat

SRC1[24] = iCbCr

The remaining bits are undefined. The output format is as follows:

DST[31:16] = TrailingOnes

DST[15:0] = TotalCoeff
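
The operand layout of CAVLC_TOTC can be expressed as a pair of helpers. The function names are hypothetical and the undefined source bits are simply left at 0 here; only the bit positions come from the format above.

```python
def pack_totc_src1(blk_idx, blk_cat, i_cb_cr):
    """Build the SRC1 word: blkIdx in bits 3:0, blkCat in bits 18:16, iCbCr in bit 24."""
    return (blk_idx & 0xF) | ((blk_cat & 0x7) << 16) | ((i_cb_cr & 0x1) << 24)


def unpack_totc_dst(dst):
    """Split the DST word into (TrailingOnes, TotalCoeff)."""
    trailing_ones = (dst >> 16) & 0xFFFF  # DST[31:16]
    total_coeff = dst & 0xFFFF            # DST[15:0]
    return trailing_ones, total_coeff


src1 = pack_totc_src1(blk_idx=5, blk_cat=2, i_cb_cr=1)
trailing_ones, total_coeff = unpack_totc_dst((3 << 16) | 7)
```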

Accordingly, the coeff_token module 610 receives information corresponding to mbCurrAddr, mbType, an indication of whether a chroma channel is being processed (e.g., iCbCr), and blkIdx (the block index, since a picture may be divided into many blocks). For a macroblock accessed from the bitstream buffer 602b, blkIdx indicates whether an 8 × 8 pixel block or a 4 × 4 pixel block is being processed at a given position. This information is provided by the driver software 128. The coeff_token module 610 comprises a look-up table, and the trailing ones (TrailingOnes) and total coefficients (TotalCoeff) are obtained from this look-up table according to the inputs described above. TrailingOnes indicates how many 1s occur in a row, and TotalCoeff indicates how many run/level coefficient pairs are contained in the data segment pulled from the bitstream. TrailingOnes and TotalCoeff are provided to the CAVLC_Level module 614 and the CAVLC_ZL module 618, respectively. TrailingOnes is also provided to the CAVLC_L0 module 616, which corresponds to the first level (e.g., the DC value) extracted from the bitstream buffer 602b.

The CAVLC_Level module 614 keeps track of the suffix length of the symbol (e.g., the number of trailing ones) and, in combination with levelCode, computes the level value (level[Idx]) that is stored in the level array 622 and the run array 624. The CAVLC_Level module 614 operates under the CAVLC_LVL instruction, whose format is as follows:

CAVLC_LVL DST, S2, S1

where:

S1 = Idx (16-bit),

S2 = suffixLength (16-bit), and

DST = suffixLength (16-bit).

suffixLength indicates the length of the code word, and input from the driver software 128 can provide the information that specifies suffixLength. In addition, in one embodiment, because the suffixLength value is updated, DST and S2 may be the same buffer.

Transmission (forwarding) buffers, which hold data produced internally by a particular module, such as F1 665 and F2 667 of Fig. 6B, may also be used here. Whether an instruction, and hence its corresponding module, uses a transmission buffer is indicated by transmission flags in the instruction. The symbols for the transmission buffers are F1 (the forwarded source 1 value is used; in one embodiment this is indicated by bit 26 of the instruction) and F2 (the forwarded source 2 value is used; in one embodiment this is indicated by bit 27 of the instruction). If the transmission buffers are used, the CAVLC_LVL instruction has the following example format:

CAVLC_LVL.F1.F2 DST, SRC2, SRC1

where, if F1 or F2 is set to 1, the specified forwarded source is taken as input. Transmission buffer F1 corresponds to the level index (level[Idx]) produced by the CAVLC_Level module 614, which passes through an increment module and is then input to multiplexer 630. Transmission buffer F2 corresponds to the suffixLength produced by the CAVLC_Level module 614, which is input to multiplexer 628. The other inputs of multiplexer 630 and multiplexer 628 are EU buffer inputs (labeled EU in Fig. 6A), as explained below.
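
The operand selection controlled by the F1 and F2 flags can be modelled as two multiplexers. The sketch below is illustrative only: the function name level_inputs and its parameters are hypothetical, and the instruction is modelled as a plain integer in which bits 26 and 27 carry the flags, as in the embodiment described above.

```python
F1_BIT = 26  # forwarded source 1 flag
F2_BIT = 27  # forwarded source 2 flag


def level_inputs(instruction_word, eu_idx, eu_suffix_length, fwd_idx, fwd_suffix_length):
    """Pick (Idx, suffixLength) from the EU buffers or from the forwarded values."""
    use_f1 = (instruction_word >> F1_BIT) & 1
    use_f2 = (instruction_word >> F2_BIT) & 1
    idx = fwd_idx + 1 if use_f1 else eu_idx  # forwarded index passes the increment module
    suffix_length = fwd_suffix_length if use_f2 else eu_suffix_length
    return idx, suffix_length


first = level_inputs(0, eu_idx=0, eu_suffix_length=1, fwd_idx=9, fwd_suffix_length=9)
both = level_inputs((1 << F1_BIT) | (1 << F2_BIT),
                    eu_idx=0, eu_suffix_length=1, fwd_idx=3, fwd_suffix_length=2)
```

A first instruction of a loop would typically take its operands from the EU buffers (no flag set), while subsequent iterations set both flags so that the index and suffix length produced by the previous iteration are reused without a round trip through the EU buffers.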

The CAVLC_Level module 614 has a further input, levelCode, which is provided by the CAVLC_LevelCode module 612. The CAVLC_LevelCode module 612 and the CAVLC_Level module 614 operate together to decode the level value (the transform coefficient value before scaling). The instruction that enables the CAVLC_LevelCode module 612 has the following format:

CAVLC_LC SRC1

where SRC1 = suffixLength (16-bit). If the transmission buffer F1 665 is used, the instruction is expressed as follows:

CAVLC_LC.F1 SRC1

If F1 is set, the forwarded SRC1 is taken as input. Referring to Fig. 6A, if F1 is set (e.g., F1 = 1), the CAVLC_LevelCode module 612 takes the forwarded SRC1 value (e.g., the suffixLength from the CAVLC_Level module 614) as input; otherwise (e.g., F1 = 0), the value of the EU buffer is used as input.

Returning to the CAVLC_Level module 614, the suffixLength input can be forwarded from the CAVLC_Level module 614 through multiplexer 628, or provided to multiplexer 628 from an EU buffer. Likewise, the Idx input can be forwarded from the CAVLC_Level module 614 through multiplexer 630 (incremented, or auto-incremented, by the increment module), or provided to multiplexer 630 from an EU buffer. The CAVLC_Level module 614 also receives the levelCode input directly from the CAVLC_LevelCode module 612. Besides its outputs to the transmission buffers, the CAVLC_Level module 614 also provides a level index (level[idx]) output to the level array 622.

As noted above, the TrailingOnes output (e.g., the DC value) is provided to the CAVLC_L0 module 616, which is enabled by the following instruction:

CAVLC_LVL0 SRC

where SRC = trailingOnes(coeff_token). The output of the CAVLC_L0 module 616 comprises a level index (Level[Idx]) that is provided to the level array 622. Coefficient values are coded as a sign and a magnitude. The CAVLC_L0 module 616 provides the sign of the coefficient; the magnitude provided by the CAVLC_Level module 614 is combined with the sign provided by the CAVLC_L0 module 616 and written to the level array 622, with the level index (level[idx]) specifying the write position. In one embodiment, the coefficients of each sub-block form a 4 × 4 matrix (a block is 8 × 8) but are not in raster order, and the array is later converted to a 4 × 4 matrix. In other words, the decoded coefficient levels and runs are not in raster format. From the level-run data, the 4 × 4 matrix can be reconstructed (in zig-zag scan order) and then rearranged into a 4 × 4 matrix in raster order.
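
The rearrangement from zig-zag scan order into a raster-order 4 × 4 matrix can be sketched as follows. The text does not list the scan table, so the sketch assumes the 4 × 4 zig-zag scan used for frame-coded blocks in H.264; the names ZIGZAG_4X4 and zigzag_to_raster are hypothetical.

```python
# Raster index (row * 4 + column) visited at each zig-zag scan position (assumed scan).
ZIGZAG_4X4 = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]


def zigzag_to_raster(coeffs):
    """Place up to 16 coefficients given in scan order into a 4x4 raster matrix."""
    block = [0] * 16
    for scan_pos, value in enumerate(coeffs):
        block[ZIGZAG_4X4[scan_pos]] = value
    return [block[row * 4:(row + 1) * 4] for row in range(4)]


matrix = zigzag_to_raster(list(range(16)))  # scan position used as the coefficient value
```

Feeding the scan positions 0 to 15 as coefficient values makes the traversal visible in the result: the first row of the matrix holds the values 0, 1, 5, 6.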

The TotalCoeff output of the coeff_token module 610 is provided to the CAVLC_ZL module 618, which is enabled by the following instruction:

CAVLC_ZL DST, SRC1

where SRC1 = maxNumCoeff (16-bit) and DST = ZerosLeft (16-bit). maxNumCoeff is defined by the H.264 standard and is passed as a source value of the instruction; in other words, maxNumCoeff is set by software. In some embodiments, maxNumCoeff is stored in hardware. The transform coefficients are coded as (level, run) pairs, where the run represents the number of coefficients (levels) coded as 0. The CAVLC_ZL module 618 provides two outputs, ZerosLeft and Reset (reset = 0), to multiplexers 640 and 642, respectively. Multiplexer 640 also receives the transmission buffer F2 from the CAVLC_Run module 620, and multiplexer 642 receives the incremented (via an increment module) value of the transmission buffer F1 from the CAVLC_Run module 620.

The CAVLC_Run module 620 receives the ZerosLeft and Idx inputs from multiplexers 640 and 642, respectively, and outputs a run index (Run[Idx]) to the run array 624. As noted above, the coefficients are coded as (level, run) pairs because run-length coding provides further compression. For example, the values 10 12 12 15 19 1 1 1 0 0 0 0 0 0 1 0 can be coded as (10,0) (12,1) (15,0) (19,0) (1,2) (0,5) (1,0) (0,0), and this code word is usually shorter. The index corresponds to the level index. The CAVLC_Run module 620 is enabled by the following instruction:

CAVLC_RUN DST, S2, S1

where DST and S2 may be the same buffer, since the ZerosLeft value is updated. The unsigned operands of CAVLC_Run are as follows:

S1 = Idx (16-bit),

S2 = ZerosLeft (16-bit),

DST = ZerosLeft (16-bit)
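
The run-length example given above, in which each pair holds a value and the number of additional repetitions of that value, can be reproduced with the sketch below. It mirrors only the illustrative pairing of the example, not the run_before coding of the H.264 bitstream itself; the function names to_pairs and from_pairs are hypothetical.

```python
def to_pairs(values):
    """Collapse consecutive equal values into (value, extra_repeats) pairs."""
    pairs = []
    for value in values:
        if pairs and pairs[-1][0] == value:
            pairs[-1] = (value, pairs[-1][1] + 1)
        else:
            pairs.append((value, 0))
    return pairs


def from_pairs(pairs):
    """Expand (value, extra_repeats) pairs back into the original sequence."""
    values = []
    for value, extra in pairs:
        values.extend([value] * (extra + 1))
    return values


values = [10, 12, 12, 15, 19, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
pairs = to_pairs(values)
```

The sixteen input values collapse into eight pairs, and from_pairs restores the original sequence, which shows that the pairing is lossless.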

As can be seen from Fig. 6A, if the transmission buffers are used, the CAVLC_RUN instruction has the following format:

CAVLC_RUN.F1.F2 DST, SRC2, SRC1

where, if F1 or F2 is set, the corresponding forwarded source is taken as input.

Of the two buffer arrays, the level array 622 corresponds to the levels and the run array 624 corresponds to the runs. Each array comprises 16 elements; each element of the level array 622 is a 16-bit signed value, and each element of the run array 624 is a 4-bit unsigned value. The run and level values are read from the run array 624 and the level array 622, respectively, using the following instruction:

READ_LRUN DST

where, in one embodiment, DST comprises four consecutive 128-bit buffers (e.g., EU temporary or common buffers). This operation reads the level buffer 622 and the run buffer 624 in the CAVLC unit 530 and stores them in the destination buffer DST. When the runs are read and stored, the run values are converted to 16-bit unsigned values. For example, the first two buffers hold the sixteen 16-bit level values (that is, the array stores the first 16 coefficients), and the third and fourth buffers hold the sixteen 16-bit run values. If there are more than 16 coefficients, they are decoded to memory. In one embodiment, the values are written in the following order: in the first buffer, the least significant 16 bits hold LEVEL[0], bits 16-31 hold LEVEL[1], and so on up to bits 112-127, which hold LEVEL[7]; in the second buffer, the least significant 16 bits hold LEVEL[8], and so on. The run values use the same arrangement.
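
The destination layout of READ_LRUN can be illustrated by packing sixteen levels and sixteen runs into four 128-bit values, modelled here as Python integers. The helper names pack_lrun and lane are hypothetical, and storing negative levels as 16-bit two's-complement values is an assumption about the signed representation.

```python
def pack_lrun(levels, runs):
    """Return four 128-bit integers: LEVEL[0..7], LEVEL[8..15], RUN[0..7], RUN[8..15]."""
    assert len(levels) == 16 and len(runs) == 16

    def pack8(values):
        register = 0
        for position, value in enumerate(values):
            register |= (value & 0xFFFF) << (16 * position)  # element 0 in bits 15:0
        return register

    return [pack8(levels[:8]), pack8(levels[8:]), pack8(runs[:8]), pack8(runs[8:])]


def lane(register, position, signed=False):
    """Extract one 16-bit element, optionally as a two's-complement signed value."""
    value = (register >> (16 * position)) & 0xFFFF
    return value - 0x10000 if signed and value & 0x8000 else value


levels = [-1, 2, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0]
runs = [3] + [0] * 15
registers = pack_lrun(levels, runs)
```

Here LEVEL[0] = -1 occupies the least significant 16 bits of the first value, LEVEL[8] = 7 occupies the least significant 16 bits of the second, and the 4-bit run values are widened to 16 bits in the third and fourth.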

Another instruction, used to clear the run array 624 and level array 622 buffers, has the following format:

CLR_LRUN

The software (shader program) and hardware (e.g., module) operations of the decoding system 200 (e.g., CAVLC unit 530) described above can be expressed by the following pseudo-code:

residual_block_cavlc(coeffLevel, maxNumCoeff) {
    CLR_LEVEL_RUN
    coeff_token
    if (TotalCoeff(coeff_token) > 0) {
        if (TotalCoeff(coeff_token) > 10 && TrailingOnes(coeff_token) < 3)
            suffixLength = 1
        else
            suffixLength = 0
        CAVLC_level0();
        for (i = TrailingOnes(coeff_token); i < TotalCoeff(coeff_token); i++) {
            CAVLC_levelCode(levelCode, suffixLength);
            CAVLC_level(suffixLength, i, levelCode)
        }
        CAVLC_ZerosLeft(ZerosLeft, maxNumCoeff)
        for (i = 0; i < TotalCoeff(coeff_token) - 1; i++) {
            CAVLC_run(i, ZerosLeft)
        }
        READ_LEVEL_RUN(level, run)
        run[TotalCoeff(coeff_token) - 1] = ZerosLeft
        coeffNum = -1
        for (i = TotalCoeff(coeff_token) - 1; i >= 0; i--) {
            coeffNum += run[i] + 1
            coeffLevel[coeffNum] = level[i]
        }
    }
}
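
The final loop of the pseudo-code, which scatters the decoded levels into coeffLevel using the runs, can be exercised in isolation. The sketch below assumes that the level and run arrays have already been produced by the preceding steps; the function name place_coefficients and its parameters are hypothetical.

```python
def place_coefficients(level, run, total_coeff, zeros_left, max_num_coeff=16):
    """Scatter total_coeff levels into a zero-filled list, as in the last loop above."""
    run = list(run)                      # work on a copy of the run array
    coeff_level = [0] * max_num_coeff
    if total_coeff > 0:
        run[total_coeff - 1] = zeros_left
        coeff_num = -1
        for i in range(total_coeff - 1, -1, -1):
            coeff_num += run[i] + 1      # skip run[i] zeros, then place one level
            coeff_level[coeff_num] = level[i]
    return coeff_level


coeffs = place_coefficients(level=[3, -1, 2], run=[1, 0, 0], total_coeff=3, zeros_left=2)
```

In this example the last-indexed level is placed first, after the two remaining zeros, so the three levels end up at positions 2, 3, and 5 of the coefficient list.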

It should be emphasized that the above-described embodiments, including any "preferred" embodiments, are merely possible examples of implementation, set forth only to explain the principles of the invention clearly. Variations and modifications may be made to the above-described embodiments without departing from the spirit and principles of the systems and methods described herein. All such modifications and variations are intended to be included within the scope of the invention and protected by the appended claims.

Claims

1. decode system, it comprises:

The software programmable core processing unit, it has the context adaptive variable length codes CAVLC unit that can carry out tinter, and this tinter comprises and expands the instruction group in order to the CAVLC decoding of implementing video flowing and decoded data output is provided;

Wherein this CAVLC decoding is to be accomplished jointly by the performance element data path of this tinter of this CAVLC unit, this software programmable core processing unit and the additional firmware that is used for the bit stream buffer of CAVLC processing environment,

Wherein this CAVLC unit also comprises

Coefficient mark (coeff_token) module, in order to receive macroblock information, first instruction (CAVLC_TOTC) in response to this tinter provides tail end 1 information and all coefficient information,

Level (CAVLC_Level) module, in order to receive this tail end 1 information and level sign indicating number information, second instruction (CAVLC_LVL) in response to this tinter provides suffix length information and level index (Level [Idx]) information,

Level sign indicating number (CAVLC_LevelCode) module, in order to receive this suffix length information, the 3rd instruction (CAVLC_LC) in response to this tinter provides this level sign indicating number information to this level module,

Level 0 (CAVLC_L0) module, in order to receive this tail end 1 information, the 4th instruction (CAVLC_LVL0) in response to this tinter provides the second level index (Level [Idx]) information to give the level array,

Zero level (CAVLC_ZL) module, in order to receive this all coefficient information and maximum number coefficient information, in response to the fifth instruction (CAVLC_ZL) of this tinter, provides left 0 information and a reset value to a first multiplexer and a second multiplexer,

Running (CAVLC_Run) module in order to receive this left 0 information and the second level index information from this first multiplexer and this second multiplexer, in response to the 6th instruction (CAVLC_RUN) of this tinter, provides running index (Run [Idx]) information to give the running array,

Wherein this level array and this running array response provide decoding hierarchical value and decoding running value in the 7th instruction (READ_LRUN) of this tinter, and empty in response to the 8th instruction (CLR_LRUN) of this tinter.

2. system according to claim 1, wherein this level sign indicating number module receives this suffix length information from transmission buffer or performance element buffer.

3. system according to claim 1, wherein this level sign indicating number module receives this suffix length information and this level index information from transmission buffer or performance element buffer, and this level index information is through increment operation.

4. system according to claim 1, wherein this first multiplexer and this second multiplexer receive this left 0 information and this second level index information from the first transmission buffer and the second transmission buffer respectively.

5. system according to claim 1, wherein this CAVLC unit uses bits in the instruction to determine whether the result of a previous operation stored in an internal buffer, or the data in a source operand, is to be used by one or more modules in the current operation.

6. system according to claim 1; Wherein this CAVLC unit also comprises direct memory access (DMA) (direct memory access, DMA) engine modules comprises this bit stream buffer and DMA engine in it; The instruction that this DMA engine modules is carried out to each fragment in response to this tinter; The position of the predetermined quantity in using this bit stream repeats to insert the position of this predetermined quantity automatically, and this position is corresponding to this video flowing.

7. system according to claim 6, wherein this CAVLC unit stalls this DMA engine module in response to an anticipated underflow in this bit stream buffer.

8. system according to claim 6; Wherein this DMA engine modules is used to write down the use bits number in this bit stream buffer; And, suspend this bit stream buffer computing, and control is transferred to primary processor in response to detecting this bits number greater than predetermined value.

9. coding/decoding method, it comprises step:

Tinter is loaded in the programmable core processing unit with CAVLC unit, and this tinter comprises expansion instruction group;

Carry out this tinter on this CAVLC unit, with the CAVLC decoded video streams; And

Decoded data output is provided;

Wherein this CAVLC decoding is to be accomplished jointly by the performance element data path of this tinter of this CAVLC unit, this programmable core processing unit and the additional firmware that is used for the bit stream buffer of CAVLC processing environment,

Wherein said method also comprises step:

Coefficient mark (coeff_token) module of this CAVLC unit receives macroblock information;

First instruction (CAVLC_TOTC) in response to this tinter provides tail end 1 information and all coefficient information;

The level of this CAVLC unit (CAVLC_Level) module receives this tail end 1 information and level sign indicating number information;

Second instruction (CAVLC_LVL) in response to this tinter provides suffix length information and level index (Level [Idx]) information;

Level sign indicating number (CAVLC_LevelCode) module of this CAVLC unit receives this suffix length information;

The 3rd instruction (CAVLC_LC) in response to this tinter provides this level sign indicating number information to this level module;

Level 0 (CAVLC_L0) module of this CAVLC unit receives this tail end 1 information;

The 4th instruction (CAVLC_LVL0) in response to this tinter provides the second level index information (Level [Idx]) to give the level array;

Zero level (CAVLC_ZL) module of this CAVLC module receives this all coefficient information and maximum number coefficient information;

In response to the fifth instruction (CAVLC_ZL) of this tinter, provide left 0 information and a reset value to a first multiplexer and a second multiplexer;

The running of this CAVLC unit (CAVLC_Run) module receives this left 0 information and the second level index information from this first multiplexer and this second multiplexer respectively; And

In response to the 6th instruction (CAVLC_RUN) of this tinter, provide running index (Run [Idx]) to give the running array;

10. method according to claim 9, wherein this level sign indicating number module receives this suffix length information and this level index information from transmission buffer or performance element buffer, and this level index information is through increment operation.

11. method according to claim 9, wherein this first multiplexer and this second multiplexer receive this left 0 information and this second level index information from the first transmission buffer and the second transmission buffer respectively.

12. method according to claim 9 also comprises step:

This CAVLC unit uses bits in the instruction to determine whether the result of a previous operation stored in an internal buffer, or the data in a source operand, is to be used by one or more modules in the current operation.

13. method according to claim 9 wherein also comprises step:

In response to the instruction that this tinter is carried out to each fragment, the position of the predetermined quantity in using this bit stream repeats to insert the position of this predetermined quantity automatically, and this position is corresponding to this video flowing.

14. method according to claim 13 also comprises step:

In response to an anticipated underflow in this bit stream buffer, postpone the use of the bits in this bit stream buffer.

15. method according to claim 13 also comprises step:

Write down the use bits number in this bit stream buffer, and, suspend this bit stream buffer computing, and control is transferred to primary processor in response to detecting this bits number greater than predetermined value.