TW200803526A

TW200803526A - Decoding of context adaptive variable length codes in computational core of programmable graphics processing unit

Info

Publication number: TW200803526A
Application number: TW96120899A
Authority: TW
Inventors: Zahid Hussain; Duc Huy Bui
Original assignee: Via Tech Inc
Priority date: 2006-06-08
Filing date: 2007-06-08
Publication date: 2008-01-01
Also published as: TWI344795B; TWI348653B; TW200821982A; TW200809689A; TWI354239B; CN101072353A; TW200813884A; CN101072353B; CN101072350B; TWI428850B; CN101072349B; CN101072350A; CN101072349A; CN101087411A

Abstract

Various embodiments of decoding systems and methods are disclosed. One system embodiment, among others, comprises a software programmable core processing unit having a context-adaptive variable length coding (CAVLC) unit configured to execute a shader, the shader configured to implement CAVLC decoding of a video stream and provide a decoded data output.

Description

IX. Description of the Invention

[Technical Field of the Invention]

The present invention relates to data processing systems, and more particularly to programmable graphics processing systems and methods.

[Prior Art]

Computer graphics is the art and science of generating pictures, images, or other graphical or pictorial information with a computer. Current graphics systems mostly include several interfaces, such as Microsoft's Direct3D interface and OpenGL, which make it possible to control multimedia hardware, such as a graphics accelerator or graphics processing unit (GPU), on a computer running a particular operating system (such as Microsoft WINDOWS). The generation of pictures and images is commonly called rendering, and the details of such operations are generally handled by a graphics accelerator. In three-dimensional (3D) computer graphics, the geometry that makes up the surfaces (or volumes) of objects in a scene is converted into pixels (picture elements) [...]

There is also a demand for more realistic images, and many standards have now been developed [...]

[...] able to store video of the same video quality with only a fraction of the number of bits. The H.264 standard provides two entropy coding processes: context-adaptive binary arithmetic coding (CABAC) and context-adaptive variable length coding (CAVLC). CAVLC is a context-adaptive variation of Huffman coding in which the probability of each coded symbol changes according to the kind of data being coded. CAVLC uses run-level coding to represent strings of zeros compactly. It also provides for coding of high-frequency +/-1 coefficients and exploits the correlation between the numbers of non-zero coefficients in neighboring blocks. In CAVLC, syntax elements at or below the slice layer are adaptively coded (e.g., the second Hadamard transform of the DC coefficients of the 4x4 transform). Current CAVLC decoding architectures can meet some consumer needs, but they remain restrictive in design.
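The run-level idea mentioned above can be illustrated with a minimal Python sketch. It turns a scanned coefficient list into (run, level) pairs, where the run counts the zeros that precede each non-zero level. This only illustrates run-level coding in general; it is not the H.264 CAVLC bitstream syntax, and the function names are hypothetical.

```python
def run_level_encode(coeffs):
    """Return (run, level) pairs; run = number of zeros before each non-zero level."""
    pairs = []
    run = 0
    for c in coeffs:
        if c == 0:
            run += 1
        else:
            pairs.append((run, c))
            run = 0
    return pairs  # trailing zeros are implied by the known block length


def run_level_decode(pairs, length):
    """Rebuild a coefficient list of the given length from (run, level) pairs."""
    coeffs = []
    for run, level in pairs:
        coeffs.extend([0] * run)
        coeffs.append(level)
    coeffs.extend([0] * (length - len(coeffs)))  # restore the trailing zeros
    return coeffs
```

For a 16-coefficient block that is mostly zero, the pair list is much shorter than the block itself, which is the compactness the text refers to.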
[Summary of the Invention]

The present invention discloses decoding systems and methods for context-adaptive variable length coding (CAVLC) (hereinafter simply referred to as the decoding system), applied in a multithreaded parallel computational core of a graphics processing unit (GPU). Briefly, in one embodiment, the system comprises a software programmable core processing unit having a CAVLC unit configured to execute a shader; the shader implements CAVLC decoding of a video stream and provides a decoded data output.

A method embodiment comprises the following steps: loading a shader into a programmable core processing unit having a CAVLC unit, executing the shader with the CAVLC unit to CAVLC-decode a video stream, and providing a decoded data output.

Other systems, methods, features, and advantages will become apparent to those skilled in the art upon examination of the following drawings and detailed description. All such additional systems, methods, features, and advantages fall within the scope of the present invention and are protected by the accompanying claims.

[Embodiments]

The present invention discloses various embodiments of decoding systems and methods for context-adaptive variable length coding (CAVLC) (hereinafter collectively referred to as the decoding system). In one embodiment, the decoding system is embedded in one or more execution units of a programmable, multithreaded, parallel computational core of a graphics processing unit (GPU). Decoding is accomplished by a combination of software and hardware; that is, video decoding is performed within the context of GPU programming, in cooperation with hardware implemented in the GPU data path. For example, decoding operations are carried out by a shader (e.g., a vertex shader) with an extended instruction set.

The shader works together with the execution unit data path of the GPU and with additional hardware that automatically manages the bitstream buffer in the CAVLC processing environment. This contrasts with known legacy systems, whose purely hardware-based or purely software-based CAVLC processing limits implementation flexibility.
For example, an implementation based purely on a digital signal processor (DSP) or a microprocessor has no hardware for symbol decoding and bitstream management.

In addition, the automatic bitstream buffer offers several advantages. For example, once the direct memory access (DMA) engine of the bitstream buffer has been given the location (address) of the bitstream, it manages the bitstream automatically without further instructions. Unlike in conventional microprocessor systems, bitstream management therefore no longer represents substantial overhead. Furthermore, by keeping track of the number of bits consumed, the bitstream buffer mechanism can detect and handle corrupted bitstreams.

Another advantage of the decoding system is reduced instruction latency. Because CAVLC decoding is highly sequential, it is hard to exploit multithreading, so various embodiments use a forwarding mechanism, such as register forwarding, to reduce latency. To explain further, deep-pipelined, multithreaded processors cannot execute instructions of a single thread in every cycle. Some systems use general forwarding, which compares the operand address of the previously produced result with the operand address of the current instruction and, if they match, uses the previously produced operand. Such general forwarding requires complex comparison and multiplexing. Certain embodiments of the decoding system use a different kind of forwarding: whether an operand takes the result of the previous computation (e.g., held in an internal register) or the data of the source operand is encoded with bits in the instruction (e.g., two bits in total, one bit per operand). In this way, overall latency is reduced and the efficiency of the processor pipeline improves.
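The instruction-encoded forwarding described above can be pictured with a toy Python model. Everything in it is an assumption made for illustration: the `Instr` layout, the meaning of each bit in the hypothetical `fwd` field, and the `writeback_delay` that postpones register-file writes. A real pipeline would stall rather than read a stale register; the stale read here only makes the effect of the two bits visible.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    op: str       # "add" or "sub" in this toy model
    dst: int
    src1: int
    src2: int
    fwd: int = 0  # hypothetical 2-bit field: bit 0 -> operand 1, bit 1 -> operand 2


def execute(program, regs, writeback_delay=2):
    """Run the program; register-file writes become visible only after a delay."""
    regs = list(regs)   # work on a copy of the register file
    internal = 0        # internal register holding the previous result
    pending = []        # (instructions_left, dst, value) not yet written back
    for ins in program:
        # Each forwarding bit selects the previous result instead of the register file.
        a = internal if ins.fwd & 0b01 else regs[ins.src1]
        b = internal if ins.fwd & 0b10 else regs[ins.src2]
        internal = a + b if ins.op == "add" else a - b
        still_pending = []
        for left, dst, value in pending:
            if left <= 1:
                regs[dst] = value  # this write-back has now completed
            else:
                still_pending.append((left - 1, dst, value))
        pending = still_pending
        pending.append((writeback_delay, ins.dst, internal))
    for _, dst, value in pending:  # drain remaining writes at the end
        regs[dst] = value
    return regs
```

No address comparison is needed at run time: the choice between the previous result and the register file is fixed by the bits in the instruction, which is the simplification the text attributes to this scheme.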
The decoding system described herein can operate in accordance with the well-known International Telecommunication Union Telecommunication Standardization Sector (ITU-T) H.264 standard. Various embodiments of the decoding system operate by executing one or more instruction sets received from GPU frame buffer memory or from the memory of a host processor (e.g., a central processing unit (CPU)), for instance through well-known mechanisms such as preload or on a cache miss.

FIG. 1 is a block diagram of an embodiment of a graphics processor system 100 in which the decoding systems and methods are implemented. In some implementations, the graphics processor system 100 may be a computer system. The graphics processor system 100 may comprise a display device 102 driven by a display interface unit (DIU) 104 and a local memory 106 (which may comprise a display buffer, frame buffer, texture buffer, command buffer, etc.). The terms frame buffer and storage unit may be used interchangeably for the local memory 106. The local memory 106 is coupled to a graphics processing unit (GPU) 114 through one or more memory interface units (MIU) 110. In one embodiment, the MIU 110, the GPU 114, and the DIU 104 are coupled to a peripheral component interconnect express (PCI-E) compatible bus interface unit (BIU) 118. In one embodiment, the BIU 118 may employ a graphics address remapping table (GART), although other memory mapping mechanisms may be used. The GPU 114 includes the decoding system 200, described further below. Although some embodiments show the decoding system 200 as a single component within the GPU 114, the decoding system 200 may in fact include additional components of the graphics processor system 100, whether shown or not shown.

The BIU 118 is coupled to a chipset 122 (e.g., a north bridge chipset) or switch. The chipset 122 includes interface electronics to strengthen signals received from a central processing unit (CPU) 126 (also referred to as the host processor) and to separate signals to and from a system memory 124 from those to and from input/output (I/O) devices. Although a PCI-E bus protocol is described here, other connection and/or communication schemes between the host processor and the GPU 114 may be used (e.g., PCI, a proprietary high-speed bus, etc.). The system memory 124 also comprises driver software 128, which uses the CPU 126 to communicate instructions to registers in the GPU 114.

In some embodiments, additional graphics processing units may be employed [...]. An embodiment may include all of the components shown in FIG. 1, or it may omit, add, or change certain components; for example, a south bridge chipset coupled to the chipset 122 may be added.

Referring to FIG. 2, which is a block diagram of an exemplary processing environment in which an embodiment of the decoding system 200 is implemented, the GPU 114 includes a graphics processor 202. The graphics processor 202 comprises multiple execution units (EUs) and a computational core 204 (i.e., a software programmable core processing unit). In one embodiment, the computational core 204 comprises the decoding system 200 embedded in its execution unit data path.

The decoding system 200 (i.e., the CAVLC unit) resides in an execution unit data path (EUDP) that is distributed to one or more execution units. The graphics processor 202 also comprises an execution unit pool control and vertex/stream cache unit 206 (hereinafter the EU pool control unit 206) and a graphics pipeline 208 with fixed-function logic (e.g., including a triangle setup unit (TSU), a span-tile generator (STG), etc.). The computational core 204 comprises a pool of multiple execution units to meet the computing requirements imposed by the shader tasks of different shader programs, which may include vertex shaders, geometry shaders, and/or pixel shaders, so that data can be processed for the graphics pipeline 208. Because the shaders of the computational core 204 perform most of the functions of the decoding system 200, an embodiment of the graphics processor 202 is described in detail below, followed by the details of the decoding system 200.

The decoding system may be implemented in hardware, software, firmware, or a combination thereof. In a preferred embodiment, the decoding system 200 may comprise hardware or software using any of the following known technologies or a combination thereof: discrete logic circuits having logic gates for implementing logic functions upon data signals, an application specific integrated circuit (ASIC) having appropriate combinational logic gates, a programmable gate array (PGA), a field programmable gate array (FPGA), and so on.

Refer to FIG. 3 and FIG. 4, which are block diagrams of selected components of an embodiment of the graphics processor 202. As noted above, the decoding system 200 may be a shader within the graphics processor 202, supplemented with an extended instruction set and additional hardware components. Embodiments of the graphics processor 202 and the corresponding processing are described below. Although FIG. 3 and FIG. 4 do not show all the components used for graphics processing, they suffice for those skilled in the art to understand the general functions and architecture of such graphics processors.

Referring to FIG. 3, at the center of the programmable processing environment is the computational core 204, which comprises the decoding system 200 and processes various instructions. The computational core 204 can execute or map various shader programs, such as vertex, geometry, and pixel shader programs. As a multithreaded processor, the computational core 204 can process multiple instructions in a single clock cycle.
In FIG. 3, the relevant components of the graphics processor 202 comprise the computational core 204, a texture filtering unit 302, a pixel packer 304, a command stream processor 306, a write-back unit 308, and a texture address generator 310. The EU pool control unit 206 in FIG. 3 also includes a vertex cache and/or a stream cache. The texture filtering unit 302 provides texel data to the computational core 204 (inputs A and B); in some embodiments, the texel data is provided as 512-bit data.

The pixel packer 304 provides pixel shader inputs (PS inputs, inputs C and D) to the computational core 204, also in 512-bit data format. In addition, the pixel packer 304 requests pixel shader tasks from the EU pool control unit 206, which provides an assigned execution unit number (EU#) and thread number (thread#) to the pixel packer 304. Because the pixel packer 304 and the texture filtering unit 302 are known in the art, they are not described further here. Although FIG. 3 shows the pixel and texel packets as 512-bit data packets, the size can vary among embodiments depending on the performance desired of the graphics processor 202.

The command stream processor 306 provides triangle vertex indices to the EU pool control unit 206.

In the embodiment of FIG. 3, the triangle vertex indices supplied by the command stream processor 306 are 256 bits of data. The EU pool control unit 206 assembles vertex shader inputs received from the stream cache and sends the data to the computational core 204 (input E). It also assembles geometry shader inputs and sends them to the computational core 204 (input F). The EU pool control unit 206 further controls the execution unit input (EU input) 402 and the execution unit output (EU output) 404 (FIG. 4). In other words, the EU pool control unit 206 controls the respective input and output flows of the computational core 204.

After processing, the computational core 204 provides pixel shader outputs (outputs J1 and J2) to the write-back unit 308. The pixel shader outputs include color information, such as red/green/blue/alpha (RGBA) information. Given the data structure of this embodiment, the pixel shader output can be provided as two 512-bit data streams; other embodiments may use other bit widths.
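As a quick check of the packet arithmetic, the Python sketch below packs four RGBA pixels into one 512-bit packet. The description of FIG. 3 gives 128 bits of RGBA per pixel and four pixels per 512-bit channel; the split into four 32-bit floating-point components, the little-endian byte order, and the function names are assumptions made only for this illustration.

```python
import struct

PIXELS_PER_PACKET = 4      # four pixels per 512-bit channel (from the text)
COMPONENTS_PER_PIXEL = 4   # R, G, B, A; 32 bits each is an assumption


def pack_rgba_packet(pixels):
    """Pack four (r, g, b, a) tuples into one 64-byte (512-bit) packet."""
    if len(pixels) != PIXELS_PER_PACKET:
        raise ValueError("expected exactly four pixels")
    flat = [component for pixel in pixels for component in pixel]
    return struct.pack("<16f", *flat)  # 16 floats x 32 bits = 512 bits


def unpack_rgba_packet(packet):
    """Split a 64-byte packet back into four (r, g, b, a) tuples."""
    flat = struct.unpack("<16f", packet)
    return [tuple(flat[i:i + COMPONENTS_PER_PIXEL])
            for i in range(0, len(flat), COMPONENTS_PER_PIXEL)]
```

The arithmetic is 4 pixels x 4 components x 32 bits = 512 bits, which matches the packet width shown in FIG. 3 under the stated assumptions.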
In addition to the pixel shader outputs, the computational core 204 outputs texture coordinates (TC, outputs K1 and K2), which include UVRQ information, to the texture address generator 310. The texture address generator 310 issues a texture descriptor request (T# request, input X) to the L2 cache 408 of the computational core 204, and the L2 cache 408 returns texture descriptor data (T# data, output W) to the texture address generator 310. Because the texture address generator 310 and the write-back unit 308 are known in the art, they are not described further here.

Although the figure shows the UVRQ and RGBA data as 512 bits, this parameter may vary among embodiments. In the embodiment of FIG. 3, the bus is split into two 512-bit channels that carry the 128-bit RGBA color values and the 128-bit UVRQ texture coordinates of four pixels.

The graphics pipeline 208 comprises fixed-function graphics processing functionality. For example, in response to a command from the driver software to draw a triangle, vertex information is passed through the vertex shader logic in the computational core 204 to perform vertex transformations, and objects are transformed from object space to work space and screen space as triangles. The triangles pass from the computational core 204 to the triangle setup unit of the graphics pipeline 208, which assembles primitives and performs known tasks such as bounding box generation, culling, edge function generation, and triangle-level rejection. The triangle setup unit then passes data to the span and tile generation unit of the pipeline, which provides tile generation functionality. The data objects are thereby divided into tiles (e.g., 8x8, 16x16, etc.) and passed to another fixed-function unit for depth (z-value) processing, such as high-level rejection of z-values (the same procedure uses fewer bits at the high level than at the low level). The z-values are then passed back to the pixel shader logic of the computational core 204, which performs pixel shader functions based on the received texture and pipeline data. The computational core 204 outputs the processed values to destination units located in the graphics pipeline 208. The destination units perform alpha testing and stencil testing before the values in the various caches are updated.

Note that a 512-bit vertex cache spill transfer (input G) exists between the L2 cache 408 of the computational core 204 and the EU pool control unit 206. In addition, the computational core 204 provides two 512-bit vertex cache writes to the EU pool control unit 206 for further handling.
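The bounding-box and tile-generation steps attributed above to the triangle setup unit and the span and tile generation unit can be sketched in Python as follows. This is a conservative illustration that lists every tile touched by the triangle's bounding box, not the exact coverage a hardware tile generator would compute; the function names and the default 8x8 tile size are chosen only for the example.

```python
import math


def bounding_box(triangle):
    """Axis-aligned bounding box (x_min, y_min, x_max, y_max) of three (x, y) vertices."""
    xs = [x for x, _ in triangle]
    ys = [y for _, y in triangle]
    return min(xs), min(ys), max(xs), max(ys)


def covered_tiles(triangle, tile=8):
    """Return (column, row) indices of the tiles overlapped by the bounding box."""
    x_min, y_min, x_max, y_max = bounding_box(triangle)
    columns = range(math.floor(x_min / tile), math.floor(x_max / tile) + 1)
    rows = range(math.floor(y_min / tile), math.floor(y_max / tile) + 1)
    return [(column, row) for row in rows for column in columns]
```

A later stage would test each listed tile against the triangle's edge functions and discard the tiles that fall outside the triangle.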
Referring to FIG. 4, which shows further components of the computational core 204, the computational core 204 comprises an execution unit pool (EU pool) 412 having one or more execution units (EUs) 420a through 420h (collectively referred to as EUs 420). Each EU 420 can process multiple instructions within a single clock cycle, so at its peak the EU pool 412 can process multiple threads simultaneously or substantially simultaneously. Although FIG. 4 shows eight EUs (EU0 through EU7), the number is not limited to eight and may be greater or smaller in other embodiments. At least one of the EUs (e.g., EU0 420a) has an embedded decoding system 200, as explained in detail below.

The computational core 204 also comprises a memory access unit (MXU) 406, which is coupled to the L2 cache 408 through a memory interface arbiter 410. The L2 cache 408 receives vertex cache spill data (input G) from the EU pool control unit 206 and provides vertex cache spill data (output H) to the EU pool control unit 206. In addition, the L2 cache 408 receives texture descriptor requests (T# requests, input X) from the texture address generator 310 and, in response, provides texture descriptor data (T# data, output W) to the texture address generator 310.

The memory interface arbiter 410 provides a control interface to the local video memory (e.g., the frame buffer or local memory 106). The BIU 118 provides an interface to the system, which may be a PCI-E bus. The memory interface arbiter 410 and the BIU 118 serve as the interface between the memory and the L2 cache 408.
In some embodiments, the L2 cache 408 is coupled to the memory interface arbiter 410 and the BIU 118 through the memory access unit 406. The memory access unit 406 translates virtual memory addresses from the L2 cache 408 and other blocks into physical memory addresses.

The memory interface arbiter 410 provides memory access (e.g., read/write access) for the L2 cache 408, fetching of instructions/constants/data/textures, direct memory access (e.g., load/store), indexing of temporary storage access, register spill, vertex cache content spill, and so on.

The computational core 204 also comprises an EU input 402 and an EU output 404, which respectively provide inputs to the EU pool 412 and receive outputs from the EU pool 412. The EU input 402 and the EU output 404 may be crossbars or buses, or other known input/output mechanisms.

The EU input 402 receives the vertex shader input (input E) and the geometry shader input (input F) from the EU pool control unit 206 and provides that information to the EU pool 412 for processing by the individual EUs 420. In addition, the EU input 402 receives the pixel shader inputs (inputs C and D) and the texel packets (inputs A and B) and conveys those packets to the EU pool 412 for processing by the individual EUs 420. The EU input 402 also receives information from the L2 cache 408 (L2 read) and provides it to the EU pool 412 as needed.
The EU output 404 in the embodiment of FIG. 4 is divided into an even output 404a and an odd output 404b. Like the EU input 402, the EU output 404 may be a crossbar, a bus, or another known architecture. The even EU output 404a handles the output of the even EUs 420a, 420c, 420e, and 420g, while the odd EU output 404b handles the output of the odd EUs 420b, 420d, 420f, and 420h. Together, the two EU outputs 404a and 404b receive the output of the EU pool 412, such as UVRQ and RGBA data. These outputs may be directed back to the L2 cache 408, output from the computational core 204 to the write-back unit 308 via J1 and J2, or output to the texture address generator 310 via K1 and K2.
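The even/odd split of the EU output can be expressed as a short Python sketch. It assumes that EU0 corresponds to 420a, EU1 to 420b, and so on, which follows the "EU0 420a" labeling in the description of FIG. 4; the function names are hypothetical.

```python
EU_NAMES = ["420a", "420b", "420c", "420d", "420e", "420f", "420g", "420h"]


def output_port(eu_index):
    """Even-numbered EUs drain through 404a, odd-numbered EUs through 404b."""
    return "404a" if eu_index % 2 == 0 else "404b"


def route_outputs():
    """Group the eight EUs of FIG. 4 by the output port that serves them."""
    ports = {"404a": [], "404b": []}
    for index, name in enumerate(EU_NAMES):
        ports[output_port(index)].append(name)
    return ports
```

Splitting the output path this way lets two EUs, one even and one odd, deliver results in the same cycle without contending for a single port.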

The execution unit flow of the EU pool 412 generally comprises several levels: a rendering-context level, a thread or task level, and an instruction or execution level. At any given time, each EU 420 may admit two rendering contexts, with a one-bit flag or other mechanism identifying the context. Context information is passed from the EU pool control unit 206 before tasks belonging to that context begin. Context-level information may include the shader type, the number of input/output registers, the instruction starting address, the output mapping table, vertex identifiers, and the constants in the respective constant buffers. Each EU 420 of the EU pool 412 can store multiple tasks or threads at the same time. In one embodiment, each thread fetches an instruction according to a program counter.

The EU pool control unit 206 acts as a global scheduler for tasks, using a data-driven approach (e.g., vertices, pixels, or geometry packets in the input) to assign appropriate threads in the EUs 420. For example, the EU pool control unit 206 assigns a thread to one of the empty thread slots in an EU 420 of the EU pool 412. Once a thread has started execution, data supplied by the vertex cache or another component or module (depending on the shader type) is placed in a common register buffer.

In general, the graphics processor 202 uses programmable vertex, geometry, and pixel shaders. Rather than implementing the functions of these components as separate fixed-function units with different designs and instruction sets, the operations are executed by the pool of execution units 420a, 420b, and so on with a unified instruction set. Except for the EU 420a (which includes the decoding system 200 and thus has additional functionality), each of the EUs 420 is identical in design and structure and can be configured for programmed operation. In one embodiment, each EU 420 is capable of multithreaded operation, and tasks generated by the vertex, geometry, and pixel shaders are sent to the individual EUs for execution.

In one embodiment, the decoding system 200 may be implemented using a vertex shader, with some differences from the other EUs 420. For example, the EU 420a, which embodies the decoding system 200, differs from the other EUs (e.g., 420b) in that the decoding system 200, which manages its own internal buffers, obtains its data from the memory access unit 406 via the EU input 402.

Once individual tasks have been generated, the EU pool control unit 206 assigns them to threads of the various EUs 420 and then manages the associated threads as the tasks complete. For instance, the EU pool control unit 206 assigns vertex shader tasks to threads of the EUs 420 and keeps track of the relevant tasks and threads. Specifically, the EU pool control unit 206 keeps a record of the thread and memory resources of all EUs 420 (not described further here). It knows which threads have been assigned tasks and are occupied, which threads have finished their tasks and been released, and how many common register file memory registers are in use.

Accordingly, when a task is assigned to an EU (e.g., 420a), the EU pool control unit 206 marks the thread as busy and deducts the registers required by that thread from the total common register file memory. The required amount depends on the vertex shader, geometry shader, or pixel shader involved, and each shader stage may have a different footprint. For example, a vertex shader thread may require 10 common register file registers, while a pixel shader thread may require only 5.

When a thread completes its assigned work, the EU 420 running the thread sends a signal to the EU pool control unit 206, which updates its records, marks the thread as free, and adds the released common register file space back to the available space. When all threads are busy or all of the common register file memory has been allocated (or too little register space remains to accommodate an additional thread), the EU 420 is considered full, and the EU pool control unit 206 assigns no new threads to it. Each thread can be managed or marked as in use (i.e., executing) or available.
In this regard, in one embodiment, when the vertex shading is performing the function of the decoding system 200, the set control unit 2〇 6 prevents geometry shaders and pixel shaders from running at the same time.
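The bookkeeping described above can be illustrated with a minimal sketch. It is not the patented implementation: the class name `EuResourceTable`, its methods, and the default sizes are hypothetical, chosen only to show how thread slots and register footprints might be tracked together for one execution unit.

```python
class EuResourceTable:
    """Hypothetical per-EU resource table: thread slots plus CRF registers."""

    def __init__(self, num_threads=32, crf_registers=128):
        # Both defaults are assumptions for illustration only.
        self.free_threads = set(range(num_threads))
        self.free_registers = crf_registers
        self.footprints = {}  # thread id -> registers held by that thread

    def assign(self, footprint):
        """Return a thread id, or None if the EU counts as full for this task."""
        if not self.free_threads or footprint > self.free_registers:
            return None
        thread_id = min(self.free_threads)
        self.free_threads.remove(thread_id)      # mark the thread busy
        self.free_registers -= footprint         # subtract its footprint
        self.footprints[thread_id] = footprint
        return thread_id

    def release(self, thread_id):
        """Called when the EU signals that the thread finished its task."""
        self.free_registers += self.footprints.pop(thread_id)
        self.free_threads.add(thread_id)


table = EuResourceTable(num_threads=2, crf_registers=12)
vertex_thread = table.assign(10)   # vertex shader thread: 10 registers
pixel_thread = table.assign(5)     # refused: only 2 registers remain
table.release(vertex_thread)
retry_thread = table.assign(5)     # now fits
```

In this sketch an execution unit is treated as full either when no thread slot is free or when the remaining register space is smaller than the requested footprint, mirroring the two conditions given in the text.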

Fig. 5A illustrates an execution unit 420a having the features of the graphics processor 202 and computational core 204 described above, and which includes an execution unit data path 512 in which the decoding system 200 is embedded. Specifically, Fig. 5A is a block diagram of execution unit 420a, which in one embodiment comprises an instruction cache controller 504, a thread controller 506 coupled to the instruction cache controller 504, a buffer 508 (e.g., a constant buffer), a common register file (CRF) 510, an execution unit data path (EUDP) 512 coupled to the thread controller 506, the buffer 508, and the CRF 510, an EUDP first-in first-out (FIFO) buffer 514, a predicate register file (PRF) 516, a scalar register file (SRF) 518, a data out controller 520, and a thread task interface 524. As described above, the execution unit 420a receives input from the execution unit input 402 and provides output to the execution unit output 404.

The thread controller 506 provides control functionality for the entire execution unit 420a, including management of each thread and decision-making functions such as determining how threads are to be executed. The EUDP 512 comprises the decoding system 200, performs various calculations, and includes logic such as floating-point and integer arithmetic logic units (ALUs) and shift logic.

The data out controller 520 moves completed data to components connected to the execution unit output 404, such as the vertex cache and the write back unit 308. The EUDP 512 passes "end of task" information to the data out controller 520 to notify it that a task is complete. The data out controller 520 comprises storage for completed tasks, organized as entries, and a plurality of write ports. It selects a task from storage, reads all output data items from the CRF 510 at the register locations specified by the shader rendering context, and sends the data to the execution unit output 404. The thread task interface 524 sends the identifiers of tasks completed in the execution unit 420a to the EU pool control unit 206. That is, the task identifier tells the EU pool control unit 206 that a thread resource in a particular execution unit is available for a new task assignment.

In one embodiment, the constant buffer 508 can be divided into 16 blocks, each block having 16 slots for 128-bit horizontal vector constants. A shader accesses a constant buffer slot using an operand and an index, where the index may be a register containing a 32-bit unsigned integer or an immediate 32-bit unsigned integer constant.

The instruction cache controller 504 is the interface block to the thread controller 506. When the thread controller issues a read request (e.g., to fetch executable shader code from instruction memory), the instruction cache controller 504 performs a hit/miss test by looking up a tag table (not shown). For example, a hit occurs when the requested instruction is in the cache of the instruction cache controller 504, and a miss occurs when the requested instruction must be fetched from the L2 cache 408 or from memory. On a hit, the instruction cache controller 504 grants the request provided there is no simultaneous request from the execution unit input 402, because the instruction cache has only a single read/write port and the execution unit input 402 has the highest priority. On a miss, the instruction cache controller 504 grants the request when there is a replaceable block in the L2 cache 408 and space is available in the EUDP FIFO 514. In one embodiment, the cache of the instruction cache controller 504 comprises 32 sets of four blocks each. Each block carries a 2-bit status signal representing one of three states: invalid, loading, or valid. A block is "invalid" before it is loaded with L2 data, "loading" while it waits for L2 data, and "valid" once the L2 data has been completely loaded.

The predicate register file 516 is read and written via the EUDP 512. The execution unit input 402 serves as the interface for data entering the execution unit 420a. In one embodiment, the execution unit input 402 comprises an 8-entry FIFO that buffers incoming data. The execution unit input 402 can also pass data to the instruction cache of the instruction cache controller 504 and to the constant buffer 508, and it maintains the shader contexts.

The execution unit output 404 serves as the interface for sending output data from the execution unit 420a to the EU pool control unit 206, the L2 cache 408, and the write back unit 308. In one embodiment, the execution unit output 404 comprises a 4-entry FIFO that receives arbitrated requests and buffers data destined for the EU pool control unit 206. The execution unit output 404 includes various functions, among them arbitration of instruction cache read requests, data out write requests, and EUDP read/write requests.

The common register file 510 stores input, output, and temporary data. In one embodiment, the CRF 510 comprises eight banks of a 128 x 128-bit register file with one-read, one-write (1R1W) ports and a one-read/write (1RW) port. The 1R1W ports are used by the EUDP 512 for read and write accesses initiated by instruction execution. Even-numbered threads share banks 0, 2, 4, and 6, and odd-numbered threads share banks 1, 3, 5, and 7. The thread controller 506 pairs instructions from different threads and ensures that there are no read or write bank conflicts in the CRF memory.

The 1RW port is used by the execution unit input 402 and the data out controller 520 to load the initial thread input data and to write the final thread output to the EU pool control unit data buffers, the L2 cache, or other modules. The execution unit input 402 and execution unit output 404 share this one read/write I/O port and, in one embodiment, writes have higher priority than reads. The 512 bits of input data go to four different banks to avoid conflicts when loading data into the CRF 510. A 2-bit channel index is passed along with the data and a 512-bit aligned base address to specify the starting bank of the input data. For example, if the starting channel index is 1, the first 128 bits counting from the least significant bit (LSB) are loaded into bank 1, the next 128 bits into bank 2, and so on, with the last 128 bits loaded into bank 0, assuming a thread-based bank offset of 0. Note that the two least significant bits of the thread ID are used to generate a bank offset that randomizes the starting bank location of each thread.

The CRF register index and the thread ID can be used to build a unique logical address for tag matching when reading and writing data in the CRF 510. For example, the address can be aligned to 128 bits, the width of a CRF bank. By combining the 8-bit CRF register index with the 5-bit thread ID, a unique 13-bit address can be formed. Each 1024-bit line has a tag, and each line holds two 512-bit entries (words). Each word is stored across four banks, and the two least significant bits of the CRF index are added to the bank offset of the current thread to form the bank selection.
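The bank and address arithmetic just described can be sketched as follows. This is an illustration under stated assumptions, not the actual hardware logic: the function names are hypothetical, the model covers only the four banks that one 512-bit word spans (it ignores the even/odd thread split across eight banks), and placing the thread ID in the upper bits of the 13-bit address is an assumed ordering that the text does not specify.

```python
NUM_BANKS = 4  # one 512-bit word spans four 128-bit banks


def load_bank_order(channel_index, thread_id=0):
    """Banks receiving the four 128-bit pieces of a 512-bit input, LSB piece first."""
    bank_offset = thread_id & 0b11  # two LSBs of the thread ID
    return [(channel_index + bank_offset + k) % NUM_BANKS for k in range(NUM_BANKS)]


def bank_select(crf_index, thread_id):
    """Two LSBs of the CRF index added to the thread's bank offset."""
    return ((crf_index & 0b11) + (thread_id & 0b11)) % NUM_BANKS


def logical_address(crf_index, thread_id):
    """13-bit tag address from an 8-bit CRF index and a 5-bit thread ID.

    Assumption: thread ID occupies the upper 5 bits.
    """
    if not (0 <= crf_index < 256 and 0 <= thread_id < 32):
        raise ValueError("index or thread id out of range")
    return (thread_id << 8) | crf_index


order = load_bank_order(channel_index=1, thread_id=0)
address = logical_address(crf_index=0x12, thread_id=3)
selected = bank_select(crf_index=0x12, thread_id=3)
```

With a starting channel index of 1 and a bank offset of 0, the sketch reproduces the order given in the text: banks 1, 2, 3, and finally 0.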

The tag matching scheme allows registers of different threads to share the common register file 510 and so makes efficient use of the memory. The EU pool control unit 206 keeps track of memory usage in the CRF 510 and ensures that enough space is available before a new task is scheduled on the execution unit 420a.

The destination CRF index of the current thread is checked against the total size of the CRF registers. Input data should be present in the CRF 510 before the thread controller 506 starts the thread and shader execution. When thread execution ends, the data out controller 520 reads the output data from the CRF 510.

The embodiments of the execution unit 420a described above have an EUDP 512 that contains the decoding system 200. Fig. 5B illustrates an embodiment of the EUDP 512. The EUDP 512 comprises a register file 526, a multiplexer 528, a vector floating-point (FP) unit 532, a vector integer arithmetic logic unit (ALU) 534, a special purpose unit 536, a multiplexer 538, a register file 540, and the decoding system 200. The decoding system 200 comprises one or more CAVLC units 530 and can decode one or more streams. For example, a single CAVLC unit 530 can decode a single stream, two CAVLC units 530 (one shown in dashed lines, with its connections omitted for brevity) can decode two streams simultaneously, and so on. For clarity, the description that follows addresses the operation of a decoding system 200 using a single CAVLC unit 530, with the understanding that the principles extend to more than one CAVLC unit.

As shown, the EUDP 512 comprises several parallel datapaths corresponding to the CAVLC unit 530, the vector floating-point unit 532, the vector ALU 534, and the special purpose unit 536, each of which performs a corresponding operation according to the instruction received.

The register file 526 receives the operands (labeled SRC1 and SRC2). In one embodiment, the register file 526 may be the common register file 510, the predicate register file 516, and/or the scalar register file 518 shown in Fig. 5A. Note that more operands may be used in some embodiments. An operation (function) signal line 542 provides the means by which each of the units 530 to 536 receives the operation signal. An immediate signal line 544, coupled to the multiplexer 528, carries an immediate value encoded in the instruction for use by each of the units 530 to 536 in integer operations on small integer values. An instruction decoder (not shown) supplies the operands, the operation (function) signals, and the immediate signals. The multiplexer 538 at the end of the datapath (which includes a writeback stage) selects the output of the correct datapath and sends it to the register file 540. The output register file 540 comprises a destination component, which may be the register file 526 or another register. Note that in one embodiment, where the source and destination registers comprise the same component, bits in the instruction carry source and destination selects that the multiplexers use to route data from or to the appropriate register file.

The execution unit 420a may therefore be viewed as a multi-stage pipeline (e.g., a 4-stage pipeline with four ALUs), with CAVLC decoding operations occurring within the four execution stages. Stalls are inserted as needed to let the CAVLC decoding thread proceed. For example, a stall can be introduced at the execution stage when the bitstream buffer underflows, while waiting for the context memory to be initialized, while waiting for the bitstream to be loaded into the FIFO buffer and the sREG register (explained later), and/or when the processing time has exceeded a predetermined threshold.

In some embodiments, the decoding system 200 decodes two bitstreams simultaneously using a single execution unit 420a. For example, based on an extended instruction set, the decoding system can use two datapaths (e.g., by adding another CAVLC unit 530) to decode two streams at once. More or fewer streams can also be decoded, in which case more or fewer datapaths are used. Certain embodiments of the decoding system 200 are not limited to simultaneous decoding when multiple streams are involved. In addition, in some embodiments a single CAVLC unit 530 can perform multiple simultaneous stream decoding.

In one embodiment, when the decoding system uses two datapaths, two threads can run at the same time. For example, in a two-stream decoding embodiment the number of threads is limited to two: the first thread (e.g., thread 0) is assigned to the first bank of the decoding system 200 (i.e., the CAVLC unit 530), and the second thread (e.g., thread 1) is assigned to the second bank (i.e., the CAVLC unit shown in dashed lines in Fig. 5B). In some embodiments, two or more threads may run on a single bank. Furthermore, although the decoding system is shown embedded in the EUDP 512, it may include other components, such as logic in the EU pool control unit 206.

Having described certain embodiments of the execution unit 420a, the EUDP 512, and the CAVLC unit 530, the decoding system is now briefly explained in the context of H.264 CAVLC operation. As is known, the CAVLC process encodes the level (magnitude) of a signal pertaining to a macroblock or a portion of one, together with how often that level is repeated (the run, e.g., for how many cycles), so that not every bit needs to be encoded. Such information is obtained from the bitstream buffer and parsed, and as the decoding engine of the decoding system 200 consumes the information in the buffer, the data is replenished. The decoding system 200 extracts from the bitstream the macroblock information containing the level and run coefficients, inverts the encoding process, and reconstructs the signal. Specifically, the decoding system 200 obtains macroblock information from the bitstream buffer, parses the stream to obtain the level and run coefficient values, and stores them temporarily in a level array and a run array. These arrays are then read out (e.g., as the 4x4 block of pixels of a block within the macroblock) and cleared in preparation for the next block. The whole macroblock can be built in software by processing each 4x4 block according to the H.264 standard.

The various components of the decoding system 200 are described below in the context of the decoding process, with variations for practical applications taken into account. Those skilled in the art will recognize that many of the terms used (e.g., parameter names) come from the H.264 specification and, for brevity, are not explained further except where doing so aids understanding of the processes and/or components.

Figs. 6A through 6C are block diagrams illustrating the decoding system 200, which is depicted with a single CAVLC unit 530 (in Figs. 6A through 6C, the CAVLC unit 530 and the decoding system 200 are used interchangeably). In this embodiment the decoding system 200 therefore decodes a single bitstream. The same principles apply to a decoding system 200 with multiple CAVLC units, which can decode multiple (e.g., two) streams simultaneously. Briefly, Fig. 6A shows select components of the CAVLC unit 530, Fig. 6B illustrates the stream buffer functionality provided by the CAVLC unit, Fig. 6C illustrates the context memory (including registers) functionality of the CAVLC unit 530, and Fig. 6D illustrates the table structure used for CAVLC decoding. Although the description below concerns macroblock decoding, the principles apply to decoding of other block types.

Referring to Fig. 6A, the CAVLC unit 530 comprises several hardware modules: a coefficient token (coeff_token) module 610, a level code (CAVLC_LevelCode) module 612, a level (CAVLC_Level) module 614, a level 0 (CAVLC_L0) module 616, a zero level (CAVLC_ZL) module 618, a run (CAVLC_Run) module 620, a level array (LevelArray) 622, and a run array (RunArray) 624. The decoding system also includes a shift register (SREG) stream buffer/direct memory access (DMA) engine 602 (see also Fig. 6B; hereafter also referred to as the DMA engine module), a global register 606, a local register 608, and the macroblock neighbor context (mbNeighCtx) memory 604 of Fig. 6C. In one embodiment, the mbNeighCtx memory comprises a 96-bit register, which may be three 32-bit registers written by the shader. Some additional registers are not shown.

The interface between the CAVLC unit 530 and the execution unit 420a comprises one or more destination buses with corresponding registers (e.g., DST registers) and two source buses with corresponding registers (SRC1, SRC2). Data on the destination bus can be sent, directly or indirectly (e.g., through an intermediate cache, register, buffer, or memory), to a video processing unit inside or outside the graphics processing unit 114. The data on the destination bus may be in Microsoft's DX API format or another format, and may include coefficients, macroblock parameters, motion information, and/or IPCM samples, among other data.

The CAVLC unit 530 also includes a memory interface made up of an address bus and a data bus. Using an address from the address bus, bitstream data is accessed by way of the data provided on the data bus. In one embodiment, the data on the data bus may include an unencrypted video stream comprising various signal parameters, among other data and formats. In some embodiments, load-store operations can be used to access the bitstream data.

Before the individual components of the CAVLC unit 530 are described, a brief overview is given of the overall operation of the execution unit 420a with respect to CAVLC decoding. In general, based on the slice type, the driver software 128 (Fig. 1) prepares a CAVLC shader and loads it into the execution unit 420a. The CAVLC shader decodes the bitstream using the standard instruction set plus the additional instructions coeff_token, CAVLC_LevelCode, CAVLC_Level, CAVLC_L0, CAVLC_ZL, and CAVLC_Run, each named here after the module to which it corresponds. Further instructions, READ_LEVEL_RUN and CLR_LEVEL_RUN, pertain to read and clear operations on the level array 622 and the run array 624. In one embodiment, the first instructions executed by the CAVLC shader, before any others are issued, are INIT_CAVLC and INIT_ADE. These two instructions, explained later, start the CAVLC unit 530 decoding a CAVLC bitstream and load the bitstream into a FIFO buffer from the point at which stream decoding begins. The CAVLC unit 530 thus provides bitstream parsing, initialization of the decoding hardware and register/memory structures, and level-run decoding. Each of these functions of the H.264 CAVLC decoding process is explained below, beginning with the operation of the bitstream buffer.
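Before turning to the bitstream buffer, the level-run step mentioned above can be made concrete with a short sketch. It is a software illustration only, not the hardware behavior of the LevelArray 622 and RunArray 624. It follows the convention of the H.264 residual syntax, in which coefficients are decoded highest frequency first and each run counts the zero coefficients that precede its level in scan order; the function name `expand_level_run` and the list-based arrays are assumptions made for the example.

```python
def expand_level_run(levels, runs, block_size=16):
    """Rebuild scan-order coefficients of one 4x4 block from level/run arrays.

    levels[i]: i-th decoded nonzero coefficient, highest frequency first.
    runs[i]:   number of zero coefficients preceding levels[i] in scan order.
    """
    if len(levels) != len(runs):
        raise ValueError("level and run arrays must have the same length")
    coeffs = [0] * block_size
    position = -1
    # Walk from the lowest-frequency coefficient (decoded last) upward.
    for level, run in zip(reversed(levels), reversed(runs)):
        position += run + 1
        if position >= block_size:
            raise ValueError("runs exceed the block size")
        coeffs[position] = level
    return coeffs


level_array = [1, -2, 3]
run_array = [1, 0, 2]
block = expand_level_run(level_array, run_array)

# Counterpart of clearing the arrays before the next block is decoded.
level_array.clear()
run_array.clear()
```

The result is a 16-entry list in scan order; mapping it onto the 4x4 raster positions would additionally require the zig-zag or field scan table from the H.264 specification, which is omitted here.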

關於解析位元流，從記憶體介面的資料匯流排接收位元流，然後由SREG串流緩衝器/DMA引擎618進行緩衝’切片資料解析階段提供位it流解碼，位^流（如ϋ 位元流）包括-張或多張圖，將其切割成圖槽頭㈣㈣及許多切片（slice)，-張切片通常包含—系列的巨圖塊，於 -實施例中’外部程序（即CAVLC單元53〇外部）解析 NAL低流、解碼切片播頭、傳送指向該切片資料（如切片開始處）的指標’通常，驅動軟體m從切片資料處理餘流’因為這是應用程式及API提供的功能，指向ς 貧料位置的指標傳遞還牽涉到切片資料的第一位元組位址 (如RBSPbyeAddress )和指出位元流開始或標頭位置如 sREGptr)的位補償指標（如—個位元或多個位元），元流的初始化將於難解釋，於某些實施财，可主處理器（如第中央處理單元126)處理外部程岸，以提供圖片解碼及切片標頭解碼，與某些實施例中，因或解碼系統200從圖片進行h.264位元流解析，而ca^ 解碼操作是根據切片資料從巨圖塊著手進行，於某些實施 30 200803526 例中’因為CAVLC單元的可程式特性，可以於任何階段進行解碼。請參閱第六B圖，其為CAVLC單元530的SREG串流緩衝器/DMA引擎602的選擇元件部分及其他元件之方塊圖’其包含運算元暫存器661及663，分別接收SRC1 與SRC2值，再傳遞至暫存器656及667，CAVLC邏輯電路660就是第六A圖的模組及元件，不過沒有包括SREg 串*緩衝态/DMA引擎602、mbNeighCtx記憶體604、總暫存器606、以及區域暫存器608, SREG串流緩衝器/ DMA引擎618包含内部位元流緩衝器6〇2b，於一實施例中可為BigEndian格式之3 2位元暫存器及8個！ 2 8位元暫存裔。驅動軟體128發出的初始化指令於開始時設定sreq 串概緩衝态/DMA引擎602, 一旦啟動，便自動管理sreG 串流緩衝器/DMA引擎602的内部緩衝器6〇2b，SREG串流缓衝器/DMA引擎602保留待解析位元的位置。於一實施例中，SREG串流緩衝器/DMA引擎6〇2使用兩個暫存器，一個快速32位元正反器與一個較慢512 或1024位元圮憶體’位元流會使用位元，移位暫存器6〇2a 从位元進行操作，而位元流缓衝器6〇2b以位元組進行操作，可以節省能源。通常移位暫存器6〇2&運算的指令會使 :少許位元（如1〜3位元），當移位暫存器⑽使用超過一位元_資料，f料（位元㈣段）將從位元流緩衝器 6咖傳送給移位暫存器·a，然後緩衝器指標會減去傳送的位元組數量，當SREG串流緩衝器/DMA引擎6〇2的 31 200803526 DMA引擎偵測到使用256位元或更多位元時，便從記憶體提取256位元填滿位元流緩衝器6〇2b，如此CAVLC單元 530實行了一個簡單的循環緩衝器（256位元片段X 4)，以追蹤位元流緩衝器6〇2b並進行填充，於某些實施例中可以使用單一缓衝器，不過一個循環緩衝器需要更複雜的指標計算來跟上記憶體的速度。利用初始化指令達成與内部緩衝器602b互動，稱為 INIT^BSTR指令，於一實施例中，INITjgSTR指令（可由驅動軟體128發出）與INIT一CAVLC (或一ADE)指令幾乎同時發出，形成延遲(stall)，直到位元流資料進入緩衝器 602b，一旦資料到達緩衝器6〇2b，解除延遲狀況開始後面的程序’之後，如果缓衝器的儲存狀況低於預定門檻，SREG 位元流緩衝器/DMA引擎602的DMA引擎會繼續提取位兀流貢料存入缓衝器602b。如果已知位元流位置的位元組位址及位元補償，INIT一BSTR指令將資料載入内部位元流緩衝裔602b，並開始管理程序，每一次呼叫處理切片資料均會發出下列格式之指令： INIT—BSTR offset, RBSPbyteAddress 這個指令用於將資料載入SREG串流缓衝器/DMA 引擎602的内部緩衝器602b，於一實施例中，SRC2暫存裔663提供位元組位址(RBSPbyteAddress)，而SRC1暫存态661提供位元補償，如此，可以使用下列通用之指令格式： INITJBSTR SRC2.SRC1, 32 200803526Regarding parsing the bit stream, the bit stream is received from the data bus of the memory interface, and then buffered by the SREG stream buffer/DMA engine 618. 
The slice data parsing stage provides bit it stream decoding, and the bit stream (such as the bit stream) The elementary stream) consists of - or more pictures, which are cut into a trough head (four) (four) and a number of slices. The - slice usually contains a series of giant tiles, in the embodiment - an external program (ie CAVLC cell) 53〇 External) parsing the NAL low stream, decoding the slice play header, and transmitting the indicator pointing to the slice data (such as the beginning of the slice) 'typically, the driver software m processes the residual stream from the slice data' because this is the function provided by the application and the API. , the indicator transfer to the 贫 poor material position also involves the first byte address of the slice data (such as RBSPbyeAddress) and the bit compensation indicator (such as the bit bit or indicating the start of the bit stream or the position of the header such as sREGptr) Multiple bits), the initialization of the meta-flow will be difficult to explain. In some implementations, the main processor (such as the central processing unit 126) processes the external path to provide image decoding and slice header decoding, and In the embodiment, the decoding system 200 performs h.264 bit stream parsing from the picture, and the ca^ decoding operation is performed according to the slice data from the giant tile. In some implementations 30 200803526, the example is because of the CAVLC unit. Program features that can be decoded at any stage. Please refer to FIG. 6B, which is a block diagram of the selection component part of the SREG stream buffer/DMA engine 602 of the CAVLC unit 530 and other elements. The arithmetic unit registers 661 and 663 respectively receive the SRC1 and SRC2 values. 
And then transferred to the registers 656 and 667, the CAVLC logic circuit 660 is the module and component of the sixth A picture, but does not include the SREg string * buffer state / DMA engine 602, mbNeighCtx memory 604, total register 606, And the area register 608, the SREG stream buffer/DMA engine 618 includes an internal bit stream buffer 6〇2b, which in one embodiment can be a 32-bit register and 8 in the BigEndian format! 2 8 yuan temporary. The initialization command issued by the driver software 128 initially sets the sreq string buffer state/DMA engine 602, and once started, automatically manages the internal buffer 6〇2b of the sreG stream buffer/DMA engine 602, the SREG stream buffer The /DMA engine 602 reserves the location of the bit to be resolved. In one embodiment, the SREG stream buffer/DMA engine 6〇2 uses two registers, a fast 32-bit flip-flop and a slower 512 or 1024-bit memory bit stream. The bit shifting register 6〇2a operates from the bit, and the bit stream buffer 6〇2b operates in a bit group, which saves energy. Usually the shift register 6〇2& operation will make: a few bits (such as 1~3 bits), when the shift register (10) uses more than one bit_data, f material (bit (four) segment The slave stream buffer 6 is transferred to the shift register·a, and then the buffer indicator is subtracted from the number of bytes transferred, when the SREG stream buffer/DMA engine 6〇2 31 200803526 DMA When the engine detects the use of 256 bits or more, it extracts 256 bits from the memory to fill the bit stream buffer 6〇2b, so the CAVLC unit 530 implements a simple circular buffer (256 bits). Fragment X 4), to track and fill the bitstream buffer 6〇2b, in some embodiments a single buffer can be used, but a circular buffer requires more complex index calculations to keep up with the speed of the memory. . 
Interaction with the internal buffer 602b is set up by an initialization instruction, referred to as the INIT_BSTR instruction. In one embodiment, the INIT_BSTR instruction (which can be issued by the driver software 128) is issued at almost the same time as the INIT_CAVLC (or ADE) instruction, which creates a stall until the bitstream data has entered the buffer 602b. Once the data arrives in the buffer 602b, the stall is released and processing continues. If the buffer fill level falls below a predetermined threshold, the DMA engine of the SREG-stream buffer/DMA engine 602 continues to fetch bitstream data into the buffer 602b. Given the byte address and bit offset of the bitstream location, the INIT_BSTR instruction loads the data into the internal bitstream buffer 602b and starts the process that manages it. For each call to process slice data, an instruction of the following format is issued:

INIT_BSTR offset, RBSPbyteAddress

This instruction loads data into the internal buffer 602b of the SREG-stream buffer/DMA engine 602. In one embodiment, the SRC2 register 663 provides the byte address (RBSPbyteAddress) and the SRC1 register 661 provides the bit offset, so the following general instruction format can be used:

INIT_BSTR SRC2, SRC1

In this instruction, SRC1 and SRC2 (as in the other instructions) correspond to the values in the internal registers 661 and 663, although they are not limited to these registers. In one embodiment, a 256-bit aligned memory fetch is used to access the bitstream data, which is written to the buffer registers and passed on to the 32-bit shift register 602a of the SREG-stream buffer/DMA engine 602. In one embodiment, the data in the bitstream buffer 602b is byte-aligned before these registers or buffers are operated on. The alignment can be carried out by an alignment instruction, referred to as the ABST instruction, which aligns the data in the bitstream buffer 602b. The alignment bits (e.g., stuffing bits) are eventually discarded during decoding. As the shift register 602a consumes data, the internal buffer 602b is refilled. In other words, the internal buffer 602b of the SREG-stream buffer/DMA engine 602 behaves like a circular buffer of modulo 3 that feeds the 32-bit register 602a of the SREG-stream buffer/DMA engine 602. The CAVLC unit 530 (e.g., the CAVLC logic 660) can use the READ instruction to read data from the shift register 602a. The format of the READ instruction is as follows:

READ DST, SRC1

where DST corresponds to an output or destination register. In one embodiment, the SRC1 register 661 contains an unsigned integer value n, and the READ instruction obtains n bits from the shift register 602a. When 256 bits of data have been consumed from the 32-bit shift register 602a (e.g., in decoding one or more syntax elements), a fetch is started automatically to obtain another 256 bits of data, which are written to the registers of the internal buffer 602b and subsequently enter the shift register 602a for use in the following cycles.
In some embodiments, if a predetermined number of bits or bytes of the data in the shift register 602a has been consumed in decoding a symbol and the internal buffer 602b has not received any further data, the CAVLC logic 660 can stall so that another thread can be executed (e.g., a thread unrelated to the CAVLC decoding process, such as a vertex shader operation). Using the DMA engine of the SREG-stream buffer/DMA engine 602 reduces the amount of buffering needed to hide memory latency (which in some graphics processing units can exceed three hundred cycles). As the bitstream is consumed, requests can be made to stream in the bitstream data that follows. If so little bitstream data remains that the bitstream buffer 602b is at risk of underflow (e.g., given the known number of cycles a signal takes to travel from the CAVLC unit 530 to the processor pipeline), a stall signal can be passed to the processor pipeline to pause operation until data arrives in the bitstream buffer 602b.

In addition, the SREG-stream buffer/DMA engine 602 inherently has the ability to handle corrupted bitstreams. For example, because of a bitstream error, the end-of-slice marker may go undetected. Such a failure could result in completely wrong decoding and in the consumption of bits that belong to later pictures or slices. The SREG-stream buffer/DMA engine 602 keeps track of the number of bits consumed. If the number of bits consumed exceeds a defined threshold (which can change for each slice), processing is terminated and an exception signal is sent to the processor (e.g., the host processor), which then executes code that attempts to recover from the error.

Two further instructions related to bitstream access are the INPSTR and INPTRB instructions. The INPSTR and INPTRB instructions are used to detect whether special patterns (e.g., data start or end patterns) are present in a slice or macroblock, and they allow the bitstream to be read without advancing it.

In one embodiment, the instruction sequence is INPSTR, INPTRB, and then READ. The INPSTR instruction has the following format:

INPSTR DST

In one embodiment, the instruction inspects the bitstream and returns the most significant 16 bits of the shift register 602a in the lower 16 bits of the destination (DST) register. The upper 16 bits of the destination register contain the value of sREGbitptr. No data is removed from the shift register 602a as a result of the operation. The instruction can be implemented according to the following example pseudocode:

MODULE INPSTR (DST)

    OUTPUT [31:0] DST
    DST = {ZE(sREGbitptr), sREG[msb : msb-15]};

ENDMODULE

Another instruction related to the bitstream is the INPTRB instruction, which inspects the raw byte sequence payload (RBSP) trailing bits (e.g., of the byte-aligned bitstream). The INPTRB instruction reads the bitstream buffer 602b and can have the following format:

INPTRB DST

In the INPTRB operation, no bits are removed from the shift register 602a. If the most significant bits of the shift register 602a contain, as a non-limiting example, 100, then the RBSP stop bit is included and the remaining bits of the byte are alignment zero bits. The instruction can be implemented according to the following example pseudocode:

MODULE INPTRB (DST)
    OUTPUT DST;
    REG [7:0] P;
    P = sREG[msb : msb-7];

    sp = sREGbitptr;
    T[7:0] = (P >> sp) << sp;
    DST[1] = (T == 0x80) ? 1 : 0;

    DST[0] = !(CVLC_bufferBytesRemaining > 0);
ENDMODULE

The READ instruction is used to align the data in the bitstream buffer 602b. Having described the bitstream buffer operations of the CAVLC unit 530, attention now turns to the initialization of CAVLC operations, and in particular to the initialization of the memory, the register structures, and the decoding engine (e.g., the CAVLC logic 660).
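Returning briefly to the two peek instructions, the INPSTR and INPTRB pseudocode above can be mirrored in a few lines of Python. This is only a sketch: the function names are hypothetical, the shift register is assumed to be 32 bits wide (so msb is bit 31), and the register values are passed in as plain integers instead of being read from hardware state.

```python
def inpstr(sreg, sreg_bitptr):
    """Model of INPSTR: upper half = zero-extended sREGbitptr,
    lower half = the 16 most significant bits of the shift register."""
    top16 = (sreg >> 16) & 0xFFFF
    return ((sreg_bitptr & 0xFFFF) << 16) | top16


def inptrb(sreg, sreg_bitptr, bytes_remaining):
    """Model of INPTRB: returns a 2-bit result.

    bit 1: the top byte, with its low `sreg_bitptr` bits cleared, equals 0x80
           (a stop bit followed only by zero bits)
    bit 0: no bytes remain in the buffer
    """
    p = (sreg >> 24) & 0xFF                      # P = sREG[msb : msb-7]
    t = ((p >> sreg_bitptr) << sreg_bitptr) & 0xFF
    stop_bit_seen = 1 if t == 0x80 else 0
    buffer_empty = 0 if bytes_remaining > 0 else 1
    return (stop_bit_seen << 1) | buffer_empty
```

Neither function modifies its inputs, which reflects the statement that these instructions inspect the shift register without removing bits from it.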
At the start of a slice, before the syntax elements corresponding to the first macroblock are decoded, the register structures, the global register 606, the local register 608, and the CAVLC decoding engine are initialized. In one embodiment, the driver software 128 issues an INIT_CAVLC instruction to perform this initialization. The INIT_CAVLC instruction can have the following format:

INIT_CAVLC SRC2, SRC1

where SRC2 contains the number of bits of slice data to be decoded, a value that is written to the internal CVLC_bufferBytesRemaining register, and SRC1 is laid out as follows:

SRC1[15:0] = mbAddrCurr
SRC1[23:16] = mbPerLine
SRC1[24] = constrained_intra_pred_flag
SRC1[27:25] = NAL_unit_type (NUT)
SRC1[29:28] = chroma_format_idc (in one embodiment, a chroma_format_idc value of 1 corresponds to the 4:2:0 format; other embodiments can use other sampling schemes)
SRC1[31:30] = undefined

With the INIT_CAVLC instruction, the SRC1 value is written to the corresponding fields of the global register 606, and the SRC2 value is written to an internal register (e.g., CVLC_bufferBytesRemaining). The CVLC_bufferBytesRemaining register is used to recover from a corrupted bitstream. For example, when decoding starts, the CAVLC unit 530 (e.g., the SREG-stream buffer/DMA engine 602) records, for a slice, information about the buffered bits in the bitstream. As the bitstream is consumed, the CAVLC unit 530 counts and updates the CVLC_bufferBytesRemaining value. If this value falls below zero, the buffer or the bitstream is corrupted; processing is then terminated promptly and control returns to the application or to the driver software 128 for recovery.

Referring to FIG. 6C, the INIT_CAVLC instruction can also initialize the storage structures of the CAVLC unit 530, such as the mbNeighCtx memory 604, the left mbNeighCtx register 684, and the current mbNeighCtx register 686.
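To make the SRC1 operand layout of the INIT_CAVLC instruction concrete, the following Python sketch unpacks the documented bit fields from a 32-bit word. The function name and the dictionary keys are illustrative choices, not part of the instruction set; only the bit positions are taken from the field list above, and bits above 29 are ignored.

```python
def decode_init_cavlc_src1(src1):
    """Split the 32-bit SRC1 operand of INIT_CAVLC into its named fields."""
    return {
        "mbAddrCurr": src1 & 0xFFFF,                        # bits 15:0
        "mbPerLine": (src1 >> 16) & 0xFF,                   # bits 23:16
        "constrained_intra_pred_flag": (src1 >> 24) & 0x1,  # bit 24
        "nal_unit_type": (src1 >> 25) & 0x7,                # bits 27:25
        "chroma_format_idc": (src1 >> 28) & 0x3,            # bits 29:28
    }
```

A driver-side helper would perform the inverse packing before issuing the instruction. Keeping the layout in one place like this makes it easier to check that the decoder and the driver agree on field positions.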
In one embodiment, the macroblock-based neighborhood context memory of the mbNeighCtx memory 604 is arranged as a memory array that stores data relating to a row of macroblocks. The current mbNeighCtx register 686 stores the macroblock currently being decoded, and the left mbNeighCtx register 684 stores the previously decoded (left) macroblock. In addition, a top pointer 683, a left pointer 685, and a current pointer 687 (shown as arrows in FIG. 6C) point to the mbNeighCtx memory 604, the left mbNeighCtx register 684, and the current mbNeighCtx register 686. While the current macroblock is being decoded, the decoded data is stored in the current mbNeighCtx register 686. Given the contextual nature of the decoding, the CAVLC_TOTC instruction decodes the current macroblock using information gathered from previously decoded macroblocks: the left macroblock, which is stored in the left mbNeighCtx register 684 and pointed to by the left pointer 685, and the top macroblock, which is stored in array element [i] 681 and pointed to by the top pointer 683.

The INIT_CAVLC instruction is used to initialize the top and left pointers 683 and 685 that pertain to the macroblocks adjacent to the current macroblock (e.g., elements of the array of the mbNeighCtx memory 604).
For example, the left pointer 685 can be set to 0 and the top pointer 683 can be set to 1. In addition, the INIT_CAVLC instruction updates the global register 606. In one embodiment, the mbNeighCtx memory 604 comprises an array of 120 elements, labeled mbNeighCtx[0], mbNeighCtx[1], ..., mbNeighCtx[119], so that up to 120 macroblocks can be stored per picture width (HDTV being 1920 x 1080 pixels). Those skilled in the art can use other array structures of different sizes.

For example, to determine whether an adjacent macroblock (e.g., the left macroblock) is present (i.e., valid), the CAVLC_TOTC instruction performs an operation (e.g., mbCurrAddr % mbPerLine) and checks whether the result is zero. In one embodiment, the following is computed:

a = mbCurrAddr % mbPerLine = mbCurrAddr - floor(mbCurrAddr / mbPerLine) x mbPerLine

where mbCurrAddr is the location of the current macroblock corresponding to the symbols to be decoded and mbPerLine is the number of macroblocks per row. The computation above uses one division, one multiplication, and one subtraction. Consider further:

mbCurrAddr ∈ [0 : maxMB - 1]

where maxMB is 8192 and mbPerLine = 120. The division can be performed as a multiplication by (1/mbPerLine), which is looked up in a table stored in on-chip memory (e.g., a 120 x 11-bit table). If mbCurrAddr is 13 bits wide, a 13 x 11 multiplier is used. In one embodiment, the result of the multiplication is rounded and the upper 13 bits are kept; a 13 x 7 multiplication is then performed and the lower 13 bits are kept; finally, a 13-bit subtraction determines "a". The whole sequence takes 2 cycles, and the result can be stored for use by other operations. It is computed once each time mbCurrAddr changes.

In some embodiments the modulo operation is not performed. Instead, shader logic in the execution units (e.g., execution units 420a, 420b, etc.) supplies the first mbAddrCurr value, aligned with the first line of the slice. For example, the shader logic can perform the following computation:

mbAddrCurr = absoluteMbAddrCurr - n x mbPerLine

The CWRITE instruction can be used to "move" the contents of the mbNeighCtx memory 604. The format of the CWRITE instruction can be:

CWRITE SRC1

where SRC1[15:0] = mbAddrCurr. The CWRITE instruction copies the appropriate fields of the current mbNeighCtx register 686 to the top mbNeighCtx[i] and the left mbNeighCtx[i-1] of the mbNeighCtx[] structure 604. If (mbAddrCurr % mbPerLine) == 0, the left mbNeighCtxLeft 684 is marked as unavailable (e.g., initialized to zero). The CWRITE instruction can be used to "move" the contents of the mbNeighCtx memory 604, the local register 608, and the global register 606. For example, the CWRITE instruction moves the relevant contents of the mbNeighCtx memory 604 to the left and top blocks of the i-th macroblock (e.g., mbNeighCtx[i], or the current macroblock) and clears the mbNeighCtx register 686. As noted above, two pointers are associated with the mbNeighCtx memory 604: the left pointer 685 and the top pointer 683. After a CWRITE instruction, the top index is incremented by 1 and the contents of the current macroblock are moved to the top position and the left position of the array 604. This arrangement reduces the number of read/write ports on the memory array to a single read/write port.

The INSERT instruction can be used to update the contents of the mbNeighCtx memory 604, the local register 608, and the global register 606. The format of the INSERT instruction can be:

INSERT DST, #Imm, SRC1

In this INSERT instruction, #Imm is a 10-bit number whose lower 5 bits give the width of the data and whose upper 5 bits specify the position at which the data is inserted. The input parameters are processed as follows:

Mask = NOT(0xFFFFFFFF << #Imm[4:0])

Data = SRC1 & Mask
SDATA = Data << #Imm[9:5]
SMask = Mask << #Imm[9:5]

The output DST can be expressed as:

DST = (DST & NOT(SMask)) | SDATA

For example, an INSERT instruction (e.g., INSERT $mbNeighCtxCurrent_1, #Imm10, SRC1) can be used to write to the current macroblock. This operation does not affect the left pointer 685 or the top pointer 683 (i.e., only the current position is written). The INSERT instruction can write to the current mbNeighCtx 686. The array element pointed to by the left pointer 685 is the same as the element adjacent to the current mbNeighCtx element (i.e., mbNeighCtx[i-1]). When a CWRITE instruction is issued, all or part of the current mbNeighCtx structure is copied to the elements pointed to by the left pointer 685 and the top pointer 683, and the top pointer is incremented by 1 (e.g., modulo the number of macroblocks per row). At the same time as the copy operation (or after it), the current mbNeighCtx array element is cleared to zero.
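The masking performed by the INSERT instruction can be checked with a short Python sketch that follows the Mask, Data, SDATA, and SMask expressions step by step. The function name insert_field is hypothetical, and the explicit masking with 0xFFFFFFFF is only needed because Python integers are unbounded, whereas the registers are assumed to be 32 bits wide.

```python
M32 = 0xFFFFFFFF  # Python ints are unbounded, so results are clamped to 32 bits


def insert_field(dst, imm, src1):
    """Insert the low `width` bits of src1 into dst at bit position `pos`.

    imm[4:0] is the field width and imm[9:5] is the bit position,
    as in the INSERT operand description.
    """
    width = imm & 0x1F
    pos = (imm >> 5) & 0x1F
    mask = ~(M32 << width) & M32        # Mask  = NOT(0xFFFFFFFF << #Imm[4:0])
    data = src1 & mask                  # Data  = SRC1 & Mask
    sdata = (data << pos) & M32         # SDATA = Data << #Imm[9:5]
    smask = (mask << pos) & M32         # SMask = Mask << #Imm[9:5]
    return (dst & ~smask & M32) | sdata  # DST = (DST & NOT(SMask)) | SDATA
```

For example, inserting the 8-bit value 0xAB at bit 16 of 0x12345678 uses imm = (16 << 5) | 8 and replaces only the byte 0x34, leaving the rest of the destination register untouched.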
The data structure held in the mbNeighCtx memory 604 is as follows:

mbNeighCtxCurrent[01:00] : 2'b : mbType
mbNeighCtxCurrent[65:02] : 4'b : TC[16]
mbNeighCtxCurrent[81:66] : 4'b : TCC[cb][4]
mbNeighCtxCurrent[97:82] : 4'b : TCC[cr][4]

When the CWRITE instruction is executed, the mbNeighCtx[] neighborhood data is updated and the current mbNeighCtx 686 is initialized.
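Returning to the left-neighbor availability test, the idea of replacing the division in mbCurrAddr % mbPerLine by a multiplication with a tabulated reciprocal can be sketched in Python as follows. Note the assumptions: the names are hypothetical, and the sketch uses a 24-bit fixed-point reciprocal, chosen here because it gives exact results over the whole stated range (addresses below 8192, up to 120 macroblocks per row). It does not reproduce the 11-bit table and the 13 x 11 and 13 x 7 multiplier widths described in the text.

```python
SHIFT = 24              # fixed-point precision of the reciprocal (an assumption)
MAX_MB_PER_LINE = 120   # widest row considered, as in the text
MAX_MB = 8192           # mbCurrAddr lies in [0, MAX_MB - 1]

# RECIP[d] approximates 2**SHIFT / d, rounded up so that the quotient never
# comes out one too small. Index 0 is unused.
RECIP = [0] + [(1 << SHIFT) // d + 1 for d in range(1, MAX_MB_PER_LINE + 1)]


def mb_column(mb_curr_addr, mb_per_line):
    """Return mb_curr_addr % mb_per_line without a division at run time."""
    quotient = (mb_curr_addr * RECIP[mb_per_line]) >> SHIFT  # one multiply
    return mb_curr_addr - quotient * mb_per_line             # multiply, subtract


def left_neighbor_available(mb_curr_addr, mb_per_line):
    """The left macroblock exists unless the current one starts a row."""
    return mb_column(mb_curr_addr, mb_per_line) != 0
```

The quotient is exact here because the rounding error of the reciprocal, scaled by the largest address, stays below 1/mbPerLine at this precision. A narrower table, such as the 11-bit one mentioned in the text, would need its own error analysis or a correction step.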

Having described the context memory structures used by the CAVLC unit 530, the following explains how the CAVLC unit 530, and the CAVLC_TOTC instruction in particular, uses neighboring context information to calculate TotalCoeff (TC), which determines which CAVLC table is used to decode a symbol. CAVLC decoding generally uses the variable length decoding tables of the H.264 specification (referred to hereinafter as the CAVLC tables), where a CAVLC table is selected according to the context of previously decoded symbols. Each decoded symbol may therefore use a different CAVLC table. FIG. 6D shows a basic table structure in the form of a variable-size 2D array: an array of "tables" is provided (each table corresponding to a particular symbol), and each symbol is Huffman coded. The Huffman codes are stored in the following table structure:

struct Table {
    unsigned head;
    struct table {
        unsigned val;
        unsigned shv;
    } table[];
} Table[];

The following describes the matching method (the MatchVLC function), which is based on prefix coding. CAVLC tables generally divide into a variable-length part and a fixed-length part, so matching can be simplified by using a fixed-size index lookup. In the MatchVLC function, a READ operation is performed that does not remove bits from the shift register 602a. This READ operation differs from the READ instruction described earlier (for the bitstream buffer 602b), which does advance the bitstream. In the MatchVLC function, a number of bits (fixL) is copied from the bitstream buffer 602b and then looked up in the specified table. Each entry in the specified table consists of a doublet (e.g., a value and a number of bits), and the number of bits is used to advance the bitstream.


FUNCTION MatchVLC(Table, maxIdx)
    INPUT Table;
    INPUT maxIdx;
    Idx1 = CLZ(sREG); //count number of leading zeros
    Idx1 = (Idx1 > maxIdx) ? maxIdx : Idx1;
    fixL = Table[Idx1].head;
    SHL(sREG, Idx1+#1); //shift buffer Idx1+1 bits left
    Idx2 = (!fixL) ? 0 : READ(fixL);
    (val, shv) = Table[Idx1][Idx2];
    SHL(sREG, shv);
    return val;
ENDFUNCTION

FIG. 6D is a block diagram of an example 2D array of the table structure described above, used to explain the MatchVLC function in the context of CAVLC decoding. The example is Table 9-5 of the H.264 specification (nC = -1):

Coeff_token   TrailingOnes   TotalCoeff   Head   Value   Shift
1             1              1            0      33      0
01            0              0            0      0       0
001           2              2            0      66      0
000100        0              2            2      2       2
000101        3              3            2      99      2
000110        1              2            2      34      2
000111        0              1            2      1       2
000010        0              4            1      4       1
000011        0              3            1      3       1
0000010       2              3            1      67      1
0000011       1              3            1      35      1
00000010      2              4            1      68      1
00000011      1              4            1      36      1
0000000       3              4            0      100     0

In pseudocode, this table can be expressed as:

Table9-5[8] = {
    0, {{33,0}},
    0, {{0,0}},
    0, {{66,0}},
    2, {{2,2}, {99,2}, {34,2}, {1,2}},
    1, {{4,1}, {3,1}},
    1, {{67,1}, {35,1}},
    1, {{68,1}, {36,1}},
    0, {{100,0}}
};

The pseudocode above corresponds to the 2D table of FIG. 6D. With this table structure, the MatchVLC function described above can be used for CAVLC decoding. The MatchVLC function counts the number of leading zeros (CLZ) in the bitstream, starting from the most significant bit, in order to access the table entry for a given syntax element. In addition, the MatchVLC function implements a parameterized CLZ operation: when the CLZ value is greater than maxIdx, maxIdx is returned (this handles the code 0000000 in the table of FIG. 6D).
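The table-driven matching described above can be tried out with the following Python sketch, which decodes one coeff_token from a string of '0' and '1' characters using the Table9-5 contents listed above. Several points are assumptions made for illustration and are not stated in the text: the function names are hypothetical; the bit string stands in for the shift register; the all-zero code 0000000 is treated as having no terminating 1 bit, so only maxIdx bits are consumed in that case (the pseudocode shifts by Idx1+1); and the packing of Value as TrailingOnes x 32 + TotalCoeff is inferred from the table entries (e.g., 33 for (1, 1) and 99 for (3, 3)).

```python
# One row per (clamped) leading-zero count: (head, [(value, shift), ...]).
TABLE_9_5_NC_MINUS1 = [
    (0, [(33, 0)]),
    (0, [(0, 0)]),
    (0, [(66, 0)]),
    (2, [(2, 2), (99, 2), (34, 2), (1, 2)]),
    (1, [(4, 1), (3, 1)]),
    (1, [(67, 1), (35, 1)]),
    (1, [(68, 1), (36, 1)]),
    (0, [(100, 0)]),
]


def match_vlc(bits, table):
    """Decode one code word from `bits`; return (value, bits_consumed).

    `bits` must contain at least one complete code word.
    """
    max_idx = len(table) - 1
    clz = len(bits) - len(bits.lstrip("0"))   # count leading zeros
    idx1 = min(clz, max_idx)                  # parameterized (clamped) CLZ
    # Assumption: the all-zero escape prefix has no terminating '1' bit.
    used = idx1 if clz >= max_idx else idx1 + 1
    head, entries = table[idx1]
    # Peek `head` fixed-length bits to select the entry within the row.
    idx2 = int(bits[used:used + head], 2) if head else 0
    value, shift = entries[idx2]
    return value, used + shift


def unpack_coeff_token(value):
    """Split a table value into (TrailingOnes, TotalCoeff); inferred packing."""
    return value >> 5, value & 0x1F
```

For instance, the input "000101" has three leading zeros, selects the row with head = 2, peeks the bits 01, and returns the value 99, which unpacks to TrailingOnes = 3 and TotalCoeff = 3. A single leading-zero count plus one fixed-width lookup is enough for every code in the table, which is the simplification the table structure is designed to provide.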
Another benefit of the MatchVLC function and the table structure is that multiple instructions are not needed to handle these cases. The leading zeros are counted and clamped by the following MatchVLC fragment: Idx1 = CLZ(sREG); //count number of leading zeros, and Idx1 = (Idx1 > maxIdx) ? maxIdx : Idx1. The bits already used are removed by the following MatchVLC fragment: SHL(sREG, Idx1+#1); //shift buffer Idx1+1 bits left. The sub-array header is read by the following MatchVLC fragment: fixL = Table[Idx1].head, and Idx2 = (!fixL) ? 0 : READ(fixL). The number of leading zeros may be the same while the size of the trailing bits differs. In one embodiment, a CASEX-type case statement can be used (which uses more memory but gives a simpler code structure). The actual value is obtained from the table with (val, shv) = Table[Idx1][Idx2] and SHL(sREG, shv), which also yields the actual number of bits used by the syntax element. These bits are removed from the bitstream, and the value of the syntax element is returned in the destination register.

Having described bitstream parsing, the initialization of the decoding engine and memory structures, and the VLC matching method and table structure, the description now returns to FIG. 6A and the CAVLC decoding engine (e.g., the CAVLC logic 660) and its process. Once the bitstream has been loaded and the decoding engine, memory structures, and registers have been initialized, the driver software 128 issues the CAVLC_TOTC instruction to enable the coeff_token module 610. The format of the CAVLC_TOTC instruction can be:

CAVLC_TOTC DST, S1

where S1 and DST are an input register and an internal output register, respectively, with the following format:

SRC1[3:0] = blkIdx
SRC1[18:16] = blkCat
SRC1[24] = iCbCr

The remaining bits are undefined. The output format is as follows:

DST[31:16] = TrailingOnes
DST[15:0] = TotalCoeff

The coeff_token module 610 therefore receives information corresponding to mbType, an indication of whether a chroma channel is being processed (e.g., iCbCr), and blkIdx (e.g., the block index, since a picture may be divided into many blocks). For a macroblock accessed from the bitstream buffer 602b, blkIdx indicates whether an 8x8 pixel block or a 4x4 pixel block is being processed at a given location. This information is provided by the driver software. The coeff_token module 610 includes a look-up table, and based on the inputs described above, the trailing ones (TrailingOnes) and total coefficients (TotalCoeff) are obtained from this look-up table. TrailingOnes indicates how many 1s occur in a row, and TotalCoeff indicates how many run/level coefficient pairs are in the chunk of data pulled from the bitstream.

TrailingOnes and TotalCoeff are input to the CAVLC_Level module 614 and the CAVLC_ZL module 618, respectively. TrailingOnes is also input to the CAVLC_L0 module 616, which corresponds to the first level (e.g., the DC value) extracted from the bitstream buffer 602b. The CAVLC_Level module 614 tracks the suffix length of the symbol (e.g., the number of trailing ones) and, in combination with levelCode, calculates the level value (level[Idx]) that is stored in the level array 622 and the run array 624. The CAVLC_Level module 614 operates under the CAVLC_LVL instruction, which has the following format: CAVLC_LVL DST, S2, S1, where:
S1 = Idx (16-bit),
S2 = suffixLength (16-bit), and
DST = suffixLength (16-bit).
suffixLength conveys the code word length. Input from the driver software 128 provides the information that specifies suffixLength. In one embodiment, because the suffixLength value is updated, DST and S2 can be the same register. Forwarding registers (which hold data generated internally by a given module) can also be used here, such as F1 665 and F2 667 in Figure 6B. Whether an instruction and its corresponding module use a forwarding register is indicated by forwarding flags in the instruction. The symbols for the forwarding registers are F1 (use the value of forwarding source 1, which in one embodiment is indicated by bit 26 of the instruction) and F2 (use the value of forwarding source 2, which in one embodiment is indicated by bit 27 of the instruction). When forwarding registers are used, the CAVLC_LVL instruction can take the following exemplary format: CAVLC_LVL.F1.F2 DST, SRC2, SRC1, where, if F1 or F2 is set to 1, the specified forwarding source is taken as the input. Forwarding register F1 corresponds to the level index (level[Idx]) generated by the CAVLC_Level module 614, which passes through an increment module before being input to multiplexer 630. Forwarding register F2 corresponds to the suffixLength generated by the CAVLC_Level module 614 and is input to multiplexer 628. The other inputs to multiplexer 630 and multiplexer 628 are EU register inputs (labeled EU in Figure 6A), as explained below. The CAVLC_Level module 614 has another input, levelCode, which is provided by the CAVLC_LevelCode module 612. The CAVLC_LevelCode module 612 and the CAVLC_Level module 614 operate together to decode the level value (the transform coefficient value before scaling). The instruction that enables the CAVLC_LevelCode module 612 has the following format:

CAVLC_LC SRC1, where SRC1 = suffixLength (16-bit). If forwarding register F1 665 is used, the instruction is expressed as follows: CAVLC_LVL.F1 SRC1. If F1 is set, the forwarded SRC1 is used as the input. Referring to Figure 6A, if F1 is set (e.g., F1 = 1), the CAVLC_LevelCode module 612 uses the forwarded SRC1 value (e.g., the suffixLength from the CAVLC_Level module 614) as its input; otherwise (e.g., F1 = 0), the value of the EU register is used as the input. Returning to the CAVLC_Level module 614, the suffixLength input can be forwarded from the CAVLC_Level module 614 via multiplexer 628, or provided to multiplexer 628 from an EU register. Likewise, the Idx input can be forwarded from the CAVLC_Level module 614 via multiplexer 630 (incremented, or auto-incremented, by the increment module), or provided to multiplexer 630 from an EU register. The CAVLC_Level module 614 also receives the levelCode input directly from the CAVLC_LevelCode module 612. In addition to the outputs passed to the forwarding registers, the CAVLC_Level module 614 provides a level index (level[Idx]) output to the level array 622.
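The operand selection performed by the two multiplexers can be sketched in software. This is a minimal model under stated assumptions: the flag positions (bits 26 and 27) follow the embodiment mentioned in the text, while the function name `select_lvl_inputs` and its argument names are hypothetical and chosen only for illustration.

```python
from typing import Tuple

F1_BIT = 26  # forwarding source 1 flag (bit position from one embodiment in the text)
F2_BIT = 27  # forwarding source 2 flag (bit position from one embodiment in the text)


def select_lvl_inputs(instr: int,
                      eu_idx: int, eu_suffix_len: int,
                      fwd_idx: int, fwd_suffix_len: int) -> Tuple[int, int]:
    """Model the two input multiplexers of the level module.

    If F1 is set, the forwarded level index is used after passing through
    the incrementer; otherwise the EU register value is used.
    If F2 is set, the forwarded suffixLength is used; otherwise the EU
    register value is used.
    """
    use_f1 = (instr >> F1_BIT) & 1
    use_f2 = (instr >> F2_BIT) & 1
    idx = fwd_idx + 1 if use_f1 else eu_idx
    suffix_len = fwd_suffix_len if use_f2 else eu_suffix_len
    return idx, suffix_len


# First iteration: operands come from EU registers (no flags set).
first = select_lvl_inputs(0, eu_idx=3, eu_suffix_len=1, fwd_idx=0, fwd_suffix_len=0)
# Later iteration: both flags set, so internally held values are reused.
later = select_lvl_inputs((1 << F1_BIT) | (1 << F2_BIT),
                          eu_idx=3, eu_suffix_len=1, fwd_idx=3, fwd_suffix_len=2)
```

The point of forwarding, as the text describes it, is that values produced inside the unit can feed the next instruction without a round trip through the EU register file.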
As previously mentioned, the TrailingOnes output (e.g., the DC value) is provided to the CAVLC_L0 module 616, which is enabled by the following instruction: CAVLC_LVL0 SRC, where SRC = trailingOnes(coeff_token). The output of the CAVLC_L0 module 616 includes a level index (Level[Idx]) that is output to the level array 622. Coefficient values are coded as a sign and a magnitude. The CAVLC_L0 module 616 provides the sign of a coefficient; the magnitude provided by the CAVLC_Level module 614 is combined with the sign provided by the CAVLC_L0 module 616 and written to the level array 622 at the location specified by the level index (level[Idx]). In one embodiment, each sub-block of coefficients is a 4x4 matrix (a block is 8x8) that is not yet in raster order; the array is later converted into a 4x4 matrix. In other words, the decoded coefficient levels and runs are not in raster format. From the level-run data, a 4x4 matrix can be reconstructed (though in zigzag scan order) and then reordered into a 4x4 matrix in raster order. The TotalCoeff output of the coeff_token module 610 is provided to the CAVLC_ZL module 618, which is enabled by the following instruction: CAVLC_ZL DST, SRC1, where SRC1 = maxNumCoeff (16-bit) and DST = ZerosLeft (16-bit). maxNumCoeff (per the H.264 standard) is a source value of the instruction; in other words, maxNumCoeff is set by software. In some embodiments, maxNumCoeff is stored in hardware.
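The zigzag-to-raster reordering mentioned above can be sketched as follows. The text does not give the scan table; the table below is the 4x4 frame zigzag order from the H.264 standard and is an assumption about which scan applies. The function name `zigzag_to_raster` is hypothetical.

```python
from typing import List

# Raster index visited at each zigzag scan position (H.264 4x4 frame scan;
# assumed here, since the text only says "zigzag scan order").
ZIGZAG_4X4 = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]


def zigzag_to_raster(coeffs: List[int]) -> List[List[int]]:
    """Place 16 coefficients given in zigzag order into a 4x4 raster matrix."""
    if len(coeffs) != 16:
        raise ValueError("expected exactly 16 coefficients")
    block = [0] * 16
    for scan_pos, raster_pos in enumerate(ZIGZAG_4X4):
        block[raster_pos] = coeffs[scan_pos]
    # Split the flat raster array into four rows of four.
    return [block[row * 4:(row + 1) * 4] for row in range(4)]


# Feeding the scan positions themselves shows where each one lands.
matrix = zigzag_to_raster(list(range(16)))
```

Field-coded content uses a different scan table in H.264, so a real decoder would select the table per macroblock rather than hard-code one.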
Transform coefficients are coded as (level, run) coefficient pairs, where the run represents the number of coefficients (levels) coded as zero. The CAVLC_ZL module 618 provides two outputs, ZerosLeft and Reset (reset = 0), to multiplexers 640 and 642. Multiplexer 640 also receives forwarding register F2 from the CAVLC_Run module 620, and multiplexer 642 receives the incremented value (via an increment module) of forwarding register F1 from the CAVLC_Run module 620. The CAVLC_Run module 620 receives the ZerosLeft and Idx inputs from multiplexers 640 and 642, respectively, and outputs a run index (Run[Idx]) to the run array 624. As noted above, the coefficients are coded as (level, run) pairs because run-length coding is used for further compression. For example, the values 10 12 12 15 19 1 1 1 0 0 0 0 0 0 1 0 may be coded as (10,0)(12,1)(15,0)(19,0)(1,2)(0,5)(1,0)(0,0), which is usually shorter. The index is the corresponding index of the level index. The CAVLC_Run module 620 is enabled by the following instruction: CAVLC_RUN DST, S2, S1, where, because the ZerosLeft value is updated, DST and S2 can be the same register. The unsigned operands of CAVLC_RUN are as follows:
S1 = Idx (16-bit),
S2 = ZerosLeft (16-bit),
DST = ZerosLeft (16-bit).
As can be seen from Figure 6A, when forwarding registers are used, the CAVLC_RUN instruction has the following format: CAVLC_RUN.F1.F2 DST, SRC2, SRC1, where setting F1 or F2 means that the corresponding forwarding source is used as the input. As for the two register arrays, the level array 622 corresponds to the levels and the run array 624 corresponds to the runs. Each array comprises 16 elements. Each element of the level array 622 holds a 16-bit signed value, and each element of the run array 624 holds a 4-bit unsigned value.
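The run-length example above can be read as a sequence of (value, additional repeats) pairs. The sketch below follows that reading, which is an interpretation of the example rather than a statement of how the CAVLC_Run module works; in particular it is generic run-length coding, not the run-before-zeros semantics of H.264. The function names are hypothetical.

```python
from itertools import groupby
from typing import Iterable, List, Tuple


def to_value_run_pairs(values: Iterable[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive equal values into (value, extra_repeats) pairs."""
    return [(value, len(list(group)) - 1) for value, group in groupby(values)]


def from_value_run_pairs(pairs: Iterable[Tuple[int, int]]) -> List[int]:
    """Expand (value, extra_repeats) pairs back into the original sequence."""
    out: List[int] = []
    for value, extra in pairs:
        out.extend([value] * (extra + 1))
    return out


sequence = [10, 12, 12, 15, 19, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
pairs = to_value_run_pairs(sequence)
```

Sixteen values collapse to eight pairs here; the saving grows with longer runs, which is why the text notes that the coded form is usually shorter.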
The following instruction reads the run and level values from the run array 624 and the level array 622, respectively: READ_LRUN DST, where, in one embodiment, DST comprises four consecutive 128-bit registers (e.g., EU temporary or common registers). This operation reads the level registers 622 and the run registers 624 in the CAVLC unit 530 and stores them in the destination registers DST. When the runs are read and stored in the registers, the run values are converted to 16-bit unsigned values. For example, the first two registers hold the sixteen 16-bit level values (i.e., the array stores the first 16 coefficients), and the third and fourth registers hold the sixteen 16-bit run values. If there are more than 16 coefficients, they are decoded to memory. In one embodiment, the values are written in the following order: in the first register, the least significant 16 bits contain LEVEL[0], bits 16-31 contain LEVEL[1], and so on until bits 112-127 contain LEVEL[7]; in the second register, the least significant 16 bits contain LEVEL[8], and so on. The run values use the same arrangement. Another instruction, which clears the run array 624 and level array 622 registers, has the following format: CLR_LRUN. The software (shader program) and hardware operations (e.g., modules) of the decoding system 200 (e.g., CAVLC unit 530) described above can be represented by the following pseudocode:

residual_block_cavlc( coeffLevel, maxNumCoeff ) {
    CLR_LEVEL_RUN
    if( TotalCoeff( coeff_token ) > 0 ) {
        if( TotalCoeff( coeff_token ) > 10 && TrailingOnes( coeff_token ) < 3 )
            suffixLength = 1
        else
            suffixLength = 0
        CAVLC_level0( );
        for( i = TrailingOnes( coeff_token ); i < TotalCoeff( coeff_token ); i++ ) {
            CAVLC_levelCode( levelCode, suffixLength );
            CAVLC_level( suffixLength, i, levelCode )
        }
        CAVLC_ZerosLeft( ZerosLeft, maxNumCoeff )
        for( i = 0; i < TotalCoeff( coeff_token ) - 1; i++ ) {
            CAVLC_run( i, ZerosLeft )
        }
        READ_LEVEL_RUN( level, run )
        run[ TotalCoeff( coeff_token ) - 1 ] = zerosLeft
        coeffNum = -1
        for( i = TotalCoeff( coeff_token ) - 1; i >= 0; i-- ) {
            coeffNum += run[ i ] + 1
            coeffLevel[ coeffNum ] = level[ i ]
        }
    }
}
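The last steps of the pseudocode (reading the level and run arrays, then placing each level at its coefficient position) can be modeled in software. The sketch below assumes the register layout described for the read instruction: four 128-bit registers, each holding eight 16-bit lanes with element 0 in the least significant bits, levels signed and runs unsigned. The function names and the use of Python integers to stand in for 128-bit registers are illustrative assumptions.

```python
from typing import List, Tuple

LANES = 8  # eight 16-bit lanes per 128-bit register


def unpack_lanes(reg: int, signed: bool) -> List[int]:
    """Split a 128-bit register value into eight 16-bit lanes, lowest lane first."""
    out = []
    for lane in range(LANES):
        value = (reg >> (16 * lane)) & 0xFFFF
        if signed and value & 0x8000:
            value -= 0x10000  # reinterpret as two's-complement 16-bit
        out.append(value)
    return out


def read_level_run(dst: List[int]) -> Tuple[List[int], List[int]]:
    """Registers 0-1 hold the 16 levels, registers 2-3 hold the 16 runs."""
    level = unpack_lanes(dst[0], True) + unpack_lanes(dst[1], True)
    run = unpack_lanes(dst[2], False) + unpack_lanes(dst[3], False)
    return level, run


def rebuild_coeff_level(level: List[int], run: List[int], total_coeff: int,
                        zeros_left: int, max_num_coeff: int = 16) -> List[int]:
    """Mirror the final loop of the pseudocode: place each level after its run."""
    coeff_level = [0] * max_num_coeff
    if total_coeff == 0:
        return coeff_level
    run = list(run)
    run[total_coeff - 1] = zeros_left  # the last run takes the leftover zeros
    coeff_num = -1
    for i in range(total_coeff - 1, -1, -1):
        coeff_num += run[i] + 1
        if coeff_num >= max_num_coeff:
            raise ValueError("runs exceed maxNumCoeff")
        coeff_level[coeff_num] = level[i]
    return coeff_level


# Hypothetical register contents: levels 1, -2, 3 and run[0] = 1.
dst = [1 | (0xFFFE << 16) | (3 << 32), 0, 1, 0]
level, run = read_level_run(dst)
coeffs = rebuild_coeff_level(level, run, total_coeff=3, zeros_left=2)
```

Because the loop walks from the last decoded level back to the first, the levels end up in increasing coefficient order even though they were decoded in the opposite order.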

It should be emphasized that the above-described embodiments, including any "preferred" embodiments, are merely possible examples of implementations, set forth only to provide a clear understanding of the principles of the invention. Variations and modifications may be made to the above-described embodiments without departing from the spirit and principles of the systems and methods described herein. All such modifications and variations are intended to be included within the scope of this disclosure and protected by the appended claims.

[Brief Description of the Drawings]
Many aspects of the disclosed embodiments can be better understood with reference to the following drawings. The components in the drawings are not necessarily drawn to scale, emphasis instead being placed on illustrating the principles of the invention. Like reference numerals designate corresponding parts throughout the several views.
Figure 1: Block diagram of an embodiment of a graphics processor system in which various embodiments of decoding systems (and methods) can be implemented.
Figure 2: Block diagram of an exemplary processing environment in which various embodiments of decoding systems can be implemented.
Figure 3: Block diagram of select components of the exemplary processing environment shown in Figure 2.
Figure 4: Block diagram of a computational core of the exemplary processing environment shown in Figures 2 and 3, in which various embodiments of decoding systems can be implemented.
Figure 5A: Block diagram of select components of an execution unit of the computational core shown in Figure 4, in which various embodiments of decoding systems can be implemented.
Figure 5B: Block diagram of an execution unit data path in which various embodiments of decoding systems can be implemented.
Figure 6A: Block diagram of an embodiment of the decoding system shown in Figure 5B.
Figure 6B: Block diagram of an embodiment of a bitstream buffer of the decoding system of Figure 6A.
Figure 6C: Block diagram of an embodiment of a context memory structure and associated registers of the decoding system of Figure 6A.
Figure 6D: Block diagram of an embodiment of a table structure used by the decoding system for CAVLC decoding.

[Description of Main Reference Numerals]
The elements shown in the drawings are listed as follows:

100 graphics processor system
102 display device
104 display interface unit
106 local memory
110 memory interface unit
114 graphics processing unit
118 bus interface unit
122 chipset
124 system memory
126 central processing unit
128 driver software
200 decoding system
202 graphics processor
204 computational core
206 execution unit pool control and vertex/stream cache unit
208 graphics pipeline
302 texture filtering unit
304 pixel packer
306 command stream processor
308 write-back unit
310 texture address generator
402 execution unit input
404 execution unit output
406 memory access unit
408 L2 cache
410 memory interface arbiter
412 execution unit pool
413 connection
420 execution unit
504 instruction cache controller
506 thread controller
508 buffer
510 common register file
512 execution unit data path
514 execution unit data path FIFO buffer
516 predicate register file
518 scalar register file
520 data out controller
524 thread task interface
526 register file
528 multiplexer
530 CAVLC unit
532 vector floating point unit
534 vector integer arithmetic logic unit
536 special purpose unit
538 multiplexer
540 register file
542 operation signal line
544 current signal line
602 shift register (SREG)-stream buffer/direct memory access engine
604 macroblock neighbor context memory
606 global register
608 local register
610 coefficient token (coeff_token) module
612 level code module
614 level module
616 level 0 module
618 zero level module
620 run module
622 level array
624 run array
626, 628, 630, 640, 642 multiplexers
658 feedback connection
660 CAVLC logic
661, 663 operand registers
665, 667 forwarding registers
681 array element
683 top pointer
684 left mbNeighCtx
685 left pointer
686 current mbNeighCtx
687 current pointer

Claims

X. Scope of the patent application:

1. A decoding system, comprising: a software programmable core processing unit having a context-adaptive variable length coding (CAVLC) unit configured to execute a shader, the shader configured to implement CAVLC decoding of a video stream and provide a decoded data output, wherein the CAVLC decoding is implemented by a combination of hardware and software.

2. The system of claim 1, wherein the CAVLC decoding is performed within the context of graphics processing unit programming, with a hardware implementation in a graphics processing unit data path.

3. The system of claim 1, wherein the CAVLC unit further comprises a coefficient token (coeff_token) module configured to receive macroblock information and, responsive to a first instruction (CAVLC_TOTC) of the shader, provide trailing ones information and total coefficient information.

4. The system of claim 3, wherein the CAVLC unit further comprises a level (CAVLC_Level) module configured to receive the trailing ones information and level code information and, responsive to a second instruction (CAVLC_LVL) of the shader, provide suffix length information and level index (Level[Idx]) information.

5. The system of claim 4, wherein the CAVLC unit further comprises a level code (CAVLC_LevelCode) module configured to receive the suffix length information and, responsive to a third instruction (CAVLC_LC) of the shader, provide the level code information to the level module.

6. The system of claim 5, wherein the level code module receives the suffix length information from a forwarding register or an execution unit register.

7. The system of claim 5, wherein the level code module receives the suffix length information and the level index information from a forwarding register or an execution unit register, the level index information being incremented.

8. The system of claim 4, wherein the CAVLC unit further comprises a level 0 (CAVLC_L0) module configured to receive the trailing ones information and, responsive to a fourth instruction (CAVLC_LVL0) of the shader, provide second level index (Level[Idx]) information to a level array.

9. The system of claim 8, wherein the CAVLC unit further comprises a zero level (CAVLC_ZL) module configured to receive the total coefficient information and maximum number of coefficients information and, responsive to a fifth instruction (CAVLC_ZL) of the shader, provide zeros left information and a reset value to a first multiplexer and a second multiplexer.

10. The system of claim 9, wherein the CAVLC unit further comprises a run (CAVLC_Run) module configured to receive the zeros left information and second index information from the first multiplexer and the second multiplexer, respectively, and, responsive to a sixth instruction (CAVLC_RUN) of the shader, provide a run index (Run[Idx]) to a run array.

11. The system of claim 10, wherein the first multiplexer and the second multiplexer receive the zeros left information and the second index information from a first forwarding register and a second forwarding register, respectively.

12. The system of claim 10, wherein the level array and the run array provide decoded level values and decoded run values responsive to a seventh instruction (READ_LRUN) of the shader, and are cleared responsive to an eighth instruction (CLR_LRUN) of the shader.

13. The system of claim 1, wherein the CAVLC unit uses bits within an instruction to determine whether a result of a previous operation stored in an internal register, or data in a source operand, is to be used by one or more modules for a current operation.

14. The system of claim 1, wherein the CAVLC unit further comprises a direct memory access (DMA) engine module comprising a bitstream buffer and a DMA engine, the DMA engine module configured, responsive to the shader executing an instruction once for each slice, to automatically and repeatedly refill a predetermined number of bits when the predetermined number of bits in the bitstream buffer have been consumed, the bits corresponding to the video stream.

15. The system of claim 14, wherein the CAVLC unit stalls the DMA engine module when an underflow is possible in the bitstream buffer.

16. The system of claim 14, wherein the DMA engine module is configured to track the number of bits consumed in the bitstream buffer and, responsive to detecting that the number of bits is greater than a predetermined value, suspend bitstream buffer operation and transfer control to a host processor.

17. A decoding method, comprising the steps of: loading a shader into a programmable core processing unit having a CAVLC unit; executing the shader on the CAVLC unit to perform CAVLC decoding of a video stream; and providing a decoded data output.

18. The method of claim 17, wherein the CAVLC decoding is performed within the context of graphics processing unit programming, with a hardware implementation in a graphics processing unit data path.

19. The method of claim 17, further comprising the steps of: receiving macroblock information at a coefficient token (coeff_token) module of the CAVLC unit; responsive to a first instruction (CAVLC_TOTC) of the shader, providing trailing ones information and total coefficient information; receiving the trailing ones information and level code information at a level (CAVLC_Level) module of the CAVLC unit; responsive to a second instruction (CAVLC_LVL) of the shader, providing suffix length information and level index (Level[Idx]) information; receiving the suffix length information at a level code (CAVLC_LevelCode) module of the CAVLC unit; and responsive to a third instruction (CAVLC_LC) of the shader, providing the level code information to the level module.

20. The method of claim 19, wherein the level code module receives the suffix length information and the level index information from a forwarding register or an execution unit register, the level index information being incremented.

21. The method of claim 19, further comprising the steps of: receiving the trailing ones information at a level 0 (CAVLC_L0) module of the CAVLC unit; and responsive to a fourth instruction (CAVLC_LVL0) of the shader, providing second level index (Level[Idx]) information to a level array.

22. The method of claim 21, further comprising the steps of: receiving the total coefficient information and maximum number of coefficients information at a zero level (CAVLC_ZL) module of the CAVLC unit; responsive to a fifth instruction (CAVLC_ZL) of the shader, providing zeros left information and a reset value to a first multiplexer and a second multiplexer; receiving, at a run (CAVLC_Run) module of the CAVLC unit, the zeros left information and second index information from the first multiplexer and the second multiplexer, respectively; and responsive to a sixth instruction (CAVLC_RUN) of the shader, providing a run index (Run[Idx]) to a run array.

23. The method of claim 22, wherein the first multiplexer and the second multiplexer receive the zeros left information and the second index information from a first forwarding register and a second forwarding register, respectively.

24. The method of claim 22, wherein the level array and the run array provide decoded level values and decoded run values responsive to a seventh instruction (READ_LRUN) of the shader.

25. The method of claim 22, wherein the level array and the run array are cleared responsive to an eighth instruction (CLR_LRUN) of the shader.

26. The method of claim 17, further comprising the step of: using, by the CAVLC unit, bits within an instruction to determine whether a result of a previous operation stored in an internal register, or data in a source operand, is to be used by one or more modules for a current operation.

27. The method of claim 17, further comprising the step of: responsive to the shader executing an instruction once for each slice, automatically and repeatedly refilling a predetermined number of bits when the predetermined number of bits in a bitstream buffer have been consumed, the bits corresponding to the video stream.

28. The method of claim 27, further comprising the step of: stalling the consumption of bits in the bitstream buffer when an underflow is possible in the bitstream buffer.

29. The method of claim 27, further comprising the steps of: tracking the number of bits consumed in the bitstream buffer; and, responsive to detecting that the number of bits is greater than a predetermined value, suspending bitstream buffer operation and transferring control to a host processor.