TW200813884A - Decoding of context adaptive binary arithmetic codes in computational core of programmable graphics processing unit - Google Patents
Decoding of context adaptive binary arithmetic codes in computational core of programmable graphics processing unit Download PDFInfo
- Publication number
- TW200813884A TW200813884A TW96120896A TW96120896A TW200813884A TW 200813884 A TW200813884 A TW 200813884A TW 96120896 A TW96120896 A TW 96120896A TW 96120896 A TW96120896 A TW 96120896A TW 200813884 A TW200813884 A TW 200813884A
- Authority
- TW
- Taiwan
- Prior art keywords
- decoding
- cabac
- bit
- module
- unit
- Prior art date
Links
Landscapes
- Image Generation (AREA)
- Image Processing (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
Description
200813884 九、發明說明: 【發明所屬之技術領域】 本發明係有關於資料處理系統,尤指可程式之圖形處 理系統及方法。 【先前技術】 電腦綠圖乃是以電腦產生圖像、影像或其他圖形或· 像資訊之一門藝術和科學,目前的繪圖系統多包含數個< 面,例如微軟的Direct3D介面及OpenGL等等,如此可方 執行特定作業系統(如微軟的WINDOWS)的電腦上控讳 諸如圖形加速器或圖形處理單元(graphics prQeessin unit,GPU)等的多媒體硬體,圖像、影像之產生常被稱j 「描繪成像(rendering)」,此類操作的細節一般是由圖形力 速器進行,於三維(3D)電麟圖中,構成場景中物件表迁 ^或物體)之幾何形狀經轉變騎素(圖形單元 :於晝面緩衝區(frame buffer)中,接著顯示於顯; =質二或:群都有與表面外觀有關的特定视^ 義成物件或物:群的描等等’這些可⑽ 度及耗能’現已發展出許‘。== 200813884 車父少的位元數產生較佳影像的品質,例如g 264於準(又 稱為ISO動晝專家群組MPEG-4第十部)是一種^壓縮數 位視訊編碼標準,與MPEG-2相容之編碼相比,Η·264相 容之編碼僅需要差不多三分之一的位元數,即可儲存同樣 視頻品質的視訊,Η.264標準提供兩種熵(entr〇py)解碼程 序’》別疋内谷適應性二進位算術編碼(c〇ntext_adapHve binary arithmetic coding,CABAC)以及内容適應性可變長 度編碼(context-adaptive variable length coding,CAVLC), 關於CABAC,其解碼運算通常是依序處理,需要大量的 计异以得到範圍、補償、内容(c〇ntext)資訊等參數,目前 CABAC解碼結構可滿足消費者的部分需求,但是在設計 上仍有其限制。 【發明内容】 本發明揭露一種内容適應性二進位算術編碼 (context-adaptive binary arithmetic coding,CABAC)之解碼 系統及方法(之後簡稱為解碼系統),運用於圖形處理單元 (graphics processing unit,GPU)内之多執行緒(multithread) 平行計算核心,簡單地說,於一實施例中,本系統包含一 軟體可程式核心處理單元,其内具有一 CABAC單元以執 行一著色器(shader),該著色器可以執行視訊流的CABAC 解碼,並提供一解碼資料輸出。 方法實施例則包括將著色器載入具有CABAC單元之 6 200813884 ^式核心處料元,⑽从執行該著色⑽CABAC解 -視讯流,亚提供一解碼資料輸出。 ,、、、驾本技藝人士於檢視以下圖式及詳細說明之後,當 i f演出其他系統、方法、特徵及優點,所有此等推演的 二、•方法、特徵及優綱屬本發明之範®,受到如附申 言月專利範圍之保護。 【實施方式】 本發明揭示了多種内容適應性二進位算術編碼 —咖酬化c〇_ 系統及方法(之後將通稱為解碼系統),於一實施例中,解 碼系統係内嵌於圖形處理單元㈣沖化pro·—她, GP聊可程式、多執行緒、平行計算核心、之_個或多個執 行單元中,利用軟體結合硬體之方式來達成解碼功能,亦 即視訊解碼是以圖形處理單元編程(pr〇gramming)的内容 (context)配合施行於圖形處理單元資料路徑内的硬體所完 成,舉個例子,解碼運算或方法係由具有擴充指令集 (extended instruction set)的著色器(shader,如頂點著色 裔)、圖形處理單元的執行單元資料路徑、以及用於cabac 處理環境中的自動管理位元流緩衝器及内容模型分析 (context modeling)之附加硬體所共同完成,不像已知的舊 有系統,僅具有單純硬體或單純軟體之CABAC處理方 法’因此或多或少會遇到於先前技術中所提到的問題。 7 200813884 另外,自動位元流緩衝器具備一些優點,例如,一旦 位元"丨L緩衝裔的直接記憶體存取(此㈣mem〇ry咖咖, DMA)引擎得知位元流的位置(位址),便會自動管理位元 流而不需要進一步的指令,這樣的機制就跟傳統的微處理 器或數位U虎處理器(digkai sjgnai pr〇cess〇r,dsp)不同, 位兀流管理不再代表大量的間接費用,再則,透過記錄已 使用的位元數里,位元流緩衝器機制可以偵測和處理錯誤 的位元流。 本案解碼系統的另一個優點是可以減少指令延遲 (patency),因為CABAC解碼是非常連續的動作,不易利用 夕執行緒,因此在各種實施例中就會使用一種轉 減少等待延遲,例如暫存器轉遞㈣ister fGrwardii= ^200813884 IX. Description of the Invention: [Technical Field of the Invention] The present invention relates to a data processing system, and more particularly to a programmable graphics processing system and method. [Prior Art] The computer green image is a computer generated image, image or other graphic or image information. The current drawing system contains several <faces, such as Microsoft's Direct3D interface and OpenGL, etc. Therefore, it is possible to control multimedia hardware such as graphics accelerators or graphics processing units (GPUs) on a computer running a specific operating system (such as Microsoft's WINDOWS). The generation of images and images is often called j " "Drawing", the details of such operations are generally carried out by a graphic speed device, in the three-dimensional (3D) electric lining diagram, the geometry of the objects in the scene or the object is transformed into a figure (graphics) Unit: in the frame buffer, then displayed in the display; = quality two or: the group has a specific visual object or object related to the appearance of the surface: group description, etc. 'These can (10) degrees and Energy consumption has been developed by Xu. == 200813884 The number of bits in the car's father produces better image quality, such as g 264 in the standard (also known as ISO 昼 expert group MPEG-4 part 10) is A compression digital view Coding standard, compared to MPEG-2 compatible encoding, Η·264 compatible encoding requires only about one-third of the number of bits to store video of the same video quality. The .264 standard provides two entropies. (entr〇py) Decoding program 'C疋ntext_adapHve binary arithmetic coding (CABAC) and context-adaptive variable length coding (CAVLC), about CABAC The decoding operation is usually processed sequentially, which requires a large number of calculations to obtain parameters such as range, compensation, and content (c〇ntext) information. Currently, the CABAC decoding structure can meet some of the needs of consumers, but there are still designs in it. [Description of the Invention] The present invention discloses a content adaptive adaptive arithmetic coding (CABAC) decoding system and method (hereinafter referred to as a decoding system), which is applied to a graphics processing unit (graphics processing unit, Multithreaded parallel computing core within GPU). Simply put, in one embodiment, the system includes a soft A programmable core processing unit having a CABAC unit therein for performing a shader that performs CABAC decoding of the video stream and provides a decoded data output. The method embodiment includes loading the shader with The CABAC unit 6 200813884 ^-type core processing element, (10) from performing the coloring (10) CABAC solution-video stream, sub-providing a decoded data output. After reviewing the following drawings and detailed descriptions, if the show performs other systems, methods, features and advantages, all such derivations, methods, features and advantages belong to the scope of the present invention. , subject to the protection of the patent scope of the attached statement. [Embodiment] The present invention discloses a plurality of content adaptive binary arithmetic coding-systems and methods (hereinafter generally referred to as a decoding system). In an embodiment, the decoding system is embedded in a graphics processing unit. (4) Chonghua pro·—She, GP chat program, multi-thread, parallel computing core, _ or more execution units, use software to combine hardware to achieve decoding function, that is, video decoding is graphical The content of the processing unit (pr〇gramming) is done in conjunction with the hardware implemented in the data path of the graphics processing unit. For example, the decoding operation or method is a colorizer with an extended instruction set. (shader, such as vertex coloring), execution unit data path of the graphics processing unit, and additional hardware for automatic management of bit stream buffers and context modeling in the cabac processing environment, not Like the known old systems, only CABAC processing methods with simple hardware or simple software' are therefore more or less encountered in the prior art. The problem mentioned. 7 200813884 In addition, the automatic bit stream buffer has some advantages, for example, once the bit "丨L buffered direct memory access (this (4) mem〇ry café, DMA) engine knows the location of the bit stream ( The address is automatically managed by the bit stream without further instructions. This mechanism is different from the traditional microprocessor or digital U tiger processor (digkai sjgnai pr〇cess〇r, dsp). Management no longer represents a large amount of overhead, and again, by recording the number of bits used, the bitstream buffer mechanism can detect and process the wrong bitstream. Another advantage of the decoding system of the present invention is that the instruction delay can be reduced, because CABAC decoding is a very continuous action, and it is not easy to utilize the eve thread, so in various embodiments, a turn-reduction delay is used, such as a temporary register. Transfer (4)ister fGrwardii= ^
步解釋,便疋深管(deep-pipeline)及多執行緒處理器無法以 同一執行緒在每一週期執行指令,有些系統利用一般轉遞 (general fomarding),是藉由檢查前次產生的運算元 (fperand)位址以及指令運算元位址(如果相同,則使用前 次產生運算元),此種一般轉遞需要複雜的比較及多工動 作。在某些解碼純實施射’會使用不同的轉遞方式, 不管是利用前次計算結果(如保留在内部暫存器)還是來 ,運算元的資料,均·指令中的位元(例如總共2位元, 每-運算元使用〗位元)來編碼,藉由這種方式,可以減 少整體的延遲,改善處理器管線的效率。 訊標準部門(intemational 這裡描述的解碼系統可以利用已知的國際電信聯盟通Step interpretation, deep-pipeline and multi-threaded processors can't execute instructions in each cycle with the same thread. Some systems use general fomarding by checking the previous generated operations. The fperand address and the instruction operand address (if the same, the previous generation of the operand is used), this general transfer requires complex comparisons and multiplex operations. In some decoding pure implementations, 'will use different delivery methods, whether using the previous calculation results (such as retained in the internal register) or the data of the operation elements, all bits in the instruction (for example, total The 2-bit, per-operating element is encoded using the bit). In this way, the overall delay can be reduced and the efficiency of the processor pipeline can be improved. The standard department (induction) The decoding system described here can utilize the known International Telecommunications Union
Telecommunication Union 8 200813884Telecommunication Union 8 200813884
Telecommunication Standardization Sector,ITU-T) H.264 標 準,根據執行從圖形處理單元晝面緩衝器記憶體或主處理 斋(如中央處理單元(central processing unit,CPU))記憶 體所接收到的一個或多個指令組(如藉由預載入(prel〇ad) 等已知機制或是快取失敗等),多種解碼系統實施例即可進 行運算。 第一圖係圖形處理器系統〗〇〇實施例之方塊圖,其中 介紹了解碼系統及方法,於某些實施方式中,圖形處理器 系統100可為電腦系統,其中,圖形處理器系統1〇〇可包 含由顯示介面單元(display interface unit,DIU) 104驅動的 顯示裝置102以及區域記憶體i〇6 (可包含顯示緩衝器、 晝面緩衝器、紋理緩衝器、命令緩衝器等等),區域記憶體 106可以晝面緩衝器或儲存單元取代,區域記憶體透 過一個或多個記憶體介面單元(mein〇ry interface unit, MIU)110 連接至圖形處理單元(graphics processing unit, GPU)114,於一實施例中,記憶體介面單元11()、圖形處理 單元114、顯示介面單元1〇4三者連接至高速週邊組件互 連(peripheral component interconnect express,PCI-E)扭容之 匯流排介面單元(bus interface unit,BIU)118,於一實施例 中,匯流排介面單元118可以使用圖形位址重繪表(graphics address remapping table,GART),當然也可使用其他記憶 體繪圖機制,圖形處理單元114包含解碼系統200,稍後 會針對此部分作進一步的說明,雖然於某些實施例中將圖 形處理單元元114内的解碼系統200晝成一個元件,但是 9 200813884 圖形處理器系統100的繪 解碼系統200其實可以包含更多 示或未繪示元件。 匯流排介面單元;查拉s R A + 連接至日日片組122 (如北橋晶片 組)或開關,晶片细A人 • 、 匕δ介面電路(interface electromcs),以增強從中央Telecommunication Standardization Sector (ITU-T) H.264 standard, based on the implementation of the memory received from the graphics processing unit buffer memory or main processing fast (such as central processing unit (CPU)) Multiple instruction sets (such as by known mechanisms such as preload (prel〇ad) or cache failures, etc.), multiple decoding system embodiments can perform operations. The first diagram is a block diagram of an embodiment, in which a decoding system and method are described. In some embodiments, the graphics processor system 100 can be a computer system, wherein the graphics processor system The display device 102 and the area memory i〇6 (which may include a display buffer, a buffer buffer, a texture buffer, a command buffer, etc.), which are driven by a display interface unit (DIU) 104, may be included. The area memory 106 may be replaced by a face buffer or a storage unit, and the area memory is connected to a graphics processing unit (GPU) 114 through one or more memory interface units (MIUs) 110. In one embodiment, the memory interface unit 11 (), the graphics processing unit 114, and the display interface unit 1〇4 are connected to a high-speed peripheral component interconnect express (PCI-E) twist-capacity bus interface. A bus interface unit (BIU) 118. In one embodiment, the bus interface unit 118 can use a graphical address repainting table (graphics address re Mapping table, GART), of course, other memory drawing mechanisms can also be used. Graphics processing unit 114 includes decoding system 200, which will be further described later, although in some embodiments graphics processing unit 114 will be The decoding system 200 is organized into one component, but the graphics decoding system 200 of the 200813884 graphics processor system 100 may actually include more or no components. Bus interface unit; Chara s R A + connected to day group 122 (such as North Bridge chipset) or switch, wafer fine A person, 匕δ interface circuit (interface electromcs) to enhance from the center
Unit,C導26(又稱主 / 兀(副W P_SSing / ^ 处里為)接收到的訊號,並分離從 糸統記憶體124進出的^士骑访认 ^ ^ L ㈤的仏旒與從輪出入(I/O)裝置進出的訊 號,雖然這裡提到PCI_E匯户挑Unit, C guide 26 (also known as the main / 兀 (sub-W P_SSing / ^) is the signal received, and separate from the 记忆 memory 124 into the ^ 骑 骑 认 ^ ^ (5) 仏旒 and 从In and out (I/O) devices in and out signals, although mentioned here PCI_E remittance pick
η 排協定,不過也可使用其他 的連接及/或通財絲溝通域理1、㈣形處理單元 114 (如PCI、專用高速匯流排等),系統記憶體124還包 =動軟體la ’可利財央處理單元126將指令組或命 々傳送給圖形處理單元114内的暫存器。 在某些實施例中可再另外配置圖形處理單元,利用 PCI-E匯流排協定或其他通訊協定經由晶片組122連接至 第-圖的其他元件,於—實施例中,圖形處理單元可 以包含第-圖的所有树’ #然亦可剔除、新增或改變某 些兀件’例如’可另外增加連接至晶片多且122的南橋晶片 組0 請參閱第二圖,其為例示處理環境之方塊圖,其中應 用-解碼系統200 ’圖形處理單元114包含一圖形處理器 202,圖形處理器202貝,j包含多個執行單元㈣咖⑽也, EU)和t算核心2〇4,於一實施例中,計算核心2〇4包含内 後於執行單元資料路徑(executi〇n _ _摊,腦別的 解碼系統2GG,該資料雜分配至—個❹㈣行單元, 200813884 圖形處理器202還包含執行單元集合(executi〇n unit p〇〇1) 控制及頂點/串流快取記憶體單元2〇6(以後稱為EU集合 控制單元206)以及具有固定功能邏輯(例如,包含三角 形設定單元(triangle set-up unit,TSU)、柵格-圖塊產生器 (span_tile generator,STG)等)的繪圖管線2〇8,計算核心 204包含聯合的多個執行單元,以符合不同著色器程式的 著色裔任務之計算要求,所述著色器程式可包含頂點著色 器、幾何著色器、及/或像素著色器,使繪圖管線2〇8能 處理資料,计异核心204的著色器能進行解碼系統2〇〇的 大邛刀功能’下面將詳細說明圖形處理器的實施例,接著 說明解碼系統200的細節。 解碼系統可以硬體、軟體、韌體或其組合等方式實施, 於較佳實施例中,解碼系統200可包含硬體或軟體,利用 下列已知技術或其組合,例如:具有邏輯閘而可對資料信 唬進行邏輯功能的離散邏輯電路、具有適當組合邏輯閘的 才寸殊應用集成電路(applicati〇n specific integrated circuit, ASIC )、可程式化閘極陣列(programmable gate array, PGA )、~ 式可程式化閘極陣列(field programmable gate array,FPGA)等等元件。 睛茶考第三圖及第四圖,其為圖形處理器2〇2實施例 %擇元件之方塊圖,如前所述,解碼系統2〇〇可以是圖形 處理器202内的著色器,另外加上擴充指令組及其他硬體 元件,以下將說明圖形處理器202及對應程序之實施例, 雖然第二圖與第四圖並未繪出圖形處理所用到的全部元 200813884 弟—圖,可程式處理環境的中心為計算 心204 ’其包含解碼系統綱並可處理各種 可以執行或映射多種著色器程式,如頂點‘ :”程式等’多執行緒處理器的計算核心綱可以 在早一時脈週期内處理多個指令。η row agreement, but other connections and / or money can be used to communicate domain 1, (four) shaped processing unit 114 (such as PCI, dedicated high-speed bus, etc.), system memory 124 also package = dynamic software la ' can The profit processing unit 126 transmits the instruction set or the command to the register in the graphics processing unit 114. In some embodiments, the graphics processing unit may be additionally configured to be connected to other elements of the first diagram via the chipset 122 using a PCI-E bus protocol or other communication protocol. In an embodiment, the graphics processing unit may include -All trees of the graph' may also eliminate, add or change certain components 'for example' may additionally add a south bridge chipset 0 connected to the wafer and 122. Please refer to the second figure, which is a block illustrating the processing environment. Figure, wherein the application-decoding system 200' graphics processing unit 114 includes a graphics processor 202, a graphics processor 202, j includes a plurality of execution units (four) coffee (10) also, EU) and t computing core 2〇4, in one implementation In the example, the computing core 2〇4 includes the execution unit data path (executi〇n__, the brain decoding system 2GG, the data is allocated to the ❹(four) row unit, and the 200813884 graphics processor 202 also includes execution. Unit set (executi〇n unit p〇〇1) control and vertex/streaming cache unit 2〇6 (hereinafter referred to as EU set control unit 206) and with fixed function logic (for example, including triangle setting unit) a drawing pipeline 2〇8 of a triangle set-up unit (TSU), a raster-tile generator (STG), etc., the computing core 204 includes a plurality of joint execution units to conform to different shader programs. For the calculation of the coloring task, the shader program may include a vertex shader, a geometry shader, and/or a pixel shader, so that the drawing pipeline 2〇8 can process the data, and the colorizer of the different core 204 can perform the decoding system. 2 邛 邛 功能 ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' The decoding system 200 may include hardware or software, using the following known techniques or a combination thereof, for example, a discrete logic circuit having a logic gate to perform a logic function on a data signal, and an application having an appropriate combination of logic gates. Integrated circuit (ASIC), programmable gate array (PGA), ~-programmable gate array (field programmable gate array, FPGA) and other components. The third and fourth figures of the eye tea test, which is a block diagram of the embodiment of the graphics processor 2〇2, as described above, the decoding system 2〇〇 It may be a colorizer in the graphics processor 202, plus an extended instruction set and other hardware components. Embodiments of the graphics processor 202 and corresponding programs will be described below, although the second and fourth figures do not depict graphics. All the elements used in the processing 200813884 - the center of the programmable processing environment is the computing core 204 'which contains the decoding system and can handle various types of color program, such as vertex ':" program, etc. The computational core of the processor can process multiple instructions in the early clock cycle.
於第三圖中,圖形處理器202的相關元件包含計算核 心綱、紋理過濾單幻02、像素打包元件3〇4、命令流處 理為306、寫回單幻〇8、以及紋理位址產生器則,第三 圖中的EU集合控制單元2〇6也包含頂點快取記憶體及/ 或串流快取記憶體,另外,第三_紋理過濾、單元302提 供紋素(texel)資料給計算核心2〇4 (輸入a及B),於某些 實施例中,紋素資料為512位元資料。 像素打包元件304提供像素器著色輸入(ps輸入,輸 入C和D)給計算核心204,輸入同樣是512位元資料格 式,另外,像素打包元件3〇4向EU集合控制單元206請 求像素著色器任務,而EU集合控制單元206便會提供指 疋執行單元號碼(EU#)及執行緒號碼(執行緒#)給像素打包 元件304,因為像素打包元件304及紋理過濾單元302是 已知的技術,這裡便不再贅述,雖然第三圖顯示像素及紋 素封包為512位元之資料封包,但是可依各實施例根據圖 形處理器202所需的效能改變其大小。 命令流處理器306提供三角形頂點索引給EU集合控 制單元206,於第三圖的實施例中,索引為256位元之資 12 200813884 料,EU集合控制單元206組合從串流快取記憶體接收到的 頂點著色器輸入,並將這些資料送至計算核心204 (輸入 E); EU集合控制單元206亦組合幾何著色器輸入,並將 這些資料送至計算核心204 (輸入F) ; EU集合控制單元 206另外控制執行單元輸入(EU輸入)4〇2及執行單元輸出 (EU輸出)404 (第四圖),換句話說,EU集合控制單元2〇6 控制計算核心204的各輸入流與輸出流。 經過處理之後,計算核心2 〇 4提供像素著色器輸出(p s 輸出,輸出J1與J2)給寫回單元308,像素著色器輸出包 括色彩資訊,例如紅/綠/藍/透明度(RGBA)資訊,關於 實施例中的資料結構,像素著色器輸出可以是兩條512位 凡之資料流,其他實施例亦可使用其他的位元寬度。In the third figure, the relevant components of the graphics processor 202 include a computational core, a texture filtering single illusion 02, a pixel packing component 〇4, a command stream processing of 306, a writeback singularity 8, and a texture address generator. Then, the EU set control unit 2〇6 in the third figure also includes vertex cache memory and/or stream cache memory. In addition, the third_texture filter and unit 302 provide texel data for calculation. Core 2〇4 (enters a and B), in some embodiments, the texel data is 512-bit data. The pixel packing component 304 provides pixel shader input (ps input, inputs C and D) to the computing core 204, the input is also in the 512-bit data format, and in addition, the pixel packing component 3〇4 requests the pixel shader from the EU collective control unit 206. The task, and the EU collection control unit 206 provides the fingerprint execution unit number (EU#) and the thread number (thread #) to the pixel packing component 304, since the pixel packing component 304 and the texture filtering unit 302 are known techniques. Therefore, the third figure shows that the pixel and texel packets are 512-bit data packets, but the size of the graphics processor 202 can be changed according to the embodiment. The command stream processor 306 provides a triangle vertex index to the EU set control unit 206. In the embodiment of the third figure, the index is 256 bits. 12 200813884, the EU set control unit 206 combines and receives from the stream cache. Go to the vertex shader input and send the data to compute core 204 (input E); EU set control unit 206 also combines the geometry shader inputs and sends the data to compute core 204 (input F); EU set control Unit 206 additionally controls execution unit input (EU input) 4〇2 and execution unit output (EU output) 404 (fourth diagram). In other words, EU set control unit 2〇6 controls each input stream and output of computation core 204. flow. After processing, compute core 2 提供 4 provides pixel shader output (ps output, outputs J1 and J2) to writeback unit 308, which includes color information such as red/green/blue/transparency (RGBA) information. Regarding the data structure in the embodiment, the pixel shader output can be two 512-bit data streams, and other embodiments can use other bit widths.
除了像素者色器輸出,計异核心204亦會輸出紋理座 標(TC,輸出K1及K2)給紋理位址產生器310 ,其中包 括UVRQ資訊,紋理位址產生器31〇向計算核心2〇4的 L2快取記憶體408發出紋理描述符號請求(τ#請求,輸入 X) ’然後計算核心204的L2快取記憶體4〇8會輸出紋理 描述符號資料(τ#資料,輸出w)給紋理位址產生琴31〇, 因為紋理位址產生器310及寫回單元3〇8是已知的技術, ,此這裡不再贅述,再則,雖然畫中顯示urvq&rgba 是512位元之資料,但是此參數亦可隨不同實施例而做變 化,於第三圖的實施例中,匯流排分成兩條512位元通道, 同時傳輪4個像· 位元RGBA色彩值及128位元 13 200813884 繪圖管線208包含固定功能之圖形處理功能,例如,In addition to the pixel colorimeter output, the discriminating core 204 also outputs texture coordinates (TC, outputs K1 and K2) to the texture address generator 310, which includes UVRQ information, and the texture address generator 31 is directed to the computing core 2〇4 The L2 cache memory 408 issues a texture description symbol request (τ#request, input X)' and then computes the core 204's L2 cache memory 4〇8 to output texture description symbol data (τ# data, output w) to the texture The address generates the piano 31〇, because the texture address generator 310 and the write back unit 3〇8 are known techniques, and will not be described here, and again, although the picture shows that urvq&rgba is a 512-bit data. However, this parameter can also be changed according to different embodiments. In the embodiment of the third figure, the bus bar is divided into two 512-bit channels, and at the same time, four images of the bit RGBA color value and 128 bits 13 200813884 Drawing pipeline 208 includes graphics processing functions for fixed functions, for example,
因應從驅動軟體發出之繪製一三角形的命令,頂點資訊通 過計算核心204内的頂點著色器邏輯元件以進行頂點轉 換,物件將從物件空間種換成工作空間及/或螢幕空間的 二角开>,二角形通過計异核心204到達繪圖管線208的三 角形设定單元,結合圖元後進行已知的任務,例如產生邊 界盒(bounding box)、楝選(culling)、產生邊緣功能(edge function generation)及三角形層級剔除(triangle rejection)等,接著三角形設定單元再將資料傳遞至繪圖管 線208中具有圖塊產生功能的栅格及圖塊產生單元,因 此,資料物件被分割成圖塊(例如8χ8、16χ16等),並且傳 遞至其他的蚊功能單^,進行深度(z_值)處理,例如 ζ·值之高階(同樣的程序在高階時使用的位缝比低階少) 剔除’織將z-值傳回計算核心綱的像素絲器邏輯元 件’以根據所做理及管線資料騎像素著色器功能,叶 算ί心綱將ί處理之值輸出至位於_管線208内之目 標早兀’目標早7L在各快取記憶體將更_部值之前進行 α測試及模板測試。 請注意計算核心204的L2快取記憶體408以及EU , 合控制單元施之間有512位元的頂 (spm)資料的傳輸(輸入G) 皿1 y乃外,卄异核心204輸出! 個512位_师取記蝴vc)寫人 給:EU集合控制單元施做進一步的處理。出及 請參閱第四圖,其顯示計算核心綱的其他元侧 200813884In response to the command to draw a triangle from the driver software, the vertex information is converted to the vertex shader logic element in the core 204 for vertex conversion, and the object will be changed from the object space to the workspace and/or the opening of the screen space. The dipole reaches the triangle setting unit of the drawing pipeline 208 through the different core 204, and performs known tasks in combination with the primitive, such as generating a bounding box, culling, and edge function. Generation) and triangle level rejection, etc., and then the triangle setting unit transfers the data to the grid and tile generation unit having the tile generation function in the drawing pipeline 208, so that the data object is divided into tiles (for example, 8χ8, 16χ16, etc.), and passed to other mosquito function sheets ^, for depth (z_value) processing, such as the high order of ζ·values (the same procedure uses less bit seams in higher order than lower order) Pass the z-value back to the pixel outline logic component of the computational core class to ride the pixel shader function according to the pipeline and data. Gang ί processing of the output value to the target is located within the line 208 earlier Wu _ 'early target for α 7L before testing and stencil test values of the respective portions will be _ cache. Please note that the core 204's L2 cache memory 408 and EU, and the control unit have a 512-bit top (spm) data transmission (input G), and the different core 204 output! 512-bit _ teacher to take the butterfly vc) Writer: EU collection control unit for further processing. Please refer to the fourth figure, which shows the other meta-sides of the calculation core. 200813884
關元件’計算核心204包含具有—個或多個執行單元 420a〜4施(以後通稱執行單元伽)的執行單元集合(eu 集合)412,每-個執行單元可以在—個時脈週期内處 理多個指令,因此,執行單元集合化在尖峰時可以同時 f幾乎同時處理多個執行緒,儘管第四圖騎出8個執行 單兀(EU0〜EU7)’但是並不表示限制其數量為8,於立他 實施例可以增加或減少數量,其中至少一個執行單元⑷ 如ETO^Oa)具有-解碼系統2〇〇,詳細說明如下。 計算核心204純含錢、财取衫(me_y _ss MXU)4G6 ’記憶體存取單^撕藉由記憶體介面仲 裁為410與L2快取記憶體4〇8連接,L2快取 從即集合控制單元2〇6接收頂點快取記憶體溢出資料(輸 入G),亚提供頂職取記憶體溢出㈣( =控制單元挪,另外,L2快取記憶體彻從紋理;J 產生器3U)接收紋理描述符號請求(τ#請求,輸入 因應接收_辑求,提供紋理描述符號資料(Τ#資料, 輸出W)給紋理位址產生器31〇。 、 記憶體介面仲裁器410提供了區域視訊記憶體(如金 面緩衝器或區域記憶體觸)的控制介面,匯流排介面ς π 118則提供了系統的介面’其可為pci_E匯流排,纪 體介面仲裁器410和匯流排介面單元118做為記憶體及; 絲記憶體搁之_介面,料些實施财,u快^ f思體408藉由記憶體存取單元 ° 以及匯流排介面單元118連接,;e::= 15 200813884 會把從L2快取記憶體姻及其他區塊得到的虛擬記憶體 位址轉換成實際記憶體位址。 此體;I面仲裁$ 41〇提供L2快取記憶體的記憶體 存取(如讀/寫存取),可提取指令/常數/資料/紋理、 直接記憶體存取(如載入/儲存)、索引暫存存取、暫存器 祕出、頂點快取記體内容溢出等等。 , ^算核心204還包含執行單元輸入(EU輸入)402和 _ ^仃單兀輸出(EU輸出)姻’分別用於提供執行單元集 “12的輸入以及接收執行單元集合412的輸出,執行單 ^輸入搬和執行單元輸出404可以是交換開關(刪sbar) 或匯流排,或是其他已知的輸人及輸出機制。 ^執行單元輸入402從EU集合控制單元獅接收頂點 者色器輸人(輸人E)以及幾何著色II輸人(輸人F),、缺 後將資訊提供給執行單元集合M2,讓各執行單元42〇去 處理;另外’執行單元輸入4〇2接收像素著色器輸入(輸 • 及D)及紋素封包(輸人a及B),並將這些封包傳 运至執行單元集合412 ’讓各執行單it 42G去處理;再者, 執行單元輸入402從L2快取記憶體4〇8 取),織在錢時將处資訊提供給執行 第四圖實施例的執行單元輸出404會分成偶輸出4〇4a 和可輸出404b,執行單元輸出4〇4和執行單元輸入搬一 樣可為交換開關或匯流排,或是其他已知的架構,執行單 兀偶輸出404a處理偶執行單元伽、撕、條、 的輸出,喊行單元奇輪處理奇執行單元働、 16 200813884 4施、鹽、观的輸出,總而言之,兩 他和娜共同接收執行單元集合化的輪出,如= 及刪A資料,這些輪出可傳回L2快取記憶體猶,或= 從計异核心、2G4經由:η及j2輸出至寫回料 γ 由Κ1及Κ2輸出至紋理位址產生器31〇。 或疋! 4 ^行單元集合412的執行單元流通常包含數個層級, 如描以容層級、執行緒或任務層級、指The closing element 'computing core 204' includes an execution unit set (eu set) 412 having one or more execution units 420a-4 (hereinafter collectively referred to as execution unit gammas), each of which can be processed in one clock cycle Multiple instructions, therefore, the execution unit aggregation can simultaneously process multiple threads at the same time f at the same time, although the fourth figure rides 8 execution orders (EU0~EU7)' but does not mean that the number is 8 The U.S. embodiment can increase or decrease the number, wherein at least one execution unit (4) such as ETO^Oa has a decoding system 2, as described in detail below. Computation core 204 pure money, money take shirt (me_y _ss MXU) 4G6 'memory access list ^ tear by memory interface arbitration 410 and L2 cache memory 4〇8 connection, L2 cache slave collection control Unit 2〇6 receives the vertex cache memory overflow data (input G), sub-provids the top job memory overflow (four) (= control unit shift, in addition, L2 cache memory from the texture; J generator 3U) receives the texture The description symbol request (τ# request, the input provides the texture description symbol data (Τ# data, output W) to the texture address generator 31. The memory interface arbiter 410 provides the area video memory. The control interface (such as the gold surface buffer or the area memory touch), the bus interface ς π 118 provides the interface of the system 'which can be the pci_E bus, the interface interface arbiter 410 and the bus interface unit 118 as The memory and the silk memory are placed on the interface, and some of the implementations are implemented. u fast ^ f body 408 is connected by the memory access unit ° and the bus interface unit 118; e::= 15 200813884 L2 cache memory and other blocks get virtual The memory address is converted into the actual memory address. This body; I face arbitration $ 41 〇 provides L2 cache memory memory access (such as read / write access), extractable instructions / constant / data / texture, direct Memory access (such as load/store), index temporary access, scratchpad secret, vertex cache, content overflow, etc. The calculation core 204 also contains execution unit input (EU input) 402 and _ ^ 仃 兀 兀 output (EU output) </ br> is used to provide the input of the execution unit set "12 and receive the output of the execution unit set 412, the execution of the input and execution unit output 404 can be a switch (deletion sbar) Or busbars, or other known input and output mechanisms. ^Execution unit input 402 receives the vertex from the EU set control unit lion (input E) and geometric coloring II (input F) After the absence, the information is provided to the execution unit set M2, and the execution units 42 are removed for processing; and the 'execution unit input 4〇2 receives the pixel shader input (transmission and D) and the texel packet (input and B) and transport these packets to execution The meta-collection 412 'let each execution unit it 42G to process; further, the execution unit input 402 is taken from the L2 cache memory 4〇8), and the information is provided to the execution unit of the fourth diagram embodiment when the money is woven. The output 404 is divided into an even output 4〇4a and an outputtable 404b, and the execution unit output 4〇4 and the execution unit input can be exchange switches or busbars, or other known architectures, and the single-coupled output 404a is processed. Execution unit gamma, tear, strip, output, shouting unit odd wheel processing odd execution unit 働, 16 200813884 4 Shi, salt, view output, in short, two He and Na jointly receive the execution unit integration round, such as = and delete the A data, these rounds can be returned to the L2 cache memory, or = from the different core, 2G4 via: η and j2 output to write back γ output from Κ1 and Κ2 to the texture address generator 31 Hey. Or hey! The execution unit stream of the 4^ row unit set 412 usually contains several levels, such as the level of the hierarchy, the thread or the task level,
ί任=:Γ每:執行單元420可能准許兩個二内 合八中利用位70旗標或其他機制識別其描缘内容 2這個内容的任務開始之前,從.即集合控制單元寫 二?:資訊可為著色器種類、輸入/輸 各常數= ^ 420 . 21 ^令。於貝_巾,每-執行緒根據程式計數器提取一 EU集合控制單元206類似總任務 (data-driven)紐(如輸人職_了諸、像素、幾貝何= 動 :口執,20内的適當執行緒,舉例來說 ? 制^獅指派-個執行緒給執行單元集合412的執= = 420内的-個空執行緒位置,當—執行緒已開始 戈其他元件或模組(根據著色器種類)所 輪入的貧料會放置在共用暫存緩衝器中。 所 通常圖形處理器202使用可程式頂點、幾何、及像素 17 200813884 緩衝器,不再把這些元件當成具有不同設計及指令組的各 別固疋功月而各別執行或操作這些元件,而是取 合的執仃單70 42Ga、42Gb...42Gn 3己合統-指令組執 订除了執仃單70 42(^(這個執行單元包含解碼系統2〇〇, ,此具有額外的功能)之外,每一個用於程式運算的執行 ^ 420之设計與結構均相同,於一實施例中,每一個執 二單兀42〇可以進仃多執行緒運算,當頂點著色器、幾何 者色器、像素著色器等產生不同的著色器任務,這些著色 器任務將送至個別的執行單元420去執行,於一實 中,解碼系統細可使用一頂點著色器,與其他執行單元 同,例如,執行單元42〇a使用-解碼系統, 乂疋,、他執仃早疋(如第四圖之侧b)所沒有的,因為 碼系統200管理一個或多個對應的内部缓衝器 % 係藉由接線祀及執行單元輸入搬自記憶體存= 元406取得資料。 于取早 當生成了個別的任務,E U集合控制單元2 〇 6 些任務給不_行單元的可職行緒矛^ ^集合控制單元細再管理相關執行緒的釋放,二矛:點 ::’EU集合控制單元施負責指派頂點著色器、幾何 色益及像素著色器的任務給執行單元樣的執行緒 δ己錄相關的任務及執行緒,具體來說,Eu : 施會有所有執行單元樣的執行緒及記憶體的= 裡不多做說明),EU集合控制單元施會乂 緒指派給哪—個任務使用、知道哪—個執行緒的任 18 200813884 要釋放、知道佔用多少的共用暫存器檔案記憶體 (register file memory register)、知道每一個執行=—:器 可用空間。 丁早兀有多少 因此,如果已將一個任務指派給一個執行的一 420a ’ EU集合控制單元2〇6會將這個執行緒如 中,然後將全部的共用暫存器播案記憶體減去每 '綠 緒用掉的暫存器檔案機體(f00tpring)數量 j執行 著色器、幾何著色器及像素著色器的狀 一個著色純段可以有不_機體大小,例如,^ 器執行緒可以要| 1〇個共用暫存器檔案暫存二^色 色器執行緒可以僅要求5個暫存器。 素著 當一執行緒完成其被指派的工作,運 行單元420便會發出一訊號給Eu ^ _仃、,者的執 集合控制單…會更新資源表二= 用,,行緒共用暫存器標案空間的數量 間,當所有的執行緒都處於忙碌中或所n: 案記憶體都已分配完(或是保留的暫存器空間^存= 容納額外的執行緒),則該執行單元420算是已、系:隹 合控制單元206不會再指派新的執行緒給該執^單=集 …母一個執行單元内部亦有一個執行 以管理或標註每-個執行緒是在使时(或了 可用的,就這一點而言,於一者於 丁中)或疋 執行解嶺細的功能時制】=著色器正 止幾何著色器舆像素著色器在此二6可以防 19 200813884 第五A圖說明具有前述圖形處理器202及計算核心 204特徵的執行單元420a,其包含内嵌有解碼系統200的 執行單元資料路徑512,具體來說,第五A圖是一執行單 το 420a的方塊圖,於一實施例中,其包含指令快取記憶體 控制器504、與指令快取記憶體控制器5〇4連接的執行緒 4工希J 506緩衝态508、共用暫存器槽案(comin〇n registerί任=:Γ: Execution unit 420 may permit two two-in-ones to use the bit 70 flag or other mechanism to identify the content of its content 2 before the task begins, from the collection control unit to write two?: The information can be shader type, input/output constant = ^ 420 . 21 ^ command. In the shell, each thread executes a similar set of data-driven buttons according to the program counter. (eg, input _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ The appropriate thread, for example, the lion assignment - the thread to the execution unit set 412 == 420 within the empty thread position, when the thread has started other components or modules (based on The shader type) is placed in the shared scratch buffer. Typically, the graphics processor 202 uses programmable vertex, geometry, and pixel 17 200813884 buffers, and these components are no longer considered to have different designs and Each component of the command group has its own merits and functions, and each of these components is executed or operated. Instead, the execution order of the command group is 70 42Ga, 42Gb, ... 42Gn, and the command group is executed in addition to the order 70 42 ( ^ (This execution unit contains the decoding system 2, which has additional functions), the design and structure of each execution 420 for the program operation are the same, in one embodiment, each one Single 兀42〇 can enter multiple thread operations, when the vertices Color shaders, geometry shaders, pixel shaders, etc., produce different color shader tasks that are sent to individual execution units 420 for execution. In one implementation, the decoding system can use a vertex shader, The other execution units are the same as, for example, the execution unit 42A uses a decoding system, 乂疋, he does not have it (as in side b of the fourth figure) because the code system 200 manages one or more corresponding The internal buffer % is obtained by the connection port and the execution unit input from the memory bank = element 406. When an individual task is generated early, the EU collective control unit 2 〇 6 tasks are given to the non-row unit. The job control thread spear ^ ^ collection control unit finely manages the release of the relevant thread, two spears: point:: 'EU collection control unit is responsible for assigning vertex shader, geometric color and pixel shader tasks to the execution unit The thread δ has recorded the relevant tasks and threads. Specifically, Eu: the implementation of all the execution unit-like threads and memory = not much explanation), the EU collection control unit is assigned to where Tasks use, know which one of the threads of the 181813884 to release, know how much shared register file memory (register file memory register), know each execution = -: available space. So much, if a task has been assigned to an implementation of a 420a 'EU collection control unit 2〇6 will this thread as in, then all the shared scratchpad broadcast memory minus each 'green' The number of the scratchpad file body (f00tpring) is j. The shader, geometry shader, and pixel shader are executed. A shaded pure segment can have a size of _body. For example, the thread can be used. The memory file temporary storage 2 color filter thread can only require 5 registers. When a thread completes its assigned work, the running unit 420 will send a signal to Eu ^ _仃, the person's control set list... will update the resource table 2 = use, the line share the register Between the number of cases, when all the threads are busy or n: the file memory has been allocated (or the reserved scratchpad space = accommodate additional threads), then the execution unit 420 It is already, the system: the control unit 206 will not assign a new thread to the executable = set ... the mother also has an execution inside the execution unit to manage or label each thread is in time (or Available, in this regard, in one of the Dingzhong) or 疋 解 细 细 的 】 = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = The figure illustrates an execution unit 420a having the aforementioned graphics processor 202 and computing core 204 features, which includes an execution unit data path 512 embedded with a decoding system 200. Specifically, the fifth A diagram is a block diagram of an execution single το 420a. In an embodiment, Containing instruction cache controller 504, connected to the instruction cache memory controller 5〇4 the thread J 506 4 ENGINEERING Greek state buffer 508, a common groove pattern register (register comin〇n
file ’ CRF)51〇、與執行緒控制器5〇6及缓衝器5〇8及共用 暫存裔檔案510連接之執行單元資料路徑(Εϋ data patti, EUDP)512、執行單元資料路徑先進先出、緩衝器(行如β如 t FIFO)514、述部暫存器槽案register , )516 、、屯里暫存态槽案(scalar register file,SRF)518、 貝料輸出控制器520以及執行緒任務介面524,如前所述, 執行單兀420從執行單元輸入4〇2接收輸入,然後提供輸 出給執行單元輸出404。 ^執行,控制器506提供整個執行單元42〇a的控制功 能’包括管理每一個執行緒及判斷功能,例如決定如何執 行t執行緒,EUDP 512包含解碼系統200,可進行各種的 計算,包含像是浮點運算計算邏輯單元㈣hmetie _ umt ’ ALU)、移位邏輯功能等邏輯電路。 口…1料輸出控制器52G可將完成之資料移至某些與執行 接=例如EU集合控制單元2°6的 狄二、 ' —寫回單元308等等,EUSP512傳送「任 Ί i制為520包含儲存部分,以儲存完成的任務(如 20 200813884 2::,分選擇任務,接著根據著 :: 疋的暫存器位置,從丑用蕲六。0认也 μ谷所才日 資料項目,梦德脾心暫存“案510讀出所有的輸出 ' ^目’然後將讀送至執行單元輪出撕。 別符給=!=24^㈣42Ga完成之任務識 栌制置-/、σ拴制 ’任務識別符會通知eu集合 “住:06有一特定執行單元内有執行緒資源,可指派 新的任務給該執行單元(如4施)。 ㈠曰派 實施例中,緩衝器爾(如常數緩衝器)可以分 ^個區塊,每—個區塊有16個128位元水平向量常數 的位置,菩多哭枯田、咬卜卜 J里书数 位詈: 鼻元與一索引存取一常數緩衝器 ^^ 以是包含%位元或接近32位元不具 正負旒的整數常數的暫存器。 7决取。己|^體控制器5〇4是執行緒控制器鄕的介 有執行緒控制器讀取請求(如從指令記憶體 换订著色器碼),指令快取記憶體控制器5〇4會查找 仏戴表(树出),進行擊中/不中⑽/miss)測試,舉個例 子’如果請求的指令位於指令快取記憶體控制器5〇4的快 取此體中則表示擊中,如果所欲請求的指令將從Μ快 取記憶體408或記憶體⑽提取則表示不巾,如果擊中, 而同B心又有攸執行單元輸入4〇2發出的請求,則指令快取 記憶體控制器504即可同意請求,這是因為指令快取記憶 ,控制器504的指令快取記憶體只有一個讀寫埠,而執行 單7G輸入4〇2具有最高之優先權;相反地,如果不中,而 21 200813884 L2快取記憶體408内有可取代的區塊並有空間存在EUm> FIFO 514 ’則指令快取記憶體控制器504可同意請求。於 一實施例中,指令快取記憶體控制器504的快取記憶體包 含32組,每一組有4個區塊,每一個區塊帶有2位元狀態 訊號,可代表三種狀態,分別是無效、載入、或有效狀態, 在區塊載入L2 f料之前,區塊是「無效」狀態,當等候 L2資料日守,疋「載入」狀態,當完全載入L2資料時,則 成為「有效」狀態。File ' CRF ) 51〇, the execution unit data path (Εϋ data patti, EUDP) 512 connected to the thread controller 5〇6 and the buffer 5〇8 and the shared temporary file 510, the execution unit data path advanced first Output, buffer (such as β such as t FIFO) 514, the description of the register register, 516, scalar register file (SRF) 518, the shell output controller 520 and The thread task interface 524, as previously described, executes an input unit 420 that receives input from the execution unit input 4〇2 and then provides an output to the execution unit output 404. Execution, the controller 506 provides the control function of the entire execution unit 42A, including managing each thread and decision function, for example, determining how to execute the t thread, and the EUDP 512 includes the decoding system 200, which can perform various calculations, including images. It is a logic circuit such as floating-point arithmetic calculation logic unit (4) hmetie _ umt ' ALU), shift logic function. The port ... output controller 52G can move the completed data to some execution and execution = for example, the EU set control unit 2 ° 6 Di Di, ' - write back unit 308, etc., EUSP512 transmits "Ren Ί i system 520 contains the storage part to store the completed tasks (such as 20 200813884 2::, select the task, and then according to:: 疋 暂 位置 , , , , 从 丑 丑 蕲 。 。 μ μ μ μ μ μ μ μ μ , Mengde temperament temporary storage "case 510 read all the output '^目' and then send the reading to the execution unit round and tear. Do not give ==== 24^(4) 42Ga completed task identification system -/, σ The 'task identifier' will inform the eu collection "live: 06" has a specific execution unit with a thread resource, and can assign a new task to the execution unit (such as 4). (1) In the embodiment, the buffer ( Such as the constant buffer) can be divided into blocks, each block has 16 128-bit horizontal vector constant positions, Bodhi crying the field, biting the Buji J-book number 詈: nose element and an index access A constant buffer ^^ is an integer constant containing % bits or nearly 32 bits without positive and negative 旒7 取. 己|^ Body controller 5〇4 is the thread controller 鄕 interface with a thread controller read request (such as from the instruction memory to change the shader code), instruction cache memory The controller 5〇4 will look up the 仏 wearing table (tree out) and perform the hit/miss (10)/miss) test, for example, if the requested instruction is located in the cache of the instruction cache controller 5〇4 In the body, it indicates a hit. If the command to be requested is to be extracted from the cache memory 408 or the memory (10), it means no wipe, if it is hit, and the same B heart has the execution unit input 4〇2. The request, the instruction cache controller 504 can agree to the request, because the instruction cache memory, the controller 504 instruction cache memory has only one read/write port, and the execution single 7G input 4〇2 has the highest value. Priority; conversely, if not, and 21 200813884 L2 cache memory 408 has a replaceable block and there is space EUm> FIFO 514 'the instruction cache controller 504 can agree to the request. In an embodiment, the cache memory of the instruction cache controller 504 is cached. Contains 32 groups, each group has 4 blocks, each block has a 2-bit status signal, which can represent three states, respectively, invalid, loaded, or valid state, before the block is loaded with L2 f material The block is in the "invalid" state. When it waits for the L2 data to be in the "load" state, it becomes "active" when the L2 data is completely loaded.
透過fUDP路徑512可對述部暫存器槽案516進行讀 寫’執行單元輸人4G2做為進入資料與執行單it 42〇a的介 面,於一實施例中’執行單元輸人402包含一 8項目先進 先出緩衝器以緩衝進入資料,執行單元輸入術亦可將資 料达至指令快取記憶體控制器5〇4的指令快取記憶體及常 數緩衝器。爾’執行單元輸人搬也可保留著色器内容。 執行早兀輸出姻做為將輸出資料從執行 控制單元寫、L2快取記憶體彻、及寫回 早7G 、"面,於一實施例中,執行單元輸出4〇4包含 :4項目絲先崎_,肋接收仲贿求,並緩衝輸 ίί二ί合控制單元2%的資料,執行單元輸出404包 =種功m讀裁齡絲記龍讀取 出寫入請求、EUDP讀/寫請求。 、科輪 共用暫存器檔案510用於儲存輸入 資料’於-實施例中,共用暫存器檀㈣心: ㈣的m X 128位元暫存器播案及一讀一寫和二 22 200813884 埠,一項一寫埠係供EUDP 512使用,用於指令執行啟動 的讀寫存取,偶執行緒共享記憶頁〇、2、4、6,奇執行緒 則共享圮憶頁1、3、5、7,執行緒控制器5〇6配對不同執 行緒的指令,並確認共用暫存器檔案的記憶體沒有讀或寫 备己憶頁衝突。 讀寫埠則供執行單元輸入402及資料輸出控制器52〇 使用,以載入初始執行緒輸入資料以及將最終執行緒輸出 寫至EU集合控制單元資料緩衝器及L2快取記憶體4〇8或 其他模組,執行單元輸入402及執行單元輸出4〇4共享一 讀寫I/O #,於一實施例中,寫入比讀出具有更高的優先 權,512位元輸入資料進入4個不同的記憶頁,以避免將 資料載入共用暫存器檔案510時發生衝突,2位元通道索 引資料與512位元對齊基準位址(aligned base address)— 起通過以指定輸入資料的開始記憶頁,舉個例子,如果開 始通道索引為卜則記憶頁1載入從最低有效位元(least significant bit,LSB)起算的第一個 128 位元,下一個 128 位70則載入記憶頁2,以此類推,假設執行緒基準記憶頁 補償為0,最後一個128位元則載入記憶頁〇,請注意執行 緒ID的兩個最低有效位元用於產生一記憶頁補償,以隨機 排列每一個執行緒的開始記憶頁位置。The facsimile path 512 can be read and written to the staging register slot 516. The execution unit input 4G2 is used as the interface for entering the data and executing the single unit 42a. In an embodiment, the execution unit input 402 includes a The 8 item first-in first-out buffer buffers the incoming data, and the execution unit input operation can also reach the instruction cache memory and constant buffer of the instruction cache memory controller 5〇4. The implementation of the unit's execution unit can also retain the shader content. Performing the early output as the output data from the execution control unit, the L2 cache memory, and the write back to the early 7G, " face, in one embodiment, the execution unit output 4〇4 contains: 4 item silk First Saki _, rib receives the bribe request, and buffers the data of the control unit 2%, the execution unit output 404 package = seed m read the age of silk record dragon read out the write request, EUDP read / write request . The keel shared register file 510 is used to store the input data 'in the embodiment, the shared register TAN (four) heart: (4) the m X 128 bit temporary register broadcast case and one read and write and two 22 200813884埠, a write-write system is used for EUDP 512, for read and write accesses of instruction execution, even threads share memory pages 2、, 2, 4, 6, and odd threads share 圮 页 1, 3, 5, 7, the thread controller 5〇6 pairs the instructions of different threads, and confirms that the memory of the shared scratchpad file does not read or write the memory page conflict. The read/write buffer is used by the execution unit input 402 and the data output controller 52 to load the initial thread input data and write the final thread output to the EU collection control unit data buffer and the L2 cache memory 4〇8. Or other modules, the execution unit input 402 and the execution unit output 4〇4 share a read/write I/O #. In one embodiment, the write has a higher priority than the readout, and the 512-bit input data enters 4 A different memory page to avoid conflicts when loading data into the shared scratchpad file 510. The 2-bit channel index data is aligned with the 512-bit aligned base address to specify the beginning of the input data. Memory page, for example, if the channel index is started, memory page 1 loads the first 128 bits from the least significant bit (LSB), and the next 128 bits 70 is loaded into the memory page. 2, and so on, assuming that the thread reference memory page compensation is 0, the last 128 bits are loaded into the memory page, please note that the two least significant bits of the thread ID are used to generate a memory page compensation, to random row Each thread of memory start page location.
CRF暫存器索引及執行緒Π)可用於建立一獨一無二 的邏輯位址’以標籤配對(tag matching)共用暫存器構案51〇 的讀寫資料,舉個例子,位址可以排成128位元,就跟共 用暫存器檔案記憶頁的寬度-樣,ϋ由結合8位元的CRF 23 200813884 暫存器索引以及5仿 13位元位址,每—個二:丁緒1D,可以建立獨-無二的 有兩個犯位元項目(字^行有—個標籤’每一行則 中,並將CRF旁引的母一字元儲存於4個記憶頁 記憶頁補償,個最低有效位元力认目前執行緒的 只補1貝从建立記憶頁選擇。 器檔=對二法可讓不同執行緒的暫存器共享共用暫存 丘用效利用記憶體,EU集合控制單元206記錄CRF register index and thread Π) can be used to create a unique logical address 'tag matching (tag matching) shared register structure 51 〇 read and write data, for example, the address can be arranged into 128 The bit, just like the width of the shared scratchpad file memory page, is composed of a combination of 8-bit CRF 23 200813884 register index and 5 imitation 13-bit address, each - 2: Ding Xu 1D, can There are two erroneous bit items (words and lines with one label) in each line, and the parent word of the CRF side is stored in 4 memory page memory page compensation, the least effective The bit force recognizes that the current thread only fills 1 Bay from the establishment of the memory page selection. The file file = the second method allows the different registers of the scratchpad to share the shared temporary memory utilization memory, and the EU collection control unit 206 records
Γ 420子㈣案51G的記髓使用程度,確鋪程執行單 兀420a的新任務時有足夠的空間。 私-目兩執行緒的目標CRF索引佔全部crf暫存器 的大^ ’在執^緒控制器5G6著手進行執行緒及著色器^ ^之4,輸入貢料就應該存放於共用暫存器檔案510中, 田執行、、者執行結束,資料輸出控制器52〇從共用暫存器構 案510讀取輸出資料。 河述執,單元之f施例具有时解碼祕的 EUDP 512,第五B圖說明一 EUDP 512之實施例,EUDP 512 包3暫存斋檔案526、多工器528、向量浮點(FP)單元532、 向里整數异術邏輯(ALU)單元534、特殊目的單元5%、多 工器538、暫存器檔案54〇、以及解碼系統2〇〇,解碼系統 2〇〇包含一個或多個CABAC單元530,可以解碼一個或多 個串流,舉個例子,單一 CABAC單元53〇可以解碼單一 串流,兩個CABAC單元530 (如虛線所示,但為簡潔之 故未繪出其連接關係)可以同時解碼兩個串流等等,為了 清楚說明,之後的敘述僅針對使用單一 CABAC單元53〇 24 200813884 ,解碼祕200之操作’制财推迪超過—紅 單元。 如圖所示,EUDP 512白人# η 已3對應於CABAC解碼單元 530、向量浮點單元532、向量彻單元別、特殊目:單 元536的一些平行資料路徑,每一個單元均可根據接收到 的指令執行對應的運算,暫存器擋案526接收運算元(標 不為SRC1及SRC2),於—實施例中,暫存器 Φ =五A〆圖所示之共用暫存器檔案51〇、述部暫存器播案 5i6、及/或純量暫存器樓案518,請注意於某些實施例 I,亦可使用更多的運算元運算(功能)訊號線% ^元53㈣6接收運算訊號的手段’目前訊號線⑽連 时至多工☆ 528 ’可傳送編碼成指令之當前值,供每一個 p 530〜536進行小整數值的整數運算,指令解碼器(未 提,運算元、運算(功能)訊號、以及目前訊號, 念枓路徑(可以包含寫回階段)末端的多工器538選擇正 路㈣輸出結果’送至暫存㈣案54(), 案540包令一日栌^从 π 飞什时才田 ^目“件’可以是暫存器播案526或其他暫 4、注意’於一實施例中’當來源及目標暫存器包含 =同70件’指令的位元具有來源及目標s件選擇,供多工 為處理來自/送至適當暫存ϋ檔案的資料。 因此’執行單元42〇a可以視為一多階管線(如4階管 :’士,有4個算術邏輯單元)’ CABAC解碼運算於4個 :二中發生’需要延遲好讓CABAC解碼執行緒動作, 牛歹,子,當位元流缓衝器發生向下溢位(underfl〇w)、等 25 200813884 候初始化内容記鋪、隸將位福载人fif〇緩衝器及 sREG暫存ϋ⑽後轉)、及/或處理咖已超過預 檻時間等,可以在執行階段加入延遲。 如前所述’與某些實施例中,解碼系統2〇〇利用單一 執行單元420a同時解碼_位元&,舉個例子,根據_個 擴充指令組^解碼系統可以使用兩個資料路徑(如新增另 CABAC單元530)同時進行兩個串流的解碼,當然也 可解碼較多或較少的串流(那麼就會使賴多或較少的、資 料路徑)’當牵涉到多個串流,某些解碼系統綱並不限二 同時解碼,另外,在某些實施例中’單一 cabac單元娜 可以執行多重同時串流解碼。 於貝施例中,當解碼系統200使用兩個資料路徑、 兩個執行緒便可關時運行,舉個例子,在兩串流解ς實 知例中’限制執行緒的數量為兩個,第一執行緒(如執行 緒〇)指派給解碼系統200的第一記憶頁(即CABAc單 凡=30)’第二執行緒(如執行緒丨)則指派給解碼系統加〇 的第一s己憶頁(即第五B圖的虛線CABAC單元),於某些 實施例中,兩個或多個執行緒可以運行於單一記憶頁,另 外:雖然此處顯示解碼系統2〇〇是内嵌於£11〇1>512,亦可 包含其他的元件,像是EU集合控制單元2〇6内的邏輯電 路。 現已說明執行單元420a、EUDP512、以及CABAC單 元53〇的某些實施例,下面簡單解釋CABAC解碼,然後 说明解碼系統200的一些實施例,通常H.264 CABAC解 26 200813884 碼私序可以包括解析弟一$吾法成分的編碼位元流、初如化 第一語法成分的内容變數及解碼f丨擎、以及二進位化 (binarization),然後,針對每一個二進位解碼,程序還包括 獲得一内容模型(content model)以及二進位解碼各组法成 分’直到獲得有意義的字碼(codeword)配對,更進一+解 釋,解碼系統200解碼語法成分,而每一語法成分可以代 表量化係數、動作向量、及/或預測模式、或其他有關巨 圖塊(macroblock)的參數’用以表示影像或視訊的特殊圖場 (field)或圖框(frame),每一個語法成分可以包含一系列的一 進位符號或二進位值,騎-個二進位符齡被解碼成^ 或1值’解碼系統2〇〇根據輸入二進位符號的發生機率控 制輸出位元長度。 工 已知當某些符號(稱為主要符號)比其他符號更容易 發生時,CABAC編碼器可提供高效率編碼方法,這些主 要符號可以較小位元/符號比進行編碼,編碼器持續⑽ 進入貧料的頻率統計資料,適#地調整編碼演 内容模型^有較高可能性的二進位符號稱為高可能=號 :=symbGl ’ M⑼’㈣他符號則為低可 ΪΓ _: ίsymbo1 ’LPS) ’二進位符號與其内容模型 連結,母―⑽模型對應於LPS機率以及—Mps值。 為了決定每-個二進位符號,解碼系統扇 或接收一對應範圍、補償及肉突y 、 號種類和相鄰圖塊(如目前、付 巨圖塊)決疋的内谷而從複數個可能的環境模型中選擇, 27 200813884 内容辨識符(context identifier)可經由内容模型決定,從而 得到MPS值以及驗解碼程式_碼脾之目前狀態,範 圍則表不一個區間,每經過一次二進位解碼就會縮小一次 範圍。 、Γ 420 (4) The degree of use of the 51G in the case of the case, it is sufficient to have enough space for the new task of the 420a. The target CRF index of the private-eye two threads accounts for the large number of all crf registers. In the implementation of the controller 5G6, the thread and the colorizer ^^4 are started. The input tribute should be stored in the shared register. In the file 510, the field execution ends and the execution ends, and the data output controller 52 reads the output data from the shared register structure 510. The implementation of the unit, the unit of the example has the time to decode the secret EUDP 512, the fifth B diagram illustrates an EUDP 512 embodiment, EUDP 512 package 3 temporary storage file 526, multiplexer 528, vector floating point (FP) Unit 532, inward integer arithmetic logic (ALU) unit 534, special purpose unit 5%, multiplexer 538, scratchpad file 54〇, and decoding system 2〇〇, decoding system 2〇〇 includes one or more The CABAC unit 530 can decode one or more streams. For example, a single CABAC unit 53 can decode a single stream, and two CABAC units 530 (shown by dashed lines, but the connection is not drawn for brevity). The two streams can be decoded at the same time, etc. For the sake of clarity, the following description is only for the use of a single CABAC unit 53〇24 200813884, and the operation of the decoding secret 200 is more than the red unit. As shown, EUDP 512 white #η has 3 corresponding to some parallel data paths of CABAC decoding unit 530, vector floating point unit 532, vector unit, special unit: unit 536, each unit can be received according to The instruction executes the corresponding operation, and the register file 526 receives the operation element (not referred to as SRC1 and SRC2). In the embodiment, the register Φ = the common register file 51 shown in FIG. The temporary register broadcast 5i6, and / or the scalar register 518, please note that in some embodiments I, you can also use more operand operations (function) signal line % ^ yuan 53 (four) 6 receive operation The means of signal 'current signal line (10) continuous time to multiplex ☆ 528 ' can be transmitted into the current value of the instruction, for each p 530~536 small integer value integer operation, instruction decoder (not mentioned, operation element, operation (Function) signal, and the current signal, the multiplexer 538 at the end of the chanting path (which can include the write-back phase) selects the positive path (4) The output result is sent to the temporary storage (4) case 54 (), the case 540 package order one day 栌 ^ from π 什 才 才 才 才 ^ ^ ^ ^ ^ ^ " Broadcasting 526 or other temporary 4, note that in the embodiment, when the source and destination registers contain = the same 70 pieces of instructions, the bits have source and target s pieces for multiplex processing to process from/send To the appropriate temporary storage of the file information. Therefore 'execution unit 42〇a can be regarded as a multi-stage pipeline (such as 4th-order tube: 'senior, there are 4 arithmetic logic units)' CABAC decoding operation in 4: two occur 'Need a delay to let CABAC decode the thread action, burdock, child, when the bit stream buffer occurs downward overflow (underfl〇w), etc. 25 200813884 waiting for the initialization of the content, the singer The buffer and sREG temporary storage (10) are transferred, and/or the processing coffee has exceeded the expected time, etc., and the delay can be added during the execution phase. As described above, in some embodiments, the decoding system 2 utilizes a single Execution unit 420a simultaneously decodes _bits & for example, according to _ a set of expansion instructions, the decoding system can use two data paths (such as adding another CABAC unit 530) to simultaneously decode two streams, of course Can decode more or less streams (then "Multiple or less, data path" 'When multiple streams are involved, some decoding systems are not limited to simultaneous decoding. In addition, in some embodiments, 'single cabac unit can perform multiple simultaneous streams. Decoding. In the example of the Bayesian, when the decoding system 200 uses two data paths and two threads to run off, for example, in the case of two streams of solutions, the number of restricted threads is two. The first thread (such as the thread) is assigned to the first memory page of the decoding system 200 (ie, CABAc single = 30). The second thread (such as the thread) is assigned to the decoding system. A suffix page (ie, the dashed CABAC unit of Figure 5B), in some embodiments, two or more threads can run on a single memory page, additionally: although the decoding system 2 is shown here Embedded in £11〇1>512, it can also contain other components, such as the logic in the EU collective control unit 2〇6. Some embodiments of execution unit 420a, EUDP 512, and CABAC unit 53A have been described. The following briefly explains CABAC decoding, and then illustrates some embodiments of decoding system 200, typically H.264 CABAC solution 26 200813884 code private order may include parsing The coded bit stream of the $1 method component, the content variable of the first syntax component and the decoding, and the binarization. Then, for each binary decoding, the program further includes obtaining one. The content model and the binary decoding of each set of component components 'until a meaningful codeword pairing is obtained, further + interpretation, the decoding system 200 decodes the syntax components, and each syntax component can represent a quantized coefficient, an action vector, And/or prediction mode, or other parameters relating to macroblocks, which are used to represent a particular field or frame of an image or video, each of which may contain a series of carry symbols. Or binary value, the ride-two binary age is decoded into ^ or 1 value 'decoding system 2〇〇 according to the probability of occurrence of the input binary symbol Output bit length. It is known that CABAC encoders provide highly efficient coding methods when certain symbols (called primary symbols) are more likely to occur than other symbols. These primary symbols can be encoded with a smaller bit/symbol ratio, and the encoder continues to enter (10). Frequency statistics of poor materials, suitable for adjusting the coded content model ^ There is a higher probability of the binary symbol called high probability = number: = symbGl ' M (9) ' (four) his symbol is low _ _: ίsymbo1 'LPS The 'binary symbol is linked to its content model, and the parent-(10) model corresponds to the LPS probability and the -Mps value. In order to determine each binary symbol, the decoding system fan or receives a corresponding range, compensation and fleshy y, number category and adjacent tiles (such as current, heavy block) In the choice of environment model, 27 200813884 The context identifier can be determined by the content model, so as to obtain the MPS value and the current state of the decoding program _ code spleen, the range is not an interval, every time the binary decoding is performed Will narrow the scope once. ,
區間分為兩個子範圍,分別對應Mps值和Lps機率, ,範圍及已知内容模型所指定的LPS機率相乘可得Lps子 範,’將範_去LPS子範_可得MPS子範圍,補償 則^決謂碼二進位值的鮮,通f是從編碼位元流中取 出财9位元進行初始化,對於—已知的二進位符號解碼及 内谷模型,如果補償小於MPS子範圍,則二進位值為鹏 值,下-次解碼所使用的範圍便為Mps子範圍,相反地, =進位,則為LPS,將MPS值的反值放在相關的内容模型 *同WF個㈣便設為LPS子範圍,解碼程序的結果 :.、、:連串的二進位值,將用於判斷此串值是否符合有意義 概要敘述解碼系統綱的運算與cabac解碼的關 係’下列敘述提出於CABAC解碼程序_容巾之解碼系 ^的。種7〇件’可將符合實際應用的各種變形列入考 f # ^°下騎使用的許多術語是出自 所、:為了㈣之故不再贅述,除非是有助於瞭解 所述^同程序及/或元件,才會再做進—步之說明。 A圖至第六F圖是說明解碼系統及相關元件 之/圖\其中緣出之解碼系統具有單- CABAC單 -(於第八A圖至第六F圖,所使用之cabac單元 28 200813884 530二解碼系統2⑻互換),因此於實施例中,解碼系統 200可解,單_位元流,同樣的原則可應用至具有多個 CA^AC單元的解碼系統2〇〇,可同時解碼多個(如兩個) 瓜簡單地說,第六A圖是解碼系統200的選擇元件, 第/、=圖則為第六A圖選擇元件加上其他元件的功能方塊 第六C圖則為說明解碼系統細提供的串流緩衝器功 此之方塊圖,第六D圖與第圖是說明解碼系統2〇〇的 内谷圯fe體功能之方塊圖,而第六£圖是說明用於解碼一 巨圖塊的例tf機制之方塊圖,雖然下列敘述是有關巨圖塊 解碼的内容,但是此原則可應用至各種圖塊解碼。 明參閱弟六A圖’解碼系統200包含CABAC單元 530 ’ CABAC單元530具有CABAC邏輯模組66〇以及記 憶體模組650,於一實施例中,CABAC邏輯模組66〇包含 三個模組,分別是CABAC單元530内的二進位化(bind) 模組620、取得内容(GCTX)模組622、以及二進位算術解 碼(BARD)引擎624,BARD引擎624更包含狀態索引 (pStateldx)暫存器 602、MPS 值(valMPS)暫存器 604、碼長 範圍(codlRange)暫存器606、以及碼長補償(c〇dl〇ffset^ 存器608,CABAC單元530的記憶體模組65〇包括巨圖塊 相鄰内容(mbNeighCtx)記憶體610 (亦稱為内容記憶體陣 例(context memory array))、區域暫存器612、總暫存器 614、以及移位暫存器(SREG)•串流緩衝器/直接記憶體存 取(DMA)引擎618 (亦稱為DMA引擎模組,將於第六c 圖中做進一步之說明),另外還有未繪出之暫存器,於—實 29 200813884 施例中,mbNeighCtx記憶體610包含如第六D圖之陣列 結構,之後會有更進一步之說明,記憶體模組650還包含 二進位字串暫存器616。 CABAC單元530與執行單元420a的介面包括目標匯 流排628、兩個來源匯流排(SRC1 632和SRC2 630)、命 令及執行緒資訊匯流排634、以及延遲/重置匯流排636, 目標匯流排628上的資料可以直接或間接(如經由中間快 取記憶體、暫存器、緩衝器、或記憶體)傳送至圖形處理 翁單元114内部或外部的視訊處理單元,目標匯流排628上 的資料可以是微軟的DX API格式或其他格式,這些資料 包含係數、巨圖塊芩數、動作資訊、及/或IpCM取樣或 疋其他貧料,CABAC單元530還包括由位址匯流排638 和貧料匯流排640組成的記憶體介面,從位址匯流排638 得到位址後,便可以藉由從資料匯流排64〇得到的資料進 仃位兀流讀的存取,於—實施例中,資料匯流排64〇上 • ❾胃料可以包括未加密視訊流,其中包括各種訊號參數及 其他貝料與格式,於某些實施例中,可以使用載入—儲存 操作來存取位元流資料。 在開始說明CABAC單元53〇的各元件之前,簡單說 曹 M CABAC ^^it 420a ^ it 苇根據切片(slice)形式,驅動軟體⑽(第一圖)準備 CABAC著色态亚將其載入執行單元4施,該著 色器使用鮮指令組加上BIND指令、gctx指令、以及 RD才曰7可以進行位元流之解碼,因| cmAC單元 30 200813884 530使用的内容表(context table)可以根據切片種類改變, 所以每一切片均要載入,於一實施例中,在發出其他指令 箣’ CABAC著色器執行的第一個指令包含iNT—CTX和 INIT一ADE,這兩個指令使CABAC單元530開始解碼一 CABAC位元流,並將位元流從串流解碼點開始載入F正〇 緩衝為’稍後將說明這兩個指令。 關於解析位元流,從記憶體介面的資料匯流排64〇接 收位元流,然後由SREG串流緩衝器/DMA引擎618進行 秦 緩衝,切片資料解析階段提供位元流解碼,位元流(如nal 位疋流)包括一張或多張圖片,將其切割成圖檔頭(header) 及許多切片(slice) ’ 一張切片通常包含一系列的巨圖塊,於 一實施例中,外部程序(即CABAC單元53〇外部)解析 NAL位元流、解碼切片檔頭、傳送指向該切片資料(如切 片開始處)的指標’硬體(加上軟體)可以從圖形解析圧%^ 位^流m實施例中,CABAC編碼僅出現於切 片資料與巨圖塊,通常,驅動軟體128從切片資料處理位 元流,S為這是應餘式及API提供的功能,指向次 料位置的指標傳遞還牽涉到切片資料的第—位元組位Z (如RBSPbyeAddress)和指出位元流開始或標頭位置士 ^ SREGpt〇的位元補償指標(如一個位元或多個位元),^ * 元流的初始化將於稍後解釋,於某些實施例中,可以利用 ^處理器(如第-圖的中央處理單元126)處理外部程 提供圖片解碼以及切片標頭解碼,與某些實施例中 解碼系統20G的可程式特性,可以於任何階段進行解碼: 200813884 凊爹閱第六C圖,其為CABAC單元530的SREG串 流缓衝1§、/^]\^㈣618賴擇元件部分及其他元件之 方塊圖,其包含·元暫存ϋ 662及664,分別從匯流排 632及630接收SRC1與SRC2值,再傳遞至暫存器咖 及668 ^其他元件則如有關第六A圖之說明,除非說明需 要,為簡潔之故不再贅述,SREG串流緩衝器/DMA引擎 618包含内部位元流緩衝器618七,於一實施例中可為 BigEndian格式之32位元暫存器及_ 128位元暫存器: 驅動軟體發it{的初始化指令於開始時蚊SREG串流緩衝 器/DMA引擎618, -旦啟動,便自動管理戲心串流緩 衝态/DMA引擎618的内部緩衝器618b,SREG串流緩衝 二/DMA引擎618保留待解析位元的位置,於一實施例 中’ SREG串流緩衝器/DMA引擎618使用兩個暫存器, 一個快速32位元正反器與一個較慢512或1〇24位元記憶 體,位元流會使用位元,移位暫存器618a以位元進行操 作,而位元流緩衝器618b以位元組進行操作,可以節省能 源通系移位暫存裔618a運算的指令會使用少許位元(如 \〜3位元),當移位暫存器618a使用超過一位元組的資料, 貧料(位元组片段)將從位元流緩衝器618b傳送給移位暫 存器618a,然後緩衝器指標會減少傳送的位元組數量,當 SREG串流緩衝器/DMA引擎618的DMA引擎偵測到使 用256位元或更多位元時,便從記憶體提取256位元填滿 位兀流緩衝器618b,如此CABAC單元530實行了 一個簡 單的循環缓衝斋(256位元片段X 4),以追蹤位元流緩衝 32 200813884 恭618b並進行填充,於某些實施例中可以使用單一緩衝 器’不過一個循環緩衝器需要更複雜的指標計算來跟上記 憶體的速度。 利用初始化指令達成與内部緩衝器618b互動,稱為 画T—BSTR指令,於一實施例中是由驅動軟體128發出 INIT一 BSTR指令以及其他之後說明的指令,如果已知位元 流位置的位元組位址及位元補償,INIT一BSTR指令將資料 載入内部位元流緩衝器618b,並開始管理程序,每一次呼 秦 叫處理切片資料均會發出下列格式之指令·· INIT—BSTR offset, RBSPbyteAddress 這個指令用於將資料載入SREG串流緩衝器/DMA 引擎618的内部緩衝器618b,SRC2暫存器664提供位元 組位址(RBSPbyteAddress),而SRC1暫存器662提供位元 補償,如此,可以使用下列通用之指令格式: INIT一BSTR SRC2,SRC1, 其中,這個指令中的SRC1以及SRC2及其他訊號是對應 ⑩ 内部暫存器662及664内的值,但是不限於這些暫存器, 於一實施例中,使用256位元排列之記憶體提取來存取位 元流資料,並將其寫入緩衝器暫存器並傳送至SREG串流 , 緩衝器/DMA引擎618的32位元移位暫存器618a,於一 v 實施例中,在這些暫存器或緩衝器進行運算之前,位元流 緩衝器618b内的資料是以位元組方式排列,此資料排列可 藉由排列指令實施,亦稱之為ABST指令,ABST指令會 排列位元流缓衝器618b内的資料,在解碼過程中,排列位 33 200813884 元(如填充位元)最後將被丟棄。 當移位暫存器618a使用資料,内部緩衝器618b便會 填充資料,換句話說,SREG串流緩衝器/DMA引擎618 的内部緩衝器618b類似以3為模(m〇dul〇)之循環緩衝器, 並輸入SREG串流緩衝器/DMA引擎618的32位元暫存 器618a,CABAC邏輯模組660可以使用rEAD指令從移 位暫存器618a讀取資料,READ指令之格式如下: READ DST, SRC1, 其中DST對應於一輸出或目標暫存器,於一實施例中, SRC1暫存器662包含不具正負f虎的整數值n,經過仙仙 ,令,從移位暫存器618a獲得n位元,當從32位元暫存 器618a消耗了 256位元的資料(如解碼一個或多個語法成 分),自動開始提取動作以獲得另一個256位元的資料,將 其寫入内部緩衝器618b的暫存器,接著進人移位暫存器 618a供下一循環使用。 w 。於某些實施例中’如果對應於一符號解碼之移位暫存 器618a的資料已被使用了預定數量的位元或位元組,二 部緩衝器6勵沒有再接收到任何資料,則cabac邏輯模 組660可以經由延遲/重置匯流排_進行延遲,以便執 行其他的執賴(例如與CABAC解碼料無關之執 緒)’像是頂點著色器操作。 使用SREG串流緩衝器/DMA引擎618 # dma 可以減少所需賴_數量,關償記憶體賴(例如, 於某些圖形處理單元中,會到三百多週期),當使用了位元 34 200813884 流’可以請求流入排在後面的位元流資料,如果位元流資 料太少使得位元流緩衝器618b有向下溢位的風險(例如已 知讓訊號從CABAC單元530流至處理器管線的週期數), 可傳遞延遲信號給處理器管線,暫停操作,等候資料到達 位元流緩衝器618b。 蠢The interval is divided into two sub-ranges, which correspond to the Mps value and the Lps probability respectively. The range and the LPS probability specified by the known content model are multiplied to obtain the Lps sub-norm. 'Fill the _ to the LPS sub-norm to obtain the MPS sub-range The compensation is determined by the code binary value, and the f is the initialization of the 9-bit kernel from the coded bit stream. For the known binary symbol decoding and the inner valley model, if the compensation is less than the MPS sub-range , the binary value is the Peng value, the range used for the next-time decoding is the Mps sub-range, and conversely, the = carry, the LPS, the inverse of the MPS value is placed in the relevant content model * with WF (four) It is set to the LPS sub-range, and the result of the decoding process: ., , : a series of binary values, which will be used to determine whether the string value meets the meaningful summary description of the relationship between the decoding system and the cabac decoding. CABAC decoding program _ the towel decoding system ^. A kind of 7-pieces can be used to test various deformations in accordance with the actual application. ##° Many terms used in riding are from the place: for the sake of (4), it will not be repeated, unless it is helpful to understand the same procedure. And / or components, will be done again - step description. Figures A through 6F are diagrams illustrating the decoding system and related components/maps in which the decoding system has a single-CABAC single- (in the eighth to sixth F-pictures, the used cabac unit 28 200813884 530 The second decoding system 2 (8) is interchanged, so in the embodiment, the decoding system 200 can be solved, the single_bit stream, the same principle can be applied to the decoding system 2 with multiple CA^AC units, and multiple decodings can be simultaneously performed. (e.g., two) The melon is simply said, the sixth picture A is the selection component of the decoding system 200, the /, = diagram is the function block of the sixth A picture selection element plus other elements, and the sixth C picture is for decoding The block diagram of the stream buffer provided by the system is finely illustrated. The sixth figure and the figure are block diagrams illustrating the function of the internal system of the decoding system, and the sixth figure is for decoding. The block diagram of the example tf mechanism of the giant tile, although the following description is about the decoding of the giant tile, this principle can be applied to various tile decoding. The CABAC unit 530 has a CABAC logic module 66 and a memory module 650. In one embodiment, the CABAC logic module 66 includes three modules. They are a binary module 620, a get content (GCTX) module 622, and a binary arithmetic decoding (BARD) engine 624 in the CABAC unit 530, respectively. The BARD engine 624 further includes a state index (pStateldx) register. 602, MPS value (valMPS) register 604, code length range (codlRange) register 606, and code length compensation (c〇dl〇ffset 608, memory module 65 of CABAC unit 530 includes giant Tile adjacent content (mbNeighCtx) memory 610 (also known as context memory array), region register 612, total scratchpad 614, and shift register (SREG) string Stream buffer/direct memory access (DMA) engine 618 (also known as DMA engine module, which will be further explained in Figure 6), in addition to the unillustrated register, 29 200813884 In the example, the mbNeighCtx memory 610 includes an array structure as in the sixth D diagram, after which To further illustrate, the memory module 650 further includes a binary string register 616. The interface between the CABAC unit 530 and the execution unit 420a includes a target bus 628, two source buses (SRC1 632 and SRC2 630), Command and thread information bus 634, and delay/reset bus 636, the data on target bus 628 can be transmitted directly or indirectly (eg, via intermediate cache, scratchpad, buffer, or memory) To the video processing unit inside or outside the graphics processing unit 114, the data on the target bus 628 may be Microsoft's DX API format or other formats, including data, macro block parameters, motion information, and/or IpCM. For sampling or other poor materials, the CABAC unit 530 also includes a memory interface consisting of the address bus 638 and the lean bus 640. After the address is obtained from the address bus 638, the data can be accessed by the data bus 64. 〇 The obtained data is entered into the 兀 stream read access. In the embodiment, the data bus is 64 • • The stomach material may include an unencrypted video stream, including various signal parameters and Other materials and formats, in some embodiments, a load-store operation can be used to access the bitstream data. Before beginning to illustrate the various components of the CABAC unit 53, simply say Ca CABAC ^^it 420a ^ It 苇 according to the slice form, the driver software (10) (first picture) prepares the CABAC color state to load it into the execution unit 4, the shader uses the fresh instruction group plus the BIND instruction, the gctx instruction, and the RD. 7 The bit stream can be decoded, since the context table used by the |cmAC unit 30 200813884 530 can be changed according to the slice type, so each slice is loaded, and in one embodiment, other instructions are issued. The first instruction executed by the CABAC shader consists of iNT-CTX and INIT-ADE. These two instructions cause CABAC unit 530 to start decoding a CABAC bit stream and load the bit stream from the stream decoding point. The buffer is set to 'The two instructions will be explained later. Regarding the parsing of the bit stream, the bit stream is received from the data bus 64 of the memory interface, and then buffered by the SREG stream buffer/DMA engine 618, and the bit stream decoding stage provides bit stream decoding, bit stream ( Such as nal bit turbulence) includes one or more pictures, which are cut into headers and a number of slices. A slice usually contains a series of giant tiles, in one embodiment, external The program (ie, CABAC unit 53 〇 external) parses the NAL bit stream, decodes the slice header, and transmits the indicator pointing to the slice data (such as the beginning of the slice). The hardware (plus software) can be parsed from the graph 圧%^ bit^ In the embodiment of the stream m, the CABAC code only appears in the slice data and the giant tile. Generally, the driver software 128 processes the bit stream from the slice data, and S is the function provided by the remainder and the API, and the indicator pointing to the position of the secondary material. The pass also involves the first bit position Z of the slice data (such as RBSPbyeAddress) and the bit compensation indicator (such as a bit or multiple bits) indicating the start of the bit stream or the position of the header ^ SREGpt〇, ^ * Initialization of the meta stream As will be explained later, in some embodiments, an external process may be utilized to provide picture decoding and slice header decoding using a processor (such as central processing unit 126 of the first figure), and in some embodiments decoding system 20G The programmable feature can be decoded at any stage: 200813884 See the sixth C diagram, which is the SREG stream buffer of the CABAC unit 530. 1§, /^]\^(4) 618 depends on the component part and other components The figure includes the meta-storage ports 662 and 664, and receives the SRC1 and SRC2 values from the bus bars 632 and 630, respectively, and then transfers them to the scratchpad coffee and 668^ other components as described in the sixth diagram, unless stated otherwise. Needlessly, for the sake of brevity, the SREG stream buffer/DMA engine 618 includes an internal bit stream buffer 618, which in one embodiment can be a 32-bit register and a _128 bit in the BigEndian format. The scratchpad: The initialization command of the driver software sends it{ at the beginning of the mosquito SREG stream buffer/DMA engine 618, and when it is started, it automatically manages the internal buffer 708b of the trick stream buffer state/DMA engine 618, SREG Stream buffered two/DMA engine 618 retains the bits to be resolved The location, in one embodiment, 'SREG Stream Buffer/DMA Engine 618 uses two registers, a fast 32-bit flip-flop with a slower 512 or 1 〇 24-bit memory, bit stream Bits are used, shift register 618a operates in bits, and bit stream buffer 618b operates in bytes, which saves energy. The shift of the temporary register 618a operation uses a small number of bits. (e.g., \~3 bits), when the shift register 618a uses more than one tuple of data, the poor material (byte segments) will be transferred from the bit stream buffer 618b to the shift register 618a, The buffer indicator then reduces the number of bytes transferred. When the DMA engine of the SREG stream buffer/DMA engine 618 detects the use of 256 bits or more, it extracts 256 bits from the memory. The bit stream buffer 618b, such that the CABAC unit 530 implements a simple loop buffer (256 bit segments X4) to track the bit stream buffer 32 200813884 and fill it, in some embodiments Use a single buffer' but a circular buffer needs to be more complicated Standard calculation up to speed memorized body. The internal buffer 618b is interacted with by an initialization instruction, referred to as a T-BSTR instruction. In one embodiment, the INIT-BSTR instruction is issued by the driver software 128 and other instructions are described later, if the bit position of the bit stream is known. Tuple address and bit compensation, INIT-BSTR instruction loads the data into the internal bit stream buffer 618b, and starts the management program. Each time the call is processed, the slice data will be issued in the following format. · INIT-BSTR Offset, RBSPbyteAddress This instruction is used to load data into the internal buffer 618b of the SREG stream buffer/DMA engine 618, the SRC2 register 664 provides the byte address (RBSPbyteAddress), and the SRC1 register 662 provides the bit. Compensation, so, the following general instruction formats can be used: INIT-BSTR SRC2, SRC1, where SRC1 and SRC2 and other signals in this instruction are values corresponding to 10 internal registers 662 and 664, but are not limited to these In one embodiment, the 256-bit memory extraction is used to access the bitstream data and write it to the buffer register and to the SREG string. The 32-bit shift register 618a of the buffer/DMA engine 618, in an embodiment, before the operations of the registers or buffers, the data in the bitstream buffer 618b is a bit The group arrangement, the data arrangement can be implemented by the arrangement instruction, also known as the ABST instruction, the ABST instruction will arrange the data in the bit stream buffer 618b, and in the decoding process, the bit 33 200813884 (such as the padding bit) Yuan) will be discarded at the end. When the shift register 618a uses the data, the internal buffer 618b fills the data. In other words, the internal buffer 618b of the SREG stream buffer/DMA engine 618 is similar to a loop of 3 (m〇dul〇). The buffer is input to the 32-bit register 618a of the SREG stream buffer/DMA engine 618. The CABAC logic module 660 can read data from the shift register 618a using the rEAD instruction. The format of the READ command is as follows: READ DST, SRC1, where DST corresponds to an output or target register. In an embodiment, the SRC1 register 662 includes an integer value n that is not positive or negative, and passes through the shift register 618a. Obtaining n bits, when 256-bit data is consumed from the 32-bit scratchpad 618a (such as decoding one or more syntax components), the extraction operation is automatically started to obtain another 256-bit material, which is written The register of internal buffer 618b is then entered into shift register 618a for use in the next cycle. w. In some embodiments 'if the data corresponding to a symbol decoded shift register 618a has been used for a predetermined number of bits or bytes, and the second buffer 6 does not receive any further data, then The cabac logic module 660 can delay via the delay/reset bus _ to perform other reliances (eg, threads that are unrelated to CABAC decoding), such as vertex shader operations. Using the SREG Stream Buffer/DMA Engine 618 # dma can reduce the number of required _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 200813884 The stream 'can request to flow into the following bitstream data, if the bitstream data is too small, the bitstream buffer 618b is at risk of a downflow (eg, it is known to have the signal flow from the CABAC unit 530 to the processor). The number of cycles of the pipeline) can pass a delay signal to the processor pipeline, suspend operation, and wait for data to arrive at the bitstream buffer 618b. stupid
另外’ SREG串流緩衝器/DMA引擎618原本便有處 理錯誤位元流的能力,舉個例子,因為位元流錯誤,有可 能沒有偵測到切片結尾記號,這種偵測錯誤可能會導致解 碼完全錯誤,並用到後來的圖樣或切片的位元,sreg串 流緩衝ϋ/DMA引擎618記錄烟驗元數,如果使用的 位元數大於預設的門檻值(可針對每一切片改變),則結束 處理程序並將除去的信號送到處理器(如主處理器),然後 處理器執行編碼嘗試從錯誤中回復。 /請同時參閱第六A圖與第六㈣,進一步說明caba 單元53〇的功能,尤其是初始化解碼引擎(即bard引含 域組伽)及内容變數’在切片起始處,於解碼對應a 弟-巨圖塊的語法成分之前,初始化内容狀態以及bar 模組624,於-實施例中,驅動軟體i28發出兩個^ mIT—ctx以及INIT_ADE進行這個初始化 们令纽動CABAC解碼模式並初綠 ,或夕_谷表(㈣端儲存或儲存於 是麵),丽_CTX指令可以具有 INIT—CTX SRC2, SRC1 運算元SRC1具有 因應INIT一CTX指令’根據位元位置 35 200813884 與 H.264 巨圖塊參數有關的值:cabac_init_idc、mbPerLine、 constrained—intra_pred一 flag 、NAL—unit—type (NUT)、 MbaffFlag 等,請注意 constrained_intrajpred_flag、 NAL_unitJ;ype (NUT)、以及 MbaffFlag 對應於 H.264 巨圖 塊參數,另外,根據位元位置,運算元SRC2具有下列值: SliceQPY 以及 mbAddrCurr,進一步解釋,執行 INIT_CTX 指令(即初始化CABAC内容表)需要cabac_init_idc以及 sliceQPY (如量化)參數,不過,要初始化整個CABAC # 引擎需要三個指令,即INITJBTSR、INIT_CTX、以及 INIT—ADE,因此,SRC 1及^ SRC2 (如全"部64 4立兀或兩 個32位元)中的可用位元可以傳遞其他用於CABAC相鄰 内容的參數,因此兩個來源暫存器SRC1 662以及SRC2 664可以包含下列值: SRC1[15:0] = cabac—initidc, SRC 1 [23:16] = mbPerLine SRC 1 [24] = constrained—intra_pred—flag ❿ SRC1 [27:25] = NAL—unit—type (NUT)In addition, the 'SREG Stream Buffer/DMA Engine 618 originally has the ability to handle the error bit stream. For example, because the bit stream is incorrect, the slice end mark may not be detected. This detection error may result in The decoding is completely wrong, and the bits of the subsequent pattern or slice are used. The sreg stream buffer/DMA engine 618 records the smoke number, if the number of bits used is greater than the preset threshold (can be changed for each slice) Then, the handler is terminated and the removed signal is sent to the processor (such as the main processor), and then the processor performs an encoding attempt to reply from the error. / Please refer to the sixth A picture and the sixth (four) at the same time to further explain the function of the caba unit 53〇, especially the initialization decoding engine (ie the bard reference field group gamma) and the content variable 'at the beginning of the slice, in the decoding corresponding a Before the syntax component of the megablock, the content state and the bar module 624 are initialized. In the embodiment, the driver software i28 issues two ^mIT_ctx and INIT_ADE for this initialization, and the CABAC decoding mode is initialized and green. , or 夕_谷表 ((4) end storage or storage in the face), Li _CTX command can have INIT-CTX SRC2, SRC1 operand SRC1 has response to INIT-CTX instruction 'according to bit position 35 200813884 and H.264 giant map The values related to the block parameters: cabac_init_idc, mbPerLine, constrained_intra_pred_flag, NAL_unit_type (NUT), MbaffFlag, etc. Please note that constrained_intrajpred_flag, NAL_unitJ; ype (NUT), and MbaffFlag correspond to H.264 giant block parameters. In addition, according to the bit position, the operand SRC2 has the following values: SliceQPY and mbAddrCurr, further explained, executing the INIT_CTX instruction (ie, initialization) The CABAC table of contents) requires the cabac_init_idc and sliceQPY (such as quantization) parameters. However, to initialize the entire CABAC # engine requires three instructions, namely INITJBTSR, INIT_CTX, and INIT_ADE, so SRC 1 and ^ SRC2 (such as the full " The available bits in 64 4 or two 32-bit elements can pass other parameters for CABAC neighboring content, so the two source registers SRC1 662 and SRC2 664 can contain the following values: SRC1[15:0 ] = cabac—initidc, SRC 1 [23:16] = mbPerLine SRC 1 [24] = constrained—intra_pred—flag ❿ SRC1 [27:25] = NAL—unit—type (NUT)
SRC1[28] = MbaffFlag SRC1[31:29]=未定義 , SRC2[15:0] = SliceQPY • SRC2[31:16] = mbAddrCurrSRC1[28] = MbaffFlag SRC1[31:29]=undefined, SRC2[15:0] = SliceQPY • SRC2[31:16] = mbAddrCurr
SliceQPY的值是用於初始化位元流缓衝器618b内之一狀 態機(未繪出)。 雖然前文已討論各種已知之圖形與切片參數,另外提 36 200813884 供一些關於CABAC單元530之參數,於一實施例中, cabac一init—idc的疋義疋針對未編碼為i_picture⑴和切換 I_picture(SI)之切片(I),換句話說,cabac—init_idc只能針對 P、SP以及B切片進行定義,當接收到J和SI切片, cabac—initjdc為預設值,舉個例子,當欲初始化將進46〇 個内谷(如I以及SI切片),可以將cabac—丨仙-说設為3 (因為根據H.264規格’ cabac一init—idc的值只能是〇〜2), 致能2位元指示該切片為I或si。 參 CABAC單元530也可以使用INIT—CTX指令初始化 區域暫存器612以及mbNeighCtx記憶體610陣列結構或 元件,如與暫存相鄰巨圖塊有關之暫存器,請參閱第六D 圖,於一實施例中,mbNeighCtx記憶體610位於圖的上方, mbNeighCtx記憶體610的巨圖塊基準相鄰内容記憶體排列 成一記憶體陣列,以儲存有關巨圖塊列的資料, mbNeighCtx記憶體610包含陣列元素他漸处仏队i,R i,i+l,…119]601,每一個元素可儲存一列12〇個巨圖塊中 籲 的一個巨圖塊(因HDTV為ΐ92〇 χ ι〇8〇像素),目前 mbNeighCtx 603用於儲存目前解碼之巨圖塊,而左側 mbNeighCtx 605用於儲存先前解碼之(左側)巨圖塊,另 , 外,利用指標607a、6〇7b和607c (在第六D圖中以箭頭 • 表不)指向暫存器603、6〇5和陣列元素601,當解碼目前 之巨圖塊日守,解碼之資料儲存於目前的,當 已知CABAC解碼之内容性質時,根據前次解碼巨圖塊時 所蒐集之貧訊來解碼目前的巨圖塊,亦即左侧巨圖塊儲存 37 200813884 於左侧mbNeighCtx 605並利用指標607b進行指向,而上 方巨圖塊儲存於陣列元素[i]中並利用指標6〇7c進行指向。 繼續解釋初始化指令,INIT—CTX指令用於初始化與 目前巨圖塊(如mbNeighCtx記憶體610陣列之元素)相 鄰之巨圖塊有關的上方及左侧指標607c及607b,舉個例 子’左侧指標607b可以設為〇而上方指標可以設為},另 外,INIT一CTX指令還會更新總暫存器614。 關於初始化内容表,因應呼叫INIT一CTX,CABAC單 元530建立一個或多個内容表,亦稱為ctx—TABLE,於 一實施例中,内容表可以是4x460 x 16位元表(8位元給 m,8位元給η,具正負號的值)或其他資料結構,内容表 的每一個項目包含從pStateldx暫存器602及valMPS暫存 器604存取之pStateldx值及valMPS值。 INIT—ADE指令起始化BARD模組624,亦稱為解碼 引擎,於一實施例中,完成INIT一BTSR指令後呼叫 画T—ADE,於執行INIT一ADE指令之後,CABAC單元530 建立兩個暫存器,分別是C0CjlRange暫存器6〇6以及 codlOffset暫存器608,伴隨下列指令或數值: codlRange = 0x01FE 以及 codlOffset = ZeroExtend (READ(#9)5 #16) 於一實施例中,這些變數可以是9位元數值,關於 codlOffset指令,從位元流缓衝器618b讀取9位元,〇延 伸(zero-extended)則儲存於16位元c〇dl〇fftet暫存器608 中’某些實施例亦可使用其他數值。BARD模組624使用 38 200813884 =存於暫存& 6G6及6G8之數值,以決定要輸出〇或卜 虽一進位解碼完成,這些值將進行更新。 „除了初始化C〇dlRan辟暫存器606以及codloffset暫存 时608,INIT-ADE操作同時初始化二進位字串暫存器 6於貝把例中,一進位字串暫存器616可以是%位 元暫存器,從BARD模組624接收每一輸出位元,當然亦 _ 可使用其他尺寸之暫存器。 、當巨圖塊編碼成I—PCM資料,BARD模組624也會被 初始化’已知I一PCM資料包含像素資料,根據Η·264規格, 亚沒有將轉換或預測模型應用至原始視訊資料,舉個例 子,I一PCM可應用至無損編碼。The value of SliceQPY is used to initialize a state machine (not shown) within bit stream buffer 618b. While various known graphics and slice parameters have been discussed above, a further reference is made to 2008 13884 for some parameters relating to the CABAC unit 530. In one embodiment, the cabac-init-idc is not coded as i_picture(1) and switched to I_picture (SI). Slice (I), in other words, cabac-init_idc can only be defined for P, SP, and B slices. When receiving J and SI slices, cabac-initjdc is the default value. For example, when you want to initialize Into 46 内 内谷 (such as I and SI slices), you can set the cabac-丨仙-say to 3 (because the value of cabac-init-idc can only be 〇~2 according to the H.264 specification), enabling A 2-bit indicates that the slice is I or si. The CABAC unit 530 can also use the INIT-CTX instruction to initialize the area register 612 and the mbNeighCtx memory 610 array structure or component, such as the temporary register associated with the temporary neighboring giant block, see Figure 6D, In one embodiment, the mbNeighCtx memory 610 is located above the graph, and the macroblock reference adjacent content memory of the mbNeighCtx memory 610 is arranged into a memory array to store data about the macroblock column, and the mbNeighCtx memory 610 includes an array. The element he gradually becomes the team i, R i, i+l, ... 119] 601, each element can store a huge block of 12 巨 giant tiles in the block (because HDTV is ΐ92〇χ ι〇8〇) Pixel), currently mbNeighCtx 603 is used to store the currently decoded giant tiles, while the left mbNeighCtx 605 is used to store previously decoded (left) giant tiles, and, in addition, using indicators 607a, 6〇7b and 607c (in the sixth) In the figure D, the arrow • the table does not point to the register 603, 6〇5 and the array element 601. When decoding the current giant tile, the decoded data is stored in the current, when the content nature of the CABAC decoding is known. According to the previous solution The rich information collected in the giant block is used to decode the current giant tile, that is, the left giant tile storage 37 200813884 is on the left side mbNeighCtx 605 and is pointed by the indicator 607b, and the upper giant tile is stored in the array element [i ] and use the indicator 6〇7c for pointing. Continuing to explain the initialization instructions, the INIT-CTX instruction is used to initialize the upper and left indicators 607c and 607b related to the giant tile adjacent to the current giant tile (such as the element of the mbNeighCtx memory 610 array), for example, 'left side The indicator 607b can be set to 〇 and the upper indicator can be set to }, and the INIT-CTX instruction also updates the total register 614. Regarding the initialization content table, in response to calling INIT-CTX, the CABAC unit 530 creates one or more content tables, also referred to as ctx-TABLE. In one embodiment, the table of contents may be a 4x460 x 16-bit table (8 bits). m, 8 bits for η, a signed value or other data structure, each entry of the table of contents contains pStateldx values and valMPS values accessed from pStateldx register 602 and valMPS register 604. The INIT-ADE instruction initiates the BARD module 624, also known as the decoding engine. In one embodiment, after the INIT-BTSR instruction is completed, the call T-ADE is called. After the INIT-ADE instruction is executed, the CABAC unit 530 establishes two. The registers are C0CjlRange register 6〇6 and codlOffset register 608, respectively, with the following instructions or values: codlRange = 0x01FE and codlOffset = ZeroExtend (READ(#9)5 #16) In one embodiment, these The variable can be a 9-bit value. For the codlOffset instruction, the 9-bit is read from the bitstream buffer 618b, and the zero-extended is stored in the 16-bit c〇dl〇fftet register 608. Other embodiments may also use other values. The BARD module 624 uses 38 200813884 = stored in the temporary & 6G6 and 6G8 values to determine whether to output 〇 or 卜. Although the carry decoding is completed, these values will be updated. „In addition to initializing the C〇dlRan register 606 and the codloffset temporary store 608, the INIT-ADE operation simultaneously initializes the binary string register 6 in the example of a pin, the one-bit string register 616 may be % bit. The meta-register receives each output bit from the BARD module 624, and of course _ can use other sizes of registers. When the giant block is encoded into I-PCM data, the BARD module 624 is also initialized. It is known that I-PCM data contains pixel data. According to the Η264 specification, the conversion or prediction model is not applied to the original video data. For example, I-PCM can be applied to lossless coding.
已描述與解析位元流及初始化各種解碼系統元件有關 的架構及指令,下面將描述有關二進位化、取得模型資訊 及内容、以及根據模型及内容進行解碼的程序,通常 CABAC單元530用於取得解析語法成分(Syntax eiement, SE)所有可能的二進位化,或是經由模組620及BIND ⑩ 指令至少取得模型資訊,CABAC單元530更經由GCTX 模組622及GCTX指令得到已知語法成分的内容,並根據 内容及模型資訊,讓BARD模組624及BARD指令實行運 ^ 算解碼,實際上,呼叫GCTX/BARD指令、輸出一位元給 ^ 二進位字串暫存器616直到發現配合已知語法成分之有意 i子碼專兩步驟會構成一迴圈,亦即於一實施例中,每一 次解碼二進位值之後,提供對應的解碼位元給二進位字串 暫存器616,接著GCTX模組622讀回二進位字串暫存器 39 200813884 的内容,直到發現配合的字碼。 這裡更詳細解釋使用單一 CABAC單元530的解碼系 統架構,請再同時參閱第六A圖與第六B圖,驅動軟體 128發出的BIND指令會致能611^1)模組62〇,於一實施例 中,BIND指令具有下列格式: BIND DST,#Imml6, SRC1, 其中,DST對應於DST暫存器652,#Imml6對應於16位 元目前數值,而SRC1對應輸入暫存器SRC1 662,BINd 運异的輸入包含語法成分(SE,包含16位元目前數值imm) 以及内容區塊種類(ctxBlockCat),語法成分可以包含任何 符合H.264規格的語法成分種類(如MB1ypeIia、The architecture and instructions relating to parsing bitstreams and initializing various decoding system components have been described. The following describes procedures for binarization, acquisition of model information and content, and decoding based on models and content, typically used by CABAC unit 530. Parsing all possible binarizations of the syntax component (Syntax eiement, SE), or obtaining at least model information via the module 620 and the BIND 10 command, and the CABAC unit 530 further obtains the content of the known syntax component via the GCTX module 622 and the GCTX command. According to the content and model information, the BARD module 624 and the BARD command are subjected to arithmetic decoding. In fact, the GCTX/BARD command is called, and a bit is output to the binary string register 616 until the match is known. The intentional i subcode of the grammatical component will constitute a loop, that is, in one embodiment, after decoding the binary value, the corresponding decoding bit is provided to the binary string register 616, followed by GCTX. Module 622 reads back the contents of binary string register 39 200813884 until a matching word is found. The decoding system architecture using a single CABAC unit 530 is explained in more detail here. Please refer to the sixth A diagram and the sixth B diagram at the same time. The BIND command sent by the driver software 128 enables the 611^1) module 62〇, in one implementation. In the example, the BIND instruction has the following format: BIND DST, #Imml6, SRC1, where DST corresponds to DST register 652, #Imml6 corresponds to the current value of 16 bits, and SRC1 corresponds to input register SRC1 662,BINd The different inputs contain the syntax component (SE, containing the 16-bit current value imm) and the content block type (ctxBlockCat). The syntax component can contain any grammatical component type that conforms to the H.264 specification (eg MB1ypeIia,
MBSkipFlagB、IntraChromaPredMode 等等),哞叫 BIND 指令會使得驅動軟體128從儲存在記憶體(如晶片上記憶 體或遠端記憶體)中的表單(或其他資料結構)讀取語法 成分’並取得語法成分索引(SEIdx),該語法成分索引用於 存取其他表單或資料結構,以獲得各巨圖塊參數。 於一實施例中,DST暫存器652包含32位元暫存器, 具有下列格式:位元〇_8 (ctxIdxOffset)、位元1卜18 (maxBinldxCtx)、位元 21-23 (ctxBlockCat)、位元 24-29 (ctxIdxBlockCatOffset)、以及位元 31 (bypass flag),這些數MBSkipFlagB, IntraChromaPredMode, etc.), squeaking the BIND command causes the driver software 128 to read the grammatical component from the form (or other data structure) stored in the memory (such as on-wafer memory or remote memory) and obtain the grammar The component index (SEIdx), which is used to access other forms or data structures to obtain the various tile parameters. In one embodiment, the DST register 652 includes a 32-bit scratchpad having the following format: bit 〇_8 (ctxIdxOffset), bit 1 卜 18 (maxBinldxCtx), bit 21-23 (ctxBlockCat), Bits 24-29 (ctxIdxBlockCatOffset), and bit 31 (bypass flag), these numbers
值(如 ctxIdxOffset、maxBinlndxCtx 等等)會傳送至 GCTX 模組622用作内容模型分析之用,在此實施例中,任何未 定義的剩下位元可以是〇,根據語法成分索引與 ctxBlockCat的配對結果,ctxIdxBlockOfftet可從遠端儲存 200813884 或儲存於晶片上記憶體之表單或其他資料結構獲得,表一 說明一非限定實施例之表單内容: 表一Values (eg, ctxIdxOffset, maxBinlndxCtx, etc.) are passed to the GCTX module 622 for use as a content model analysis. In this embodiment, any undefined remaining bits may be 〇, based on the syntax component index and ctxBlockCat pairing results. , ctxIdxBlockOfftet can be obtained from the remote storage 200813884 or a form or other data structure stored on the on-chip memory. Table 1 illustrates the contents of a non-limiting embodiment: Table 1
codeNum (k) Coded_block_pattem Intra一4x4 Inter 0 47 0 1 31 16 2 15 1 3 0 2 4 23 4 5 27 8 6 29 32 7 30 3 8 7 5 9 11 10 10 13 12 11 14 15 12 39 47 13 43 7 14 45 11 15 46 13 16 16 14 17 3 6 18 5 9 19 10 31 20 12 35 21 19 37 22 21 42 23 26 44 24 28 33 25 35 34 26 37 36 27 42 40 28 44 39 29 1 43 30 2 45 31 4 46 32 8 17 33 17 18 34 18 20 41 200813884 35 20 24 36 24 19 37 6 21 38 9 26 39 22 28 40 25 23 41 32 27 42 33 29 43 34 30 44 36 22 45 40 25 46 38 38 47 41 41 如果接收到未定義之ctxBlockCat,則CABAC單元530 可以將未定義參數當成0,所以將ctxIdxBlockOffset當成0 值0 呼叫BIND也會使得重置信號(Rst_signal)從BIND模 組620輸出至BARD模組624,說明如下。codeNum (k) Coded_block_pattem Intra-4x4 Inter 0 47 0 1 31 16 2 15 1 3 0 2 4 23 4 5 27 8 6 29 32 7 30 3 8 7 5 9 11 10 10 13 12 11 14 15 12 39 47 13 43 7 14 45 11 15 46 13 16 16 14 17 3 6 18 5 9 19 10 31 20 12 35 21 19 37 22 21 42 23 26 44 24 28 33 25 35 34 26 37 36 27 42 40 28 44 39 29 1 43 30 2 45 31 4 46 32 8 17 33 17 18 34 18 20 41 200813884 35 20 24 36 24 19 37 6 21 38 9 26 39 22 28 40 25 23 41 32 27 42 33 29 43 34 30 44 36 22 45 40 25 46 38 38 47 41 41 If an undefined ctxBlockCat is received, CABAC unit 530 can treat the undefined parameter as 0, so treat ctxIdxBlockOffset as a 0 value. Calling BIND will also cause the reset signal (Rst_signal) to be output from BIND module 620 to BARD. Module 624 is illustrated below.
為了說明BIND模組620的各種輸入與輸出,這裡提 出至少一實施例說明BIND模組620之操作,哞叫BIND 模組620,則BIND模組620取出語法成分,並且經由軟 體k供a吾法成分索引(SEIdx),利用語法成分索引, 模組620查找表單以獲得maxBjnidxCtx、ctxIdxOffset、以 及bypassFlag的對應數值,這些查找值會暫時儲存在dst 暫存器652的預先定義位元配f,另外,利用語法成分索 亏丨及ctxBlockCat,BIND模組620進行第二次表單查找(如 從遠端記鍾或⑼上記憶體)频得。刷畑⑽㈣的 數值,第一次的查找值也是暫時儲存在DST暫存器 中,因此決定值將用於建立DST暫存器、652,做為^位 元數值輸出目標。 42 200813884 針對某些語法成分,可利用其他的資訊(語法成分與 ctxBlockCat除外)開始Η·264解碼操作,舉個例子,像是 SigCoeffFlag以及lastSigCoeffFlag、儲存在巨圖塊鄰近内 容記憶體610的陣列元素裡的值、以及輸 入ctxBlockCat值等巨圖塊參數,均可用來決定巨圖塊是圖 場編碼或是圖框編碼,根據圖形是圖場編碼或是圖框編 碼,則SigCoeffFlag以及lastSigCoeffFlag會有不同的編 碼,於某些實施例中,即使是不同的語法成分,這些旗標 . 也使用同樣的語法成分數目,然後利用 mb—field一decoding—flag ( mbNeighCtx[l]攔位)來區分。 除了上述所列有關BIND模組620的功能,於第六b 圖中,BIND模組620還與binldx暫存器654、多工器單元 656及/或轉遞暫存器666及/或668 (在第六c圖中為 F1)連接,多工器單元650會根據各輸入提供輸出SRC1 (如暫存器SRC1内的值)給GCTX模組622。 至於標示為F1的轉遞暫存器,當BIND (或GCTX) 鲁 指令產生結果,便會將結果寫入目標暫存器(如DST暫存 器652)及/或標示為F1的轉遞暫存器666及668,一個 指令及對應的模組(如GCTX模組622或BARD模組624) . 疋否使用轉遞暫存器666及668會於指令中用轉遞旗標表 ^ 示’代表轉遞暫存器666及668的符號有F1 666 (使用轉 遞來源1之值,於一實施例中可以指令中的位元26表示) 以及F2 668 (使用轉遞來源2之值,於一實施例中可以指 令中的位元27表示),資料會分別轉遞至GCTX模組622 43 200813884 以及BARD模組624 ’說明如下。 W面已說明BIND模組620及相關程序,這裡將戈 關於GCTX模組622及GCTX指令如何取得已知模型的 容及二進位索引,簡單地說,GCTX模組622的輪入包含 maxBinldxCtx、binldx、以及 CtxIdx〇ffset,GCTX 模組 使用CMdxOffset及binldx值來計算CtxIdx之值(為輪出, 代表内容索引)。 ^ GCTX指令的範例格式如下: • GCTX DST,SRC2, SRC 1, 其中SRC1對應於多工器單元656的輸出值並儲存於暫存 器SRC1 662,而SRC2對應於DST暫存器652的輸出值 並儲存於暫存器SRC2 664,而DST對應於目標暫存器, 於一實施例中,各暫存器具有下列數值: SRC1 [7:0] = binldx ;如果目前語法成分包含一 codedBlockPattern,則SRC1的值(從多工器單元656 輸出,並做為GCTX模組622的輸入)可以是binidx * 暫存器654的值。 SRC1 [15:8]可以是 levelListldx (當計算 sigCoeffFlag、lastSigCoeffFlag)或是 mbPartldx (當 ’ 計算編碼區塊圖樣之Ref_Idx或binldx),亦即,當 . 語法成分是 sigCoeffFlag 或 lastSigCoeffFlag 時,多 工器單元656可以用來傳送levelListldx。 SRC1 [16]可以包含iCbCr旗標,當其值為0時,區 塊為Cb色度區塊,另外,SRC1 [16]可以包含L0/L1 44 200813884 值,如果是L0,其值為0,熟悉此技藝者從本發明 的内容可知L0/L1是用於運動補償預測之圖形參考 列表(LO = listO, LI = listl )。 SRC1 [21:20] = mbPartitionMode SRC2 [8:0]二 ctxIdxOffset SRC2 [18:16] = maxBinldxCtx SRC2 [23:31] = ctxBlockCat SRC2 [29:24] = ctxIdxBlockOffset 畚 SRC2 [31] = bypassFlag 再來,DST包括GCTX模組622的輸出並具有下列值: DST [15:00] = ctxldx DST [ 23:16] = binldx DST [ 27:24] = mbPartldx DST [29:28] = mbPartitionMode DST [30] = L0 ^ GCTX模組622可以與轉遞暫存器互相作用,因此使 用轉遞暫存器的指令格式可以是GCTX.F1.F2,其中jq及 F2分別代表使用轉遞暫存器666及668,亦即,在指令編 碼中有兩位元(F1及F2)’如果缺少一個或兩個轉遞旗標, 則代表丨又有使用轉遞暫存器,如果有設定這些位元(例如 δχ成1),則代表使用轉遞暫存器之值(内部產生值),否 則’就使用來源暫存器之值’因此,這個轉遞暫存器的护 徵在於當最早的指令發出時,將提供編譯器提示,:果: 有使用轉遞,職令在來源暫存器可能會遇到寫後^ 45 200813884 (read-after-write)風險。 對於GCTX指令,如果設定了重置信號Rst_Signal, 貝1J SRC1的值為〇 ’如果(F1 &rst—啦皿1),則srci的值會 是GCTX模組622内的值加卜不然SRC1會是從執行單 兀暫存器得到的binldx值,BIND模組620的輸出可以做 為SRC2的值,供GCTX及BARD指令使用,此時要等到 BARD指令使用轉遞暫存器後才會發出BIND指令,更進 一步解釋,Rst—Signal以及F1轉遞訊號將結合成一單一訊 _ 唬{F1,reset) (2位元訊號),表示輸入GCTX模組622的 SRC1值是包括binldx值或轉遞值,提供Rst—Signal的另 一個作用是清空並重置二進位字串616,並將binIdx暫存 器654重置為〇。 繼續討論GCTX模組622以及取得内容資訊,於一實 施例中,表二及表三列出㈣訊分別對應於地而迪⑶ 記憶體610及目前mbNeighCtx暫存器6〇3,如前所述,目 ^mbNeig,hCtx暫存器603包含目前巨圖塊的解碼輸出結 果,δ目鈾巨圖塊處理結束時,發出CWRITE指令,將目 前mbNeighCtx暫存器、603的資訊複製到心阶啡^記憶 體610陣列的對應位置,這個複製的資訊稍後會做為上方 - 鄰近值。 表二 transform size 8x8 flag mb field decode flag mb skip一flag_ Intra—chroma_pred—mode mb tvoe —數(位元) — 0 1 2 2 4:3 7:5 參數 46 200813884In order to explain various inputs and outputs of the BIND module 620, at least one embodiment is described herein to illustrate the operation of the BIND module 620. When the BIND module 620 is called, the BIND module 620 extracts the syntax components and provides a method via the software k. The component index (SEIdx), using the syntax component index, the module 620 looks up the form to obtain the corresponding values of maxBjnidxCtx, ctxIdxOffset, and bypassFlag, and these search values are temporarily stored in the pre-defined bit of the dst register 652 with f, in addition, Using the grammatical component and ctxBlockCat, the BIND module 620 performs a second form lookup (such as from a remote clock or (9) on the memory). For the value of 畑(10)(4), the first search value is temporarily stored in the DST register, so the decision value will be used to create the DST register, 652, as the ^ bit value output target. 42 200813884 For some grammatical components, other information (except for ctxBlockCat) can be used to start the 264264 decoding operation. For example, SigCoeffFlag and lastSigCoeffFlag, array elements stored in the macroblock adjacent to the content memory 610. The values in the macro block and the input ctxBlockCat value can be used to determine whether the giant block is field code or frame code. SigCoeffFlag and lastSigCoeffFlag will be different depending on whether the picture is field code or frame code. The encoding, in some embodiments, even for different syntax components, these flags use the same number of syntax components and then distinguish them using mb-field-decoding_flag ( mbNeighCtx[l]). In addition to the functions listed above for the BIND module 620, in the sixth b diagram, the BIND module 620 is also associated with the binldx register 654, the multiplexer unit 656, and/or the transfer register 666 and/or 668 ( In the sixth c-picture, the F1) connection, the multiplexer unit 650 provides an output SRC1 (such as the value in the register SRC1) to the GCTX module 622 according to each input. As for the transfer register labeled F1, when the BIND (or GCTX) Lu command produces a result, the result is written to the target register (such as DST register 652) and/or the transfer indicated as F1. 666 and 668, an instruction and corresponding module (such as GCTX module 622 or BARD module 624). 疋 No use of transfer registers 666 and 668 will use the transfer flag table in the instruction. The symbols representing the transfer registers 666 and 668 are F1 666 (using the value of forwarding source 1, which can be represented by bit 26 in the instruction in one embodiment) and F2 668 (using the value of forwarding source 2, In one embodiment, the bit 27 in the command can be indicated), and the data is forwarded to the GCTX module 622 43 200813884 and the BARD module 624 ' respectively. The BIND module 620 and related programs have been described. Here, how the GCTX module 622 and the GCTX command can obtain the binary index of the known model. In short, the rounding of the GCTX module 622 includes maxBinldxCtx and binldx. And CtxIdx〇ffset, the GCTX module uses the CMdxOffset and binldx values to calculate the value of CtxIdx (for rounding, representing the content index). ^ The sample format of the GCTX instruction is as follows: • GCTX DST, SRC2, SRC 1, where SRC1 corresponds to the output value of multiplexer unit 656 and is stored in register SRC1 662, and SRC2 corresponds to the output value of DST register 652. And stored in the scratchpad SRC2 664, and the DST corresponds to the target register. In one embodiment, each register has the following values: SRC1 [7:0] = binldx ; if the current syntax component contains a codedBlockPattern, then The value of SRC1 (output from multiplexer unit 656 and as input to GCTX module 622) may be the value of binidx * register 654. SRC1 [15:8] can be levelListldx (when calculating sigCoeffFlag, lastSigCoeffFlag) or mbPartldx (when 'calculating code block pattern Ref_Idx or binldx), that is, when the .syntax component is sigCoeffFlag or lastSigCoeffFlag, the multiplexer unit 656 can be used to transfer levelListldx. SRC1 [16] may contain the iCbCr flag. When the value is 0, the block is a Cb chrominance block. In addition, SRC1 [16] may contain L0/L1 44 200813884 value, and if it is L0, its value is 0. Those skilled in the art will appreciate from the teachings of the present invention that L0/L1 is a graphical reference list for motion compensated prediction (LO = listO, LI = listl). SRC1 [21:20] = mbPartitionMode SRC2 [8:0] two ctxIdxOffset SRC2 [18:16] = maxBinldxCtx SRC2 [23:31] = ctxBlockCat SRC2 [29:24] = ctxIdxBlockOffset 畚SRC2 [31] = bypassFlag Come again, The DST includes the output of the GCTX module 622 and has the following values: DST [15:00] = ctxldx DST [ 23:16] = binldx DST [ 27:24] = mbPartldx DST [29:28] = mbPartitionMode DST [30] = L0 ^ GCTX module 622 can interact with the transfer register, so the instruction format using the transfer register can be GCTX.F1.F2, where jq and F2 represent the use of transfer registers 666 and 668, respectively. That is, there are two elements (F1 and F2) in the instruction code. If one or two transfer flags are missing, then the transfer register is used, and if these bits are set (for example, δ is converted into 1), it means using the value of the transfer register (internal generated value), otherwise 'use the value of the source register'. Therefore, the protection of this transfer register is that when the earliest instruction is issued, Provide compiler hints: Fruit: There is a use of forwarding, the order in the source register may be encountered after writing ^ 45 2008138 84 (read-after-write) risk. For the GCTX instruction, if the reset signal Rst_Signal is set, the value of the Bay 1J SRC1 is 〇 'if (F1 & rst - dish 1), then the value of srci will be the value in the GCTX module 622 plus the SRC1 will It is the binldx value obtained from the execution of the single buffer. The output of the BIND module 620 can be used as the value of SRC2 for the GCTX and BARD instructions. At this time, the BIND will not be issued until the BARD instruction uses the transfer register. The instruction further explains that the Rst-Signal and F1 forwarding signals will be combined into a single signal _ 唬{F1,reset) (2-bit signal), indicating that the SRC1 value input to the GCTX module 622 includes the binldx value or the forwarding value. Another role of providing Rst-Signal is to clear and reset the binary string 616 and reset the binIdx register 654 to 〇. Continuing to discuss the GCTX module 622 and obtaining content information, in one embodiment, Tables 2 and 3 list (4) respectively correspond to the Didi (3) memory 610 and the current mbNeighCtx register 6〇3, as described above. , mbNeig, hCtx register 603 contains the decoding output of the current giant block, when the end of the processing of the δ uranium giant block, the CWRITE command is issued, and the information of the current mbNeighCtx register and 603 is copied to the heart level. The corresponding position of the array of memory 610, this copied information will later be used as the top-adjacent value. Table 2 transform size 8x8 flag mb field decode flag mb skip-flag_ Intra-chroma_pred-mode mb tvoe —number (bit) — 0 1 2 2 4:3 7:5 Parameters 46 200813884
codedBlockPattemLuma 4 11:8 codedBlockPattemChroma 2 13:12 codedFlagY 1 14 codedFlagCb 1 15 codedFlagCr 1 16 codedFlagTrans 8 24:17 refldx 8 32:25 predMode 4 36:33 表三 參數 位元數(位元) transform_size_8x8_flag 1 0 mbfielddecodeflag 1 1 mb一skip_flag 1 2 Intra_chroma_pred_mode 2 4:3 mbQpDeltaGTO 1 88 codedBlockPattemLuma 4 11:8 codedBlockPattemChroma 2 13:12 codedFlagY 1 14 codedFlagCb 1 15 codedFlagCr 1 16 codedFlagTrans 24 87:64 refldx 16 52:37 predMode 8 60:53 mb_type 3 63:61 於一實施例中,codedFlagTrans分成三段,例如前4 位元與ctxBlockCat為0或1時有關,較高4位元則與 ctxBlockCat為3或4有關,較高4位元還分成兩個部分, 較低2位元用於iCbCr二0時,而另2位元則用於iCbCr二1 時,預測模式(predMode)有三種選項:predLO == 0、predLl =^ NiPred = 2 ° 第六E圖顯示表二與表三中的refldx之一結構實施 47 200813884 例,refldx為重建圖形之用 構對記憶體及邏輯電路提供了最佳圖的形;=:個結 refldx結構包括第—列的巨輯6〇9'\圖 (押油㈣611 (圖中有4個)、L0/Ll值613、以^ = L0和L1值都有對岸的德户— 及母個 = 疋值_(大於0)615和 ^ ; ,雖然需要的是底部列的巨圖塊,通常 萬要存取的是上方相鄰巨圖塊_,巨圖塊 參codedBlockPattemLuma 4 11:8 codedBlockPattemChroma 2 13:12 codedFlagY 1 14 codedFlagCb 1 15 codedFlagCr 1 16 codedFlagTrans 8 24:17 refldx 8 32:25 predMode 4 36:33 Table 3 Parameter Number of Bits (Bit) transform_size_8x8_flag 1 0 mbfielddecodeflag 1 1 Mb_skip_flag 1 2 Intra_chroma_pred_mode 2 4:3 mbQpDeltaGTO 1 88 codedBlockPattemLuma 4 11:8 codedBlockPattemChroma 2 13:12 codedFlagY 1 14 codedFlagCb 1 15 codedFlagCr 1 16 codedFlagTrans 24 87:64 refldx 16 52:37 predMode 8 60:53 mb_type 3 63 : 61 In an embodiment, codedFlagTrans is divided into three segments. For example, the first 4 bits are related to ctxBlockCat being 0 or 1. The higher 4 bits are related to ctxBlockCat being 3 or 4. The higher 4 bits are also divided into two. In part, when the lower 2 bits are used for iCbCr 2, and the other 2 bits are used for iCbCr 2, the prediction mode (predMode) has three options: predLO == 0, predLl =^ NiPred = 2 ° Figure E shows the structure implementation of one of the refldx in Table 2 and Table 3. In 2008, the example of refldx is used to reconstruct the graphics to provide memory and logic circuits. The shape of the best graph; =: The knot refldx structure includes the first column of the series 6〇9'\Fig. (oil (4) 611 (4 in the figure), L0/Ll value 613, ^= L0 and L1 values There are on the other side of the German - and the parent = 疋 _ (greater than 0) 615 and ^;, although the need is the giant block of the bottom column, usually the access is the upper adjacent giant block _, giant Tile reference
:,形成4個巨圖塊分割區611,對於每—個二 確疋觀613的值,但不是真實值,即觸L0及L1的 值是i或大於!即可,於—實施例中,藉由儲存2位元⑽ 仍和如617完成判斷,這2位元用於計算語法成分 (refldx)。 更進-步轉滅dx結構的域,便是進行了兩次最 佳化,如果進行一次最佳化,只有保留2位元(雖然參考 值通常比較大),CABAC單元530解碼refldx不需要其他 位元,解碼完整值並保留在執行單元暫存器或記憶體(如 L2快取記憶體408)中,第二次最佳化則只保留4個元素 (2個左侧及2個上方),這4個元素再次利用,並由 CWRITE指令將最終值寫入相鄰元素,因為目前 mbNeighCtx暫存器603只需要保留16位元,左側 mbNeighCtx暫存器605以及陣列010的上方mbNeighCtx 元素601只需要8位元,所以可節省記憶體,同時因為不 再元整地计异解碼參考值,改以較少位元的布林(B〇〇iean) 運算代替,亦節省了計算邏輯電路。 48 200813884 表四顯示包含的mb_type : 表四:, form 4 giant block partitions 611, for each value of 613, but not the true value, that is, the value of touch L0 and L1 is i or greater than! That is, in the embodiment, the judgment is completed by storing 2 bits (10) and 617, which is used to calculate the syntax component (refldx). Further advancement to the domain of the dx structure is performed twice. If optimization is performed, only 2 bits are reserved (although the reference value is usually large), CABAC unit 530 does not need to decode refldx. The bit is decoded and stored in the execution unit register or memory (such as L2 cache 408). The second optimization only retains 4 elements (2 left and 2 above). These four elements are reused, and the final value is written to the adjacent element by the CWRITE instruction, because currently the mbNeighCtx register 603 only needs to retain 16 bits, the left mbNeighCtx register 605 and the array 010 above the mbNeighCtx element 601 only It requires 8 bits, so it can save memory. At the same time, because it no longer counts the decoding reference value, it replaces the B〇〇iean operation with fewer bits, which also saves the computational logic circuit. 48 200813884 Table 4 shows the included mb_type : Table 4
Mb_type 名稱, 4fb000 SI 4Ab001 I_4x4 or I一NxN 4'b010 工 16x16 4,b011 I_PCM 4,bl00 P_8x8 4,bl01 B—8x8 4'bllO B_Direct_l6x16 4,blll Others 另外也可使用第六B圖沒有晝出或討論的暫存器,像 是mbPerLine ( 8位元,不具正負號)、mb—qp—delta ( 8位 元’具有正負號)、以及mbAddrCurr ( 16位元,目前巨圖 塊位址),對於AddrCurr,提供1920χ 1〇8〇陣列,雖然只 需要13位元,但某些實施例可使用16位元以增進16位元 計算效能。 總暫存器614也儲存有從上述暫存器得到的值(如 mbPerline、mb AddrCurr、以及 mb—qp—delta),亦即這些儲 存在總暫存器614的值也會儲存在其他暫存器,有助於硬 體設計’於一實施例中,總暫存器614包括32位元暫存器, 内部包括對應於 mbPerline、mb AddrCurr、以及 mb qp delta 的值,其他還有對應於NUT、MBAFF_FLAG、以及 chroma—format—idc 的值。 可以利用INSERT指令更新總暫存器614的各攔位, INSERT指令的格式可為: INSERT DST, #Imm,SRC1 49 200813884 於此INSERT指令,#Imm包含10位元數字,資料的前5 位元寬度和較高5位元指定將插入資料的位置,輸入參數 具有下列格式:Mb_type name, 4fb000 SI 4Ab001 I_4x4 or I-NxN 4'b010 work 16x16 4, b011 I_PCM 4, bl00 P_8x8 4, bl01 B-8x8 4'bllO B_Direct_l6x16 4, bll Others You can also use the sixth B picture without the output or The scratchpads discussed are like mbPerLine (8-bit, no sign), mb-qp-delta (8-bit 'with sign), and mbAddrCurr (16-bit, current giant block address), for AddrCurr, which provides a 1920 χ 1 〇 8 〇 array, although only 13 bits are needed, some embodiments may use 16 bits to improve 16-bit computational efficiency. The total register 614 also stores values obtained from the above registers (e.g., mbPerline, mb AddrCurr, and mb_qp-delta), that is, the values stored in the total register 614 are also stored in other temporary stores. In an embodiment, the total register 614 includes a 32-bit scratchpad, internally including values corresponding to mbPerline, mb AddrCurr, and mb qp delta, and others corresponding to the NUT , MBAFF_FLAG, and the value of chroma_format_idc. The INSERT instruction can be used to update the stalls of the total register 614. The format of the INSERT instruction can be: INSERT DST, #Imm, SRC1 49 200813884 In this INSERT instruction, #Imm contains 10 digits, the first 5 bits of the data. The width and the upper 5 bits specify where the data will be inserted. The input parameters have the following format:
Mask = NOT(0xFFFFFFFF«#Imm[4:0])Mask = NOT(0xFFFFFFFF«#Imm[4:0])
Data = SRC 1 & Mask SDATA = Data«#Imm[9:5] SMask = Mask«#Imm[9:5] 輸出DST可以下式表示:Data = SRC 1 & Mask SDATA = Data«#Imm[9:5] SMask = Mask«#Imm[9:5] The output DST can be expressed as:
# DST = (DST & NOT(sMask)) I SDATA 利用INIT_CTX指令也可以將至少某些欄位(如NUT (NAL—UNIT—TYPE)、C (constrained—intra」)red—flag)、 MBAFF FLAG、mbPerLine、以及 mbAddrCurr)的值寫入 /初始化總暫存器614。 於一實施例中,區域暫存器612包含32位元暫存器, 其中包括對應於 b、mb_qp_delta、numDecodAbsLevelEql、 以及numDecodAbsLevelGtl的攔位,使用INSERT指令可 _ 以更新這些攔位,初始化區域暫存器612後,b = 0、 mb—qp—delta = 0、numDecodAbsLevelEql 二-1、以及 numDecodAbsLevelGtl = 0,使用下列格式的指令可以進行 • 初始化: , CWRITE SRC1, 其中,SRC 1 [15:0] = mbAddrCurr,CWRITE SRC1 更新總 暫存器614的mbAddrCurr攔位,CWRITE指令還有其他 的功能’於間早讨論相鄰元素結構及如何使用於解碼程序 50 200813884 後,將再做進一步的說明。 在CABAC解碼程序中,根據相鄰的巨圖塊(例如左 側及上方)預測及/或模式分析語法值,下面介紹幾種方 法,描述CABAC單元530如何決定左侧及上方相鄰巨圖 塊,以及決定這些巨圖塊是否存在,符號解碼階段利用 mbPerLine參數,如前所述,解碼程序使用鄰近值(如上 、 方或左側的巨圖塊或區塊),於一實施例中,BARD模組 624利用目前巨圖塊號碼及一列巨圖塊數量(mbperL㈣計 參 异下列式子,以计异出上方巨圖塊位址並確定左侧及上方 巨圖塊是否存在。 舉個例子,要判斷相鄰巨圖塊(如左侧巨圖塊)是否 存在(有效),必須進行一運算(如mbCurrAddr % mbPerLine),檢查結果是否為〇,於一實施例中,進行下 列算式: mbCurrAddr - x mbPerLine :[mbCurrAddr VombPerLine) mbCurrAddr mbPerLine# DST = (DST & NOT(sMask)) I SDATA At least some fields (such as NUT (NAL_UNIT_TYPE), C (constrained-intra)) red-flag), MBAFF FLAG can also be used with the INIT_CTX instruction. The values of mbPerLine, and mbAddrCurr) are written/initialized to the scratchpad 614. In an embodiment, the area register 612 includes a 32-bit scratchpad including intercepts corresponding to b, mb_qp_delta, numDecodAbsLevelEql, and numDecodAbsLevelGtl, which can be updated by using the INSERT instruction to initialize the area temporary storage. After 612, b = 0, mb_qp-delta = 0, numDecodAbsLevelEql two-1, and numDecodAbsLevelGtl = 0, can be performed using the following format instructions: • Initialization: , CWRITE SRC1, where SRC 1 [15:0] = mbAddrCurr, CWRITE SRC1 Updates the mbAddrCurr block of the total scratchpad 614, and the CWRITE instruction has other functions 'to discuss the adjacent element structure early and how to use it in the decoding program 50 200813884, which will be further explained. In the CABAC decoding process, based on the prediction and/or pattern analysis syntax values of adjacent giant tiles (eg, left and above), several methods are described below to describe how the CABAC unit 530 determines the left and upper adjacent giant tiles. And determining whether the macroblocks exist, the symbol decoding stage utilizes the mbPerLine parameter, as previously described, the decoding program uses neighboring values (such as giant tiles or blocks on the top, side or left side), in one embodiment, the BARD module 624 uses the current giant block number and the number of giant blocks (mbperL (4) to calculate the following formula to calculate the upper block address and determine whether the left and top giant blocks exist. For example, to judge Whether an adjacent giant tile (such as the left giant tile) exists (valid), an operation (such as mbCurrAddr % mbPerLine) must be performed to check whether the result is 〇. In one embodiment, the following equation is performed: mbCurrAddr - x mbPerLine :[mbCurrAddr VombPerLine) mbCurrAddr mbPerLine
mbCurrAddr代表對應於待解碼二進位符號的目前巨圖塊 位置,mbPerLine代表每一列的巨圖塊數量,上面的計算 用到一除法、一乘法、以及一減法。 再進一步說明BARD模組624的解碼機制,請參閱第 六F圖,其顯示待解碼之圖形(16 χ 8巨圖塊,因此 mbPerLin=16),如果解碼巨圖塊35 (mbC雲nt為35, 巨圖塊36還未完全解碼),需要前次解碼之上方巨圖塊 及左側巨圖塊34的資料,上方巨圖塊的資訊可從 200813884 mbNeighCtx[i]獲得,其中 i = mbCuirent % mbPerLine,在 這個例子中,i = 35% 16 = 3,當目前巨圖塊解碼完畢,利 用CWRITE指令更新左侧mbNeighCtx暫存器605及陣列 中之 mbNeighCtx[i] 601。 於另一例子中,考慮下式: mbCurrAddr s [〇: maxMB-\] 其中,maxMB 是 8192 ’ 而 mbPerLine = 120,於一實施例 中,可利用乘法及由儲存於晶片上記憶體的表單(如12〇χ 11位元表)查找之(Ι/mbPerLine)進行除法,如果 mbCurrentAddr是13位元,則使用13 χΐΐ乘法器,於一實 施例中’將乘法運异的結果取整數,儲存較上方的位 元,進行13x7的乘法運算,儲存較低的13位元,最後進 行13位元的減法運鼻以決定 a ,整個運算程序需要2 個週期,可以儲存這個結果給其他運算使用,每當 mbCurrAddr改變就計算一次。 於某些實施例中不進行模數(modulo)運算,改以執行 單元(如執行單元420a,420b等等)内的著色器邏輯電路 提供第一個mbAddrCmr值,將其指派給第一切片之第一 行’舉個例子,這個著色器邏輯電路可以進行下列計算: mbAddrCurr = absoluteMbAddrCurr - n x mbPerLine 因為 Η·264 彈性巨圖塊順序(flexible macr〇bl〇ck ordering , FMO)模式有些複雜的相鄰結構,為了處理這些模式,要新 增解碼系統200的著色器以計算左侧/上方可用性,並載 入CABAC單元530的一個或多個暫存器,如果不載入 52 200813884 (off-loading)CABAC單元530,當致能所有η·264模式支 援符號解碼,可降低硬體的複雜性。 CWRITE指令從目前mbNeighCtx 603的適當欄位複製 到上方mbNeighCtx[]601以及左側mbNeighCtx[](如陣列 610中的左侧巨圖塊),根據mBaffFrameFlag (MBAFF)是 否設定以及目前與先前巨圖塊的解碼方式是圖場解碼或圖 框解碼專因素’將資料寫入特定的上方m]3NeighCtx[] 601 以及左側 mbNeighCtx[],當(mbAddrCurr % mbPerLine 二= . 0),左侧mbNeighCtxLeft 605則標記為不存在(如初始化 成0) ’可以利用CWRITE指令「移動」111|5]^6丨811(1^記憶 體610、區域暫存器612、以及總暫存器614的内容,舉個 例子,CWRITE指令移動mbNeighCtx記憶體610的相關 内谷到弟i個巨圖塊的左侧及上方區塊(如j^NeighCtxfi] 或目前巨圖塊),並清空mbNeighCtx暫存器603,如前所 述,與mbNeighCtx記憶體相關的兩個指標是左侧指標6〇7b 及上方指標607c,CWRITE指令之後,上方索引增加1, ❿ 而目前巨圖塊的内容則移至陣列的上方位置及左侧位置, 上述糸統可以減少記憶體陣列的讀取/寫入埠的數量至一 個讀取/寫入埠。 , 利用1NSERT指令可以更新mbNeighCtx記憶體610、 _ 區域暫存器612、以及總暫存器614的内容,舉個例子,mbCurrAddr represents the current giant tile position corresponding to the binary symbol to be decoded, and mbPerLine represents the number of giant tiles for each column. The above calculation uses a division, a multiplication, and a subtraction. Further explaining the decoding mechanism of the BARD module 624, please refer to the sixth F picture, which shows the graphic to be decoded (16 χ 8 giant tiles, thus mbPerLin=16), if the decoding giant block 35 (mbC cloud nt is 35) , the huge block 36 has not been completely decoded), the data of the upper block and the left block 34 of the previous decoding are needed, and the information of the upper block can be obtained from 200813884 mbNeighCtx[i], where i = mbCuirent % mbPerLine In this example, i = 35% 16 = 3. When the current giant tile is decoded, the left mbNeighCtx register 605 and the mbNeighCtx[i] 601 in the array are updated with the CWRITE instruction. In another example, consider the following formula: mbCurrAddr s [〇: maxMB-\] where maxMB is 8192 ' and mbPerLine = 120. In one embodiment, multiplication and a form stored on the memory on the wafer may be utilized ( For example, 12〇χ11 bit table) find (Ι/mbPerLine) for division. If mbCurrentAddr is 13 bits, use 13 χΐΐ multiplier. In one embodiment, 'the result of multiplication is taken as an integer. The upper bit performs 13x7 multiplication, stores the lower 13 bits, and finally performs 13-bit subtraction to determine a. The entire operation requires 2 cycles. This result can be stored for other operations. Calculated once mbCurrAddr changes. In some embodiments, no modulo operation is performed, and the color filter logic in the execution unit (e.g., execution unit 420a, 420b, etc.) provides the first mbAddrCmr value, which is assigned to the first slice. The first line 'For example, this shader logic can perform the following calculations: mbAddrCurr = absoluteMbAddrCurr - nx mbPerLine because the mac·264 flexible macr〇bl〇ck ordering (FMO) pattern has some complicated phases Neighboring structure, in order to process these modes, a colorizer of the decoding system 200 is added to calculate the left/upper availability and loaded into one or more registers of the CABAC unit 530, if not loaded 52 200813884 (off-loading The CABAC unit 530, when enabling all η.264 modes to support symbol decoding, can reduce the complexity of the hardware. The CWRITE command is copied from the appropriate field of the current mbNeighCtx 603 to the upper mbNeighCtx[] 601 and the left mbNeighCtx[] (such as the left macro block in array 610), according to whether mBaffFrameFlag (MBAFF) is set and the current and previous giant blocks are The decoding method is field decoding or frame decoding. The specific factor 'writes the data to the specific upper m] 3NeighCtx[] 601 and the left mbNeighCtx[], when (mbAddrCurr % mbPerLine two = . 0), the left mbNeighCtxLeft 605 is marked as Does not exist (eg, initialized to 0) 'The CWRITE command can be used to "move" 111|5]^6丨811 (1^memory 610, area register 612, and total register 614 contents, for example, The CWRITE instruction moves the relevant inner valley of the mbNeighCtx memory 610 to the left and upper blocks of the i-th macroblock (such as j^NeighCtxfi) or the current giant block), and clears the mbNeighCtx register 603, as described above. The two indicators related to the mbNeighCtx memory are the left indicator 6〇7b and the upper indicator 607c. After the CWRITE instruction, the upper index is increased by 1, ❿ while the current giant block content is moved to the upper position and the left position of the array. The above system can reduce the number of read/write ports of the memory array to one read/write port. The mbNeighCtx memory 610, the _ region register 612, and the total register 614 can be updated by the 1NSERT instruction. Content, for example,
使用 INSERT 指令(如 INSERT #ImmlO,SRCl)可以寫入目前巨圖塊,之後的運算不會影 響左侧指標607b及上方指標6〇7c (只寫入目前位置)。 53 200813884 INSERT指令以及BARD模組624的更新將寫入 mbNeighCtx記憶體610的目前mbNeighCtx陣列元素6〇ι, 左侧指標607b指向記憶體610的元素,這個元素與相鄰的 陣列元素(相鄰於 mbNeighCtx 601,例如 mbNeighCtx[i-i]j 相同。 上面說明了有關取得内容及模型資訊,接下來說明 BARD模組624以及如何根據内容及模型資訊進行算述解 碼,BARD模組624受BARD指令操作,BARD指令的格 式可為: ° BARD DST,SRC2, SRC 1, 提供的一進位鼻術解碼彳呆作中’每一個二進位解碼形成單 一位元輸出,輸入參數如下: SRC1 = biuldx/ctxldx,這是 GCTX 模組 622 的輪出 SRC2 = bypassFlag,這是 BIND 模組 620 的輸出 如果使用轉遞暫存器,格式可為BARD.F1.F2,其中jq及 F2代表轉遞暫存器666及668,如果缺少一個或兩個轉遞 旗標,這表示沒有使用對應的轉遞暫存器,如前所述, BARD模組624會接收RST—Signal,而且在接收到訊號 後,會保留RST_Signal等到第一次呼叫BARD指令,然 後清空訊號。 … 運算時,BARD模組624從GCTX模組622接收内容 索引(ctxldx)值以及指向編碼位元流的目前位元解析位置 之指標(binldx),BARD模組624使用從c〇dl〇ffset暫存器 608以及codlRange暫存器606接收到的補償及範圍值,以 54 200813884 ^錄解碼引擎的目前區間(補償、補償+範圍)狀態、,BARD 松組624使用内容索弓丨來存取内容表(ctX-Table),依次 使用内谷表存取目前的可能性狀態pStateIdx及MPS值, pstateldx肖於讀取(從儲存在遠端記憶體或晶片上記憶體 之表單)LPS子範圍值、次一 Mps值、以及次一 Lps機 率0Use the INSERT instruction (such as INSERT #ImmlO, SRCl) to write the current giant tile. The subsequent operations will not affect the left indicator 607b and the upper indicator 6〇7c (only the current position is written). 53 200813884 The INSERT command and the update of the BARD module 624 will be written to the current mbNeighCtx array element 6〇ι of the mbNeighCtx memory 610, and the left indicator 607b points to the element of the memory 610, which is adjacent to the adjacent array element (next to mbNeighCtx 601, for example, mbNeighCtx[ii]j is the same. The above describes the content and model information. Next, the BARD module 624 and how to decode and decode according to the content and model information, the BARD module 624 is operated by the BARD command, BARD. The format of the instruction can be: ° BARD DST, SRC2, SRC 1, providing a carry-in nose decoding 彳 ' ' 'Every binary decoding to form a single bit output, the input parameters are as follows: SRC1 = biuldx/ctxldx, this is The output of the GCTX module 622 is SRC2 = bypassFlag, which is the output of the BIND module 620. If the transfer register is used, the format can be BARD.F1.F2, where jq and F2 represent the transfer registers 666 and 668, If one or two forwarding flags are missing, this means that the corresponding forwarding register is not used. As mentioned earlier, the BARD module 624 receives the RST_Signal and receives it. After the signal, RST_Signal is reserved until the first call to the BARD command, and then the signal is cleared. ... During the operation, the BARD module 624 receives the content index (ctxldx) value from the GCTX module 622 and the current bit parsing position to the encoded bit stream. The indicator (binldx), the BARD module 624 uses the compensation and range values received from the c〇dl〇ffset register 608 and the codlRange register 606 to record the current interval of the decoding engine at 54 200813884 (compensation, compensation + Scope) state, BARD loose group 624 uses the content to access the table of contents (ctX-Table), sequentially use the inner table to access the current possibility state pStateIdx and MPS value, pstateldx read (from storage) The form of the memory on the remote memory or on the wafer) LPS sub-range value, next Mps value, and next Lps probability 0
根據MPS值的狀態、次一範圍及可能性資訊,BARD 模組624计异目前二進位符號的MPS值,BARD模組624 輸出一個二進位符號(位元或二進位數值,例如bG、br·· bn)給—進位字串暫存器616,然後針對下一個二進位值的 相同或不同内容重複這個程序,路徑如圖中所示的從二進 位字串616暫存器到GCTX模組622的反饋接線658,根 據MPS值的選擇,BARD模組624亦更新補償、範圍值、 以及次一二進位數值的可能性狀態,另外,BARD模組024 將目前MPS及可能性狀態寫入内容表,供後來的内容使 用。 關於轉遞暫存器666及668的使用,如果利用信號通 知轉遞’可以延遲或不延遲指令,舉個例子,從bind模 組620轉遞至GCTX模組622沒有延遲,所以在下一週期 即發出GCTX指令;而從GCTX模組622轉遞至BARD 模組624會用掉4個週期,如果在週期j發出GCTX指令, 則可能在週期(j+5)發出BARD指令,中間沒有指令的空位 則填入4個NOP ;從BIND模組620轉遞至BARD模組 624也沒有延遲;從BARD模組624轉遞至GCTX模組622 55 200813884 的話,如果在週期j發出BARD指令,則在週期(j+5)發出 GCTX指令;如果保留第二個二進位字串而用切換的方 式,從BARD模組624轉遞至BIND模組620也沒有延遲, 要保留第二個二進位字串,可能發出bard至bard ‘々成為;又有延遲的繞走(bypass)方式。According to the state, the next range and the possibility information of the MPS value, the BARD module 624 calculates the MPS value of the current binary symbol, and the BARD module 624 outputs a binary symbol (bit or binary value, such as bG, br· • bn) to the carry string register 616, which is then repeated for the same or different contents of the next binary value, as shown in the figure from the binary string 616 register to the GCTX module 622. The feedback wiring 658, according to the selection of the MPS value, the BARD module 624 also updates the probability state of the compensation, the range value, and the second-two-digit value. In addition, the BARD module 024 writes the current MPS and the possibility state to the content table. For later use. Regarding the use of the transfer registers 666 and 668, if the notification "can be delayed or not delayed", for example, there is no delay in forwarding from the bind module 620 to the GCTX module 622, so in the next cycle, The GCTX command is issued; and the transfer from the GCTX module 622 to the BARD module 624 takes 4 cycles. If the GCTX command is issued in the cycle j, the BARD command may be issued in the cycle (j+5) with no instruction gap in the middle. Then fill in 4 NOPs; there is no delay in forwarding from BIND module 620 to BARD module 624; if BARD module 624 is forwarded to GCTX module 622 55 200813884, if BARD command is issued in cycle j, then in the cycle (j+5) issue the GCTX command; if the second binary string is reserved and switched, the BARD module 624 is forwarded to the BIND module 620 without delay. To retain the second binary string, It is possible to issue a bard to bard '々; there is a delayed bypass.
^應強調的是,本發明所舉的上所實施例或「較佳」實 施例僅為可能之施行範例,僅用以清楚說明本發明之原 理二即便對上述實補施以、變化和修飾,然皆不脫此中所 述系統及綠之精神和酬,所有此等修似變化應涵括 於本案之範圍内,受如时料利範關護。 【圖式簡單說明】 這裡所揭露實施_各方觀點可參考下觸式以獲得更深 入之瞭解1式巾的元件並未限定其尺寸_,僅用於清 ^明本發明之原^各圖中相㈣標號代表相對應的部 第-圖··圖形處理器系統實施例之方塊圖,其中可施 種解碼系統(及方法)實施例。 财處理環境之方㈣,其+可施行多種解碼系 第三圖:第二圖例示處理環境内之選擇元件方塊圖。 第四圖:第二圖與第三圖例示處理環境内之計算核心方塊 56 200813884 圖 —’其t可施行多種解碼系統實施例。 第五a圖:第四时算核心内 —,其t可施行多種解碼純實施例。70、梅元件方塊第五B圖:執行單元資料路之 解碼系統實施例。 鬼圖,其中可施行多種 ^六A圖··第五圖所示解碼系統實施例之 =六B圖:第六A圖解碼系統之方塊圖。‘ 弟六C圖:第六a圖解碼系統的位元流 塊圖。 第六D圖··第六A圖解碼系統之内容記 暫存器實施例之方塊圖。 第六£圖··應用第六A圖解碼系統之巨 例之方塊圖。 第六F圖··利用第六A圖解碼 機制之方塊圖。 蚊 圖 圖 緩衝器實施例之方 憶體結構配合相關 圖塊分割方式實施 例示巨圖塊解碼 【主要元件符號說明】 本案圖式中所包含之各s件列式如下: 1〇〇圖形處理器系統 1 〇4顯示介面單元 11()記憶體介面單元 118匯流排介面單元 102顯示袭置 106區域記憶體 H4圖形處理單元 122晶片組 57 200813884 124糸統s己憶體 126中央處理單元 128驅動軟體 200解碼系統 2〇2圖形處理器 204計算核心 2〇6執行單元集合控制及頂點/串流快取記憶體單元 繪圖管線 302紋理過濾單元 304像素打包元件 308寫回單元 402執行單元輸入 4〇6記憶體存取單元 41〇記憶體介面仲裁器 413接線 306命令流處理器 310紋理位址產生器 4〇4執行單元輸出 408 L2快取記憶體 412執行單元集合 420執行單元 504指令快取記憶體控制器 5〇6執行緒控制器 508緩衝器 51〇共用暫存器㈣犯執行單元資料路徑 514執行單元資料路徑先進先出緩衝器It should be emphasized that the above-described embodiments or "preferred" embodiments of the present invention are only possible examples, and are merely illustrative of the principles of the present invention, even if the above-mentioned implementations are modified, modified, and modified. However, they do not deviate from the spirit and rewards of the system and the greens mentioned in this article. All such changes should be covered in the scope of this case. [Simplified description of the schema] The implementation of the method disclosed herein can be referred to the lower touch to obtain a deeper understanding of the elements of the type 1 towel without limiting its size _, only for clearing the original picture of the present invention The middle phase (four) reference numerals represent block diagrams of corresponding portions of the graphics processor system embodiment in which a decoding system (and method) embodiment can be implemented. The financial processing environment (4), its + can perform a variety of decoding systems. The third figure: The second figure illustrates the block diagram of the selected components in the processing environment. Fourth Figure: Second and Third Figures illustrate a computational core block within a processing environment 56 200813884 Figure - 'T' can implement a variety of decoding system embodiments. The fifth a picture: the fourth time calculation core -, its t can be implemented in a variety of decoding pure embodiment. 70. Mei Element Block Figure 5B: Embodiment of the decoding system of the execution unit data path. Ghost map, which can be implemented in a variety of six-figure diagrams. The fifth embodiment of the decoding system embodiment = six B-picture: the block diagram of the sixth A-picture decoding system. ‘Different C C: The bitstream block diagram of the sixth a-picture decoding system. The sixth D figure · The sixth A picture decoding system content block diagram of the register embodiment. The sixth figure is a block diagram of the giant example of the decoding system of the sixth A picture. The sixth F map uses a block diagram of the sixth A picture decoding mechanism. Mosquito diagram buffer embodiment embodiment of the memory structure and related tile segmentation mode embodiment of the giant block decoding [main component symbol description] The s parts included in the diagram of this case are as follows: 1 〇〇 graphics processor System 1 显示 4 display interface unit 11 () memory interface unit 118 bus interface interface unit 102 display attack 106 area memory H4 graphics processing unit 122 chip group 57 200813884 124 s s 体 126 central processing unit 128 driver software 200 decoding system 2〇2 graphics processor 204 computing core 2〇6 execution unit set control and vertex/streaming memory unit drawing pipeline 302 texture filtering unit 304 pixel packing element 308 write back unit 402 execution unit input 4〇6 Memory access unit 41 〇 memory interface arbiter 413 wiring 306 command stream processor 310 texture address generator 4 〇 4 execution unit output 408 L2 cache memory 412 execution unit set 420 execution unit 504 instruction cache memory Controller 5〇6 thread controller 508 buffer 51〇shared register (4) commit unit data path 514 execution unit data path first First in first out buffer
516述部暫存器稽案⑽純量^器檔案 ,資料輸出控制器、似執行緒任務介面 526暫存器檔案 ^ ^ ns 532向量浮點單元 輯單元 530 CABAC 單元 534向量整數算術邏 536特殊目的單元 540暫存器檔案 544目前訊號線 6〇2狀態索弓丨暫存器 538多工器 542運算訊號線 601陣列元素 603 目前 mb]SfeighCtx 58 200813884 604高可能符號值暫存器 605 左侧 mbNeighCtx 606 碼長範圍暫存器 607指標 608 碼長補償暫存器 609巨圖塊列 610 巨圖塊相鄰内容記怜 611巨圖塊分割區 612 區域暫存器 613 L0/L1 值 614 總暫存器516 section temporary register file (10) scalar ^ device file, data output controller, like thread task interface 526 register file ^ ^ ns 532 vector floating point unit series unit 530 CABAC unit 534 vector integer arithmetic logic 536 special Destination unit 540 register file 544 current signal line 6〇2 state cable bow register 538 multiplexer 542 operation signal line 601 array element 603 current mb]SfeighCtx 58 200813884 604 high possible symbol value register 605 left mbNeighCtx 606 code length range register 607 indicator 608 code length compensation register 609 giant block column 610 giant picture block adjacent content memory 611 giant picture block partition 612 area register 613 L0/L1 value 614 total Memory
615 Gtl 617 GtO 616二進位字串暫存器615 Gtl 617 GtO 616 binary string register
618移位暫存^串流緩衝ϋ/直接記憶體存取引盤 620二進位化模組 622取得内容模組 624二進位算術解碼弓丨擎 628目標匯流排 632來源匯流排 636延遲/重置匯流排 640資料匯流排 652 DST暫存器 656多工器單元 660CABAC邏輯模組 664運算元暫存器 668轉遞暫存器 630來源匯流排 634命令及執行緒資訊匯流排 638地址匯流排 650記憶體模組 654 Binldx 暫存器、 658反饋接線 662運算元暫存器 666轉遞暫存器 59618 shift temporary storage ^ stream buffer ϋ / direct memory access platter 620 binary module 622 acquisition content module 624 binary arithmetic decoding 丨 丨 628 target bus 632 source bus 636 delay / reset convergence 640 data bus 652 DST register 656 multiplexer unit 660CABAC logic module 664 operation unit register 668 transfer register 630 source bus 634 command and thread information bus 638 address bus 650 memory Module 654 Binldx register, 658 feedback wiring 662 operation unit register 666 transfer register 59
Claims (1)
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US81182106P | 2006-06-08 | 2006-06-08 |
Publications (2)
Publication Number | Publication Date |
---|---|
TW200813884A true TW200813884A (en) | 2008-03-16 |
TWI348653B TWI348653B (en) | 2011-09-11 |
Family
ID=38899303
Family Applications (4)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW96120899A TWI344795B (en) | 2006-06-08 | 2007-06-08 | Decoding of context adaptive variable length codes in computational core of programmable graphics processing unit |
TW96120728A TWI354239B (en) | 2006-06-08 | 2007-06-08 | Decoding system unit |
TW096120896A TWI348653B (en) | 2006-06-08 | 2007-06-08 | Decoding of context adaptive binary arithmetic codes in computational core of programmable graphics processing unit |
TW96120726A TWI428850B (en) | 2006-06-08 | 2007-06-08 | Decoding method |
Family Applications Before (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW96120899A TWI344795B (en) | 2006-06-08 | 2007-06-08 | Decoding of context adaptive variable length codes in computational core of programmable graphics processing unit |
TW96120728A TWI354239B (en) | 2006-06-08 | 2007-06-08 | Decoding system unit |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW96120726A TWI428850B (en) | 2006-06-08 | 2007-06-08 | Decoding method |
Country Status (2)
Country | Link |
---|---|
CN (4) | CN101087411A (en) |
TW (4) | TWI344795B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9438933B2 (en) | 2011-11-08 | 2016-09-06 | Samsung Electronics Co., Ltd. | Method and device for arithmetic coding of video, and method and device for arithmetic decoding of video |
Families Citing this family (38)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8156410B2 (en) * | 2008-03-05 | 2012-04-10 | Himax Technologies Limited | Fast debugging tool for CRC insertion in MPEG-2 video decoder |
US8686921B2 (en) * | 2008-12-31 | 2014-04-01 | Intel Corporation | Dynamic geometry management of virtual frame buffer for appendable logical displays |
CN101577629B (en) * | 2009-05-14 | 2011-05-25 | 北京邮电大学 | Dynamic allocation method of coding vector based on graph coloring in multicast network |
CN101908200B (en) * | 2009-06-05 | 2012-08-08 | 财团法人资讯工业策进会 | Graphics processing system with power gating function and method |
US8681162B2 (en) * | 2010-10-15 | 2014-03-25 | Via Technologies, Inc. | Systems and methods for video processing |
GB2488159B (en) * | 2011-02-18 | 2017-08-16 | Advanced Risc Mach Ltd | Parallel video decoding |
US9378560B2 (en) | 2011-06-17 | 2016-06-28 | Advanced Micro Devices, Inc. | Real time on-chip texture decompression using shader processors |
US9231616B2 (en) * | 2011-08-05 | 2016-01-05 | Broadcom Corporation | Unified binarization for CABAC/CAVLC entropy coding |
CN103037213B (en) * | 2011-09-28 | 2016-02-17 | 晨星软件研发(深圳)有限公司 | The cloth woods entropy decoding method of cloth woods entropy decoder and image playing system |
US20130307860A1 (en) * | 2012-03-30 | 2013-11-21 | Mostafa Hagog | Preempting Fixed Function Media Devices |
US9451258B2 (en) | 2012-04-03 | 2016-09-20 | Qualcomm Incorporated | Chroma slice-level QP offset and deblocking |
US9942571B2 (en) * | 2012-05-29 | 2018-04-10 | Hfi Innovations Inc. | Method and apparatus for coding of sample adaptive offset information |
US9196014B2 (en) * | 2012-10-22 | 2015-11-24 | Industrial Technology Research Institute | Buffer clearing apparatus and method for computer graphics |
CN103813177A (en) * | 2012-11-07 | 2014-05-21 | 辉达公司 | System and method for video decoding |
US9947084B2 (en) | 2013-03-08 | 2018-04-17 | Nvidia Corporation | Multiresolution consistent rasterization |
JP6379107B2 (en) * | 2013-05-21 | 2018-08-22 | 株式会社スクウェア・エニックス・ホールディングス | Information processing apparatus, control method therefor, and program |
CN107037984B (en) * | 2013-12-27 | 2019-10-18 | 威盛电子股份有限公司 | Data memory device and its method for writing data |
US9455743B2 (en) * | 2014-05-27 | 2016-09-27 | Qualcomm Incorporated | Dedicated arithmetic encoding instruction |
TW201626218A (en) | 2014-09-16 | 2016-07-16 | 輝達公司 | Techniques for passing dependencies in an API |
US10205957B2 (en) | 2015-01-30 | 2019-02-12 | Mediatek Inc. | Multi-standard video decoder with novel bin decoding |
US10250912B2 (en) * | 2015-02-17 | 2019-04-02 | Mediatek Inc. | Method and apparatus for entropy decoding with arithmetic decoding decoupled from variable-length decoding |
CN104869398B (en) * | 2015-05-21 | 2017-08-22 | 大连理工大学 | A kind of CABAC realized based on CPU+GPU heterogeneous platforms in HEVC parallel method |
GB2542162B (en) * | 2015-09-10 | 2019-07-17 | Imagination Tech Ltd | Trailing or leading digit anticipator |
US9537504B1 (en) * | 2015-09-25 | 2017-01-03 | Intel Corporation | Heterogeneous compression architecture for optimized compression ratio |
US10467006B2 (en) * | 2015-12-20 | 2019-11-05 | Intel Corporation | Permutating vector data scattered in a temporary destination into elements of a destination register based on a permutation factor |
US10375395B2 (en) | 2016-02-24 | 2019-08-06 | Mediatek Inc. | Video processing apparatus for generating count table in external storage device of hardware entropy engine and associated video processing method |
CN106921859A (en) * | 2017-05-05 | 2017-07-04 | 郑州云海信息技术有限公司 | A kind of CABAC entropy coding methods and device based on FPGA |
CN107277505B (en) * | 2017-05-19 | 2020-06-16 | 北京大学 | AVS-2 video decoder device based on software and hardware partition |
CN107242882A (en) * | 2017-06-05 | 2017-10-13 | 上海瓴舸网络科技有限公司 | A kind of B ultrasound shows auxiliary equipment and its control method |
CN110710219B (en) * | 2017-12-08 | 2022-02-11 | 谷歌有限责任公司 | Method and apparatus for context derivation for coefficient coding |
TWI674558B (en) | 2018-06-12 | 2019-10-11 | 財團法人工業技術研究院 | Device and method for processing numercial array data, and color table generation method thereof |
CN109818855B (en) * | 2019-01-14 | 2020-12-25 | 东南大学 | Method for obtaining content by supporting pipeline mode in NDN (named data networking) |
CN110458120B (en) * | 2019-08-15 | 2022-01-04 | 中国水利水电科学研究院 | Method and system for identifying different vehicle types in complex environment |
CN111028135B (en) * | 2019-12-10 | 2023-06-02 | 国网重庆市电力公司电力科学研究院 | Image file repairing method |
CN112582009B (en) * | 2020-12-11 | 2022-06-21 | 武汉新芯集成电路制造有限公司 | Monotonic counter and counting method thereof |
US11748011B2 (en) | 2021-03-31 | 2023-09-05 | Silicon Motion, Inc. | Control method of flash memory controller and associated flash memory controller and storage device |
US11733895B2 (en) | 2021-03-31 | 2023-08-22 | Silicon Motion, Inc. | Control method of flash memory controller and associated flash memory controller and storage device |
CN114816434B (en) * | 2022-06-28 | 2022-10-04 | 之江实验室 | Programmable switching-oriented hardware parser and parser implementation method |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7742544B2 (en) * | 2004-05-21 | 2010-06-22 | Broadcom Corporation | System and method for efficient CABAC clock |
EP1599049A3 (en) * | 2004-05-21 | 2008-04-02 | Broadcom Advanced Compression Group, LLC | Multistandard video decoder |
KR100612015B1 (en) * | 2004-07-22 | 2006-08-11 | 삼성전자주식회사 | Method and apparatus for Context Adaptive Binary Arithmetic coding |
US7800620B2 (en) * | 2004-11-05 | 2010-09-21 | Microsoft Corporation | Optimizing automated shader program construction |
-
2007
- 2007-06-08 CN CN 200710126453 patent/CN101087411A/en active Pending
- 2007-06-08 CN CN 200710110297 patent/CN101072350B/en active Active
- 2007-06-08 TW TW96120899A patent/TWI344795B/en active
- 2007-06-08 TW TW96120728A patent/TWI354239B/en active
- 2007-06-08 TW TW096120896A patent/TWI348653B/en active
- 2007-06-08 CN CN 200710126452 patent/CN101072353B/en active Active
- 2007-06-08 TW TW96120726A patent/TWI428850B/en active
- 2007-06-08 CN CN 200710110295 patent/CN101072349B/en active Active
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9438933B2 (en) | 2011-11-08 | 2016-09-06 | Samsung Electronics Co., Ltd. | Method and device for arithmetic coding of video, and method and device for arithmetic decoding of video |
US9888261B2 (en) | 2011-11-08 | 2018-02-06 | Samsung Electronics Co., Ltd. | Method and device for arithmetic coding of video, and method and device for arithmetic decoding of video |
US9888262B2 (en) | 2011-11-08 | 2018-02-06 | Samsung Electronics Co., Ltd. | Method and device for arithmetic coding of video, and method and device for arithmetic decoding of video |
US9888263B2 (en) | 2011-11-08 | 2018-02-06 | Samsung Electronics Co., Ltd. | Method and device for arithmetic coding of video, and method and device for arithmetic decoding of video |
US9888264B2 (en) | 2011-11-08 | 2018-02-06 | Samsung Electronics Co., Ltd. | Method and device for arithmetic coding of video, and method and device for arithmetic decoding of video |
Also Published As
Publication number | Publication date |
---|---|
CN101072350B (en) | 2012-12-12 |
CN101087411A (en) | 2007-12-12 |
CN101072350A (en) | 2007-11-14 |
CN101072353B (en) | 2013-02-20 |
TWI344795B (en) | 2011-07-01 |
CN101072349B (en) | 2012-10-10 |
TWI348653B (en) | 2011-09-11 |
TW200809689A (en) | 2008-02-16 |
TWI428850B (en) | 2014-03-01 |
TW200821982A (en) | 2008-05-16 |
TW200803526A (en) | 2008-01-01 |
CN101072349A (en) | 2007-11-14 |
TWI354239B (en) | 2011-12-11 |
CN101072353A (en) | 2007-11-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
TW200813884A (en) | Decoding of context adaptive binary arithmetic codes in computational core of programmable graphics processing unit | |
US7656326B2 (en) | Decoding of context adaptive binary arithmetic codes in computational core of programmable graphics processing unit | |
US7626518B2 (en) | Decoding systems and methods in computational core of programmable graphics processing unit | |
US7626521B2 (en) | Decoding control of computational core of programmable graphics processing unit | |
US7623049B2 (en) | Decoding of context adaptive variable length codes in computational core of programmable graphics processing unit | |
TWI482117B (en) | Filtering for vpu | |
TWI428852B (en) | Shader processing systems and methods | |
US20140153635A1 (en) | Method, computer program product, and system for multi-threaded video encoding | |
Abeydeera et al. | 4K real-time HEVC decoder on an FPGA | |
USRE44923E1 (en) | Entropy decoding methods and apparatus using most probable and least probable signal cases | |
CN107409229A (en) | Indicate the syntactic structure of the end of coding region | |
US20110261885A1 (en) | Method and system for bandwidth reduction through integration of motion estimation and macroblock encoding | |
US8624896B2 (en) | Information processing apparatus, information processing method and computer program | |
KR20230079414A (en) | Latency Management Using Deep Learning-Based Prediction in Gaming Applications | |
CN101909212A (en) | Multi-standard macroblock prediction system of reconfigurable multimedia SoC | |
US20090158379A1 (en) | Low-Latency Multichannel Video Port Aggregator | |
US8462848B2 (en) | Method and system for intra-mode selection without using reconstructed data | |
US11941397B1 (en) | Machine instructions for decoding acceleration including fuse input instructions to fuse multiple JPEG data blocks together to take advantage of a full SIMD width of a processor | |
Cho et al. | Parallelizing the H. 264 decoder on the cell BE architecture | |
AU739533B2 (en) | Graphics processor architecture | |
KR101693416B1 (en) | Method for image encoding and image decoding, and apparatus for image encoding and image decoding | |
TWI603616B (en) | On die/off die memory management | |
US9330060B1 (en) | Method and device for encoding and decoding video image data | |
AU744329B2 (en) | Data normalization circuit and method | |
Juurlink et al. | Putting It All Together: A Fully Parallel and Efficient H. 264 Decoder |