TW201216200A - Multi-shader system and processing method thereof - Google Patents

Multi-shader system and processing method thereof

Info

Publication number
TW201216200A
Authority
TW
Taiwan
Prior art keywords
data
coloring
shader
stage
color
Application number
TW100117717A
Other languages
Chinese (zh)
Other versions
TWI451355B (en)
Inventor
Timour Paltashev
John Brothers
Yi-Jung Su
Yang Jeff Jiao
Original Assignee
Via Tech Inc
Application filed by Via Tech Inc filed Critical Via Tech Inc
Publication of TW201216200A publication Critical patent/TW201216200A/en
Application granted granted Critical
Publication of TWI451355B publication Critical patent/TWI451355B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00: 3D [Three Dimensional] image rendering
    • G06T15/005: General purpose rendering architectures
    • G06T9/00: Image coding
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N19/436: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, using parallelised computational arrangements
    • H04N19/44: Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder
    • H04N19/60: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/61: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding
    • H04N19/80: Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation
    • H04N19/82: Details of filtering operations specially adapted for video compression, involving filtering within a prediction loop
    • H04N19/85: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • H04N19/86: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression involving reduction of coding artifacts, e.g. of blockiness
    • H04N19/90: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/91: Entropy coding, e.g. variable length coding [VLC] or arithmetic coding

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Graphics (AREA)
  • Computing Systems (AREA)
  • Signal Processing (AREA)
  • Image Generation (AREA)
  • Image Processing (AREA)

Abstract

A multi-shader system in a programmable graphics processing unit (GPU) for processing video data includes: a first shader stage configured to receive slice data from a frame buffer and perform variable length decoding (VLD), wherein the first shader stage outputs data to a first buffer within the frame buffer; a second shader stage configured to receive the output data from the first shader stage and perform transformation and motion compensation on the slice data, wherein the second shader stage outputs decoded slice data to a second buffer within the frame buffer; a third shader stage configured to receive the decoded slice data and perform in-loop deblocking filtering (IDF) on the frame buffer; a fourth shader stage configured to perform post-processing on the frame buffer; and a scheduler configured to schedule execution of the shader stages, the scheduler comprising a plurality of counter registers, wherein execution of the shader stages is synchronized utilizing the counter registers.

Description

[Technical Field]

The present invention relates to a data processing system, and more particularly to a video data processing system and method.

[Prior Art]

A central processing unit (CPU) is built from a number of computing structures for processing data such as video and graphics data. For some video or graphics workloads the CPU may have sufficient processing power, but it must still handle other data at the same time. Many graphics systems in such computing architectures are realized through interfaces such as Microsoft's Direct3D, OpenGL, and the like. When a particular operating system runs on a computer, these interfaces provide control of multimedia hardware such as a graphics accelerator or a graphics processing unit.

The generation of a picture or image is generally referred to as rendering, and this operation is carried out mainly by a graphics accelerator. In 3D computer graphics, the geometry representing the surfaces (or volumes) of the objects in a scene is converted into pixels (picture elements), stored in a frame buffer, and then presented on a display device. The appearance of each object, or of certain objects (material, reflection, shape, texture, and so on), may carry particular visual effects, and these appearances are defined in a rendering description table.

Many standards have been developed to improve the visual quality of the generated images while requiring less data. Among these standards, H.264 is a high-compression digital video coding standard, namely ISO MPEG-4 Part 10. For the same image quality, an H.264-encoded result requires about three times fewer bits than an MPEG-2-encoded result.
Therefore, H.264 is frequently used for video processing in current 3D graphics accelerators. To perform the processing described above, a dedicated hardware unit or a general-purpose central processing unit is normally required. The conventional architecture, however, has a drawback: while a graphics processing unit is performing work related to 3D rendering, the hardware for H.264 video processing sits idle. At present there is no method in the field that solves this problem.

SUMMARY OF THE INVENTION

The invention provides a multi-shader system for processing video data in a programmable graphics processing unit. In one possible embodiment, the multi-shader system comprises a first shader stage, a second shader stage, a third shader stage, a fourth shader stage, and a scheduler. The first shader stage receives slice data from a frame buffer, performs variable length decoding, and outputs data to a first buffer within the frame buffer. The second shader stage receives the output data of the first shader stage, performs transformation and motion compensation on the slice data, and outputs decoded slice data to a second buffer within the frame buffer. The third shader stage receives the decoded slice data and performs in-loop deblocking filtering on the frame buffer. The fourth shader stage performs post-processing on the frame buffer. The scheduler schedules the execution of the shader stages and comprises a plurality of counter registers, by which the execution of the shader stages is synchronized.

The invention further provides a processing method that processes video data with a multi-shader architecture. In one possible embodiment, the processing method comprises: mapping the functions required for video playback onto a plurality of shaders; retrieving wait values of the shaders, each wait value representing the execution time of the corresponding shader; and running a first shader, a second shader, and a third shader in parallel in a command stream processor, wherein the first shader performs variable length decoding, the second shader performs transformation and motion compensation, and the third shader performs in-loop deblocking filtering.

To make the features and advantages of the invention more apparent, preferred embodiments are described in detail below together with the accompanying drawings.

[Embodiments]

As noted above, H.264 normally requires a dedicated hardware unit or a general-purpose processor to process video data. Shaders are written to operate on large collections of elements at the same time, for example on every pixel within a certain region of the screen, or on every vertex of a model of a given type. This is well suited to parallel processing, and to achieve it many current graphics processing units have multiple cores, which improves processing throughput. The conventional architecture nevertheless has the drawback that when 3D rendering is active, the H.264 hardware is idle. Accordingly, several embodiments are described below in which H.264 processing is carried out by programmable shaders, balancing the load on the hardware. Synchronizing the many programmable shader stages requires a number of new instructions and registers, described later.
FIG. 1 shows a possible embodiment of a computer system. As shown in the figure, the computer system 100 includes a central processing unit 102, a system memory 104, and a graphics processing unit 110.

The central processing unit 102 includes functions for determining information, such as the positions of triangle vertices; when generating a rendering result, these vertex positions must be taken into account. The system memory 104 stores many kinds of data, including graphics display data such as texture data 106. The graphics processing unit 110 generates display data for a display device 130 according to the data stored by the central processing unit 102 and the system memory 104. In one possible embodiment, the display device 130 is a screen.

To lay a texture over an object, a texture mapping may be used. The graphics processing system 110 provides the many parts of a 3D object; stacking those parts together forms the object. When a 3D texture is required, a texture can be laid over an object to form an image, and the object is thereby textured.

The central processing unit 102 issues requests to the graphics processing unit 110 through a system interface 108, for example requesting that the graphics processing unit 110 process and display graphics information. The graphics processing unit 110 receives the requests issued by the central processing unit 102, and a front-end processor 112 also receives them. The front-end processor 112 generates a pixel stream that includes pixel coordinates. A texture filter 118 receives, through a texture cache system 114, information related to the pixel coordinates generated by the front-end processor 112. The texture cache system 114 receives the information from the front-end processor 112 and stores texture data in cache memory.

The texture filter 118 then performs filtering such as bilinear filtering, trilinear filtering, or a combination of bilinear and trilinear filtering. The texture filter 118 also generates texture data for each pixel. In contrast to conventional texture filtering elements such as linear interpolators and accumulators, the texture filter 118 additionally has a programmable table filter that, together with the other texture filtering elements, provides specific filtering operations. The texture data 106 is final color data; it is transferred to a frame buffer 120, and the frame buffer 120 causes the display device 130 to present an image according to the texture data 106.

The texture cache system 114 may comprise several caches, such as a first-level cache (L1 cache) and a second-level cache (L2 cache). Texture information is stored in individual texture elements (texels), which in graphics processing define the color data at pixel coordinates. The texture data 106 is transferred from the system memory 104 into the texture cache system 114 and then on to the texture filter 118.

FIG. 2 is a diagram of the elements, or stages, of the graphics pipeline 200 within the graphics processing unit 110 of FIG. 1. In FIG. 2, the graphics pipeline 200 within the graphics processing unit 110 has a command stream processor 252. The command stream processor 252 reads vertices stored in a memory 250; the vertices stored in the memory 250 are used to form geometry primitives and to generate the work items required by the pipeline 200.
In this regard, the command stream processor 252 reads the data stored in the memory 250, as well as data generated by the pipeline, such as triangles, lines, points, or other referenced primitives. This geometry information is collected and passed to a vertex shader 254. The vertex shader 254 is drawn with rounded edges, a convention used here to depict a stage of a geometry pipeline that is carried out by executing commands on a programmable execution unit, or on the combination of execution units shown in FIG. 3. In general, the vertex shader 254 processes vertices by performing transformation, scanning, and lighting operations, and then passes the processed results to a geometry shader 256. The geometry shader 256 receives the vertices output by the vertex shader 254, generates whole primitives from them, and can output a number of vertices that form a single topology, such as a triangle strip, a line strip, or a point list. The geometry shader 256 can further execute algorithms such as tessellation and shadow volume generation.

The geometry shader 256 outputs information that enters a triangle setup stage 257. In the triangle setup stage 257, triangle trivial rejection, determinant calculation, culling, pre-attribute setup, edge function calculation, and guardband clipping can be performed. These operations are necessary to triangle setup and are well understood by those in the art, so no further explanation is needed. The triangle setup stage 257 outputs information to a span and tile generator 258. This stage of the graphics pipeline is likewise well known and need not be detailed.

If a triangle processed by the triangle setup stage 257 is not rejected by the span and tile generator 258 or by other stages of the graphics pipeline, it enters an attribute setup stage 259 of the pipeline, which performs attribute setup operations. The attribute setup stage 259 produces a list of interpolation variables and the attributes required by the next stage of the pipeline. It also processes the attributes associated with the geometry primitives handled by the graphics pipeline.

For each pixel converted by the attribute setup stage 259, a pixel shader 260 is required. In general, the pixel shader 260 performs interpolation and other operations that determine the pixel colors output to a frame buffer 262. The operating principles of many of the elements shown in FIG. 2 are well known in the art and need not be described in detail; the invention can be fully understood without explaining the operation of these units.

FIG. 3 is a functional block diagram of the graphics processing unit shown in FIG. 1. The graphics system can generate a programmable shader, such as a geometry shader 310, a pixel shader 312, a vertex shader 308, or another conventional shader. These shaders are produced by a program and can be executed by at least one execution unit within a programmable execution unit pool 306. The programmable execution unit pool 306 may include a processing core capable of multithreaded operation, so the pool 306 can assign more than one thread to a particular type of shader.
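The fixed ordering of these pipeline stages can be pictured with the short sketch below. It is only an illustrative host-side model of the FIG. 2 stage chain, not the patent's hardware; the C type and function names are invented for this sketch.

```c
#include <stdio.h>

/* Illustrative model of the FIG. 2 stage ordering only. */
typedef struct { int dummy; } work_t;

static void vertex_shader(work_t *w)   { (void)w; /* transform, scan, light */ }
static void geometry_shader(work_t *w) { (void)w; /* build whole primitives */ }
static void triangle_setup(work_t *w)  { (void)w; /* reject, cull, clip     */ }
static void attribute_setup(work_t *w) { (void)w; /* interpolation vars     */ }
static void pixel_shader(work_t *w)    { (void)w; /* final pixel colors     */ }

static void (*const pipeline200[])(work_t *) = {
    vertex_shader, geometry_shader, triangle_setup,
    attribute_setup, pixel_shader,
};

int main(void)
{
    work_t w = {0};
    for (unsigned i = 0; i < sizeof pipeline200 / sizeof pipeline200[0]; i++)
        pipeline200[i](&w);            /* 254 -> 256 -> 257 -> 259 -> 260 */
    puts("frame written to frame buffer 262");
    return 0;
}
```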
The programmable execution unit pool 306 can execute two threads simultaneously to process two sets of data at the same time. For example, while the pool 306 executes a thread of the geometry shader 310 to process one piece of data, it can execute a thread of the vertex shader 308 to process another. Each execution unit of the pool 306 can execute many instructions within one clock cycle, so each execution unit can process multiple threads concurrently. As noted above, while an execution unit executes a thread associated with geometry shading, it can also execute a thread associated with pixel shading. After a number of shader stages have run, a scheduler performs the next action, receiving the next work from the shader stages for computation and distributing the computed results to the execution units. The threads executed within the execution units of the pool 306 are each scheduled to perform the relevant shader computation, so that a particular thread is scheduled for a different shader stage; and while other threads are assigned to other shader units, a particular execution unit can assign certain threads to one shader. The load among the execution units in the system can thereby be balanced, giving the system optimal processing capacity. Likewise, the thread load of the programmable execution unit pool 306 can be balanced so that the system has maximum processing capacity. Because conventional graphics systems use dedicated shader hardware, such systems lack this robustness and cannot achieve dynamic thread management; the structure of conventional graphics systems is therefore inflexible and not scalable.

An execution unit pool control and cache subsystem 304 contains a second-level (L2) cache that serves not only the execution unit pool 306 but also the system, for scheduling the pool 306. In a conventional graphics processing unit, communication between the execution unit pool 306 and external elements (outside the pool 306) passes through the execution unit pool control and cache subsystem 304. It is understood in the art, however, that establishing other connections and/or communication links to the execution unit pool 306 helps the graphics pipeline proceed. In particular, a triangle setup unit 314, an attribute setup unit 316, and a span and tile generator 318 have fixed hardware logic elements that can communicate with the execution unit pool 306 through the execution unit pool control and cache subsystem 304.

FIG. 4 shows another possible embodiment of the graphics processing unit 110 of FIG. 1. The graphics processing unit 110 generally has a graphics processing pipeline 424. A bus interface 428 separates the graphics processing pipeline 424 from a cache system 426. The graphics processing pipeline 424 has a vertex shader 430, a geometry shader 432, a rasterizer 434, and a pixel shader 436; the output of the graphics processing pipeline 424 is passed to a write-back unit (not shown). The cache system 426 has a vertex stream cache 440, a first-level (L1) cache 442, a second-level (L2) cache 444, a Z cache 446, and a texture cache 448.
The vertex stream cache 440 receives commands and graphics data and passes the received commands and data to the vertex shader 430. The vertex shader 430 performs vertex operations on the data provided by the vertex stream cache 440, using the vertex information to produce the triangles and polygons of the objects to be rendered. The geometry shader 432 and the first-level cache 442 receive the vertex data output by the vertex shader 430, and the first-level cache 442 and the second-level cache 444 can share data with each other. The first-level cache 442 can provide data to the geometry shader 432. The geometry shader 432 performs functions such as tessellation and point sprites, and it can also perform smoothing operations, for example generating a triangle from a single vertex or generating multiple triangles from a single triangle.

The graphics pipeline 424 also has a rasterizer 434. The rasterizer 434 processes the data output by the geometry shader 432 and the second-level cache 444. It can also use the Z cache 446 for depth analysis and the texture cache 448 for processing color characteristics. The rasterizer 434 may have fixed-function operations such as triangle setup, span generation, a depth test (Z test), pre-packing, pixel interpolation, packing, and the like. The rasterizer 434 may hold a transformation matrix for converting the vertices of an object in world space into screen-space coordinates.

The rasterizer 434 passes its data to the pixel shader 436, which determines the final pixel values. The pixel shader 436 processes and converts the color value of each pixel according to the various color characteristics, and the graphics processing pipeline 424 then outputs the complete image frame. As shown in FIG. 4, the shader units 430, 432, 434 and the fixed-function units use the cache system 426 at several stages. If the bus interface 428 is an asynchronous interface, transfers between the graphics processing pipeline 424 and the cache system 426 may require additional buffering.

In one possible embodiment, a playback device for the H.264 video compression standard uses several shader stages which, when executed, correspond to the video processing stages of a conventional graphics processing unit. In addition, these shader stages are executed concurrently so that the H.264 playback device achieves high-quality performance. Referring to FIG. 5, a first general-purpose shader stage (hereinafter GP0) 504 performs variable length decoding (VLD) 514. A second general-purpose shader stage (hereinafter GP1) 506 performs motion compensation and transformation functions; these functions may include an inverse discrete cosine transform (IDCT) function and motion compensation 516. A third general-purpose shader stage (hereinafter GP2) 508 performs in-loop de-blocking filtering (IDF) 518. Finally, a fourth general-purpose shader stage (hereinafter GP3) 510 performs general post-processing functions 520. The post-processing functions 520 may be de-interlacing, scaling, color space conversion, and the like.
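As a concrete illustration of this stage mapping, the sketch below tabulates the four general-purpose stages together with the buffers they read and write, following the description above and below. It is a minimal sketch only; the C names are invented and the patent does not define such an API.

```c
#include <stdio.h>

/* Minimal sketch of the GP0-GP3 stage mapping described in the text. */
enum gp_stage { GP0_VLD, GP1_MC_IDCT, GP2_IDF, GP3_POST };

struct stage_desc {
    enum gp_stage id;
    const char   *function;  /* work performed by the stage           */
    const char   *input;     /* where the stage reads its data        */
    const char   *output;    /* where the stage writes its result     */
};

static const struct stage_desc pipeline[] = {
    { GP0_VLD,     "variable length decoding (VLD) 514",
      "slice data in the frame buffer",
      "MV/residual/MB-control buffers in the frame buffer" },
    { GP1_MC_IDCT, "IDCT + motion compensation 516",
      "GP0 output and GP2 reference frames",
      "unfiltered YUV decoded-slice buffer" },
    { GP2_IDF,     "in-loop de-blocking filtering (IDF) 518",
      "GP1 decoded frame",
      "final YUV frame (looped back to GP1)" },
    { GP3_POST,    "post-processing 520 (de-interlace, scale, CSC)",
      "filtered frame",
      "displayable frame" },
};

int main(void)
{
    for (size_t i = 0; i < sizeof pipeline / sizeof pipeline[0]; i++)
        printf("GP%zu: %s\n", i, pipeline[i].function);
    return 0;
}
```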
In addition to video processing, the system includes an AES module 524 that decodes the bit stream for the command stream processor (CSP) and composites the final image frame so that a desktop computer presents the 3D image. Vertex shading and pixel shading are generally used for this. In some possible embodiments, however, when the video is presented full-screen, no compositing is needed and neither vertex shading nor pixel shading is required.

All of the stages 504, 506, 508, and 510, or only some of them, can run concurrently so that the variable length decoding (VLD) logic within GP0 504 reaches maximum utilization. It should be emphasized here that heavy use of the VLD logic avoids having only one block enabled at a time. Heavy use of the VLD logic benefits the VLD-related decoding operations, and in terms of performance the video processing unit (VPU) is usually the biggest bottleneck, particularly for high-bit-rate H.264 bit streams.

Under the current 3D mode, video decoding proceeds in pipelined fashion, with the video decoding stages 504, 506, 508, and 510 running concurrently. When 3D mode is started for compositing, the decoding shaders 504, 506, 508, and 510 are switched to 3D mode and the vertex shader (VS) and pixel shader (PS) are enabled. After the 3D commands complete, the decoding shaders are switched back to video mode. When all of the shader stages 504, 506, 508, and 510 run at the same time, the complexity of concurrent processing and the required resources must be considered. Therefore, to balance the complexity of running multiple shader stages concurrently, only three or four shader stages run at the same time in video mode, and only two 3D stages run at the same time in 3D mode.

Based on the per-frame work of the shader stages described above, the general-purpose (GP) shader stages of video processing are now described in detail. As noted, the video playback device runs several logical shader stages (such as GP0 through GP3). To make full use of the logic performing variable length decoding (VLD), transformation, motion compensation, and in-loop de-blocking, several shader stages run concurrently; the video processing unit (VPU) can thus also handle all of the video data. For example, the programmable motion compensation stage can work together with the texture pipeline and an additional VPU.

As noted above, the GP0 shader stage 504 generally performs variable length decoding (VLD). The GP0 stage 504 also reads slice data from frame buffer memory and writes motion vectors, residual data, and macroblock control structures into other buffers within the frame buffer. One thread normally processes one slice of data. The slice data stream is decoded into macroblocks according to the motion compensation (MC) and inverse discrete cosine transform (IDCT) operations.

Referring to FIG. 3, many computer architectures have at least one execution unit (EU) for processing data. More specifically, in at least one architecture an execution unit can be used to process many different kinds of data. A computing device may have an execution unit pool, and the pool may include at least one execution unit for executing data within the computer architecture.
One or more execution units can carry out a shader stage. Referring to FIG. 5, to start a general-purpose stage, the graphics driver generates a command queue in memory for each activated general-purpose stage to supply its input data. In one possible embodiment, a command queue entry may be 512 bits. Because reading system memory takes considerable time, in some embodiments the command queue is stored in video memory to reduce read latency. When a queue is received, a thread must first be stalled. When the graphics driver needs to run more threads of an activated GP stage, it must append additional writes to the tail of the command queue and update the corresponding register. Once all of the buffers of assigned commands are in use, the graphics driver should begin submitting a second command queue buffer; once the second command queue buffer is also in use, the driver switches back to the first buffer, or on to the next buffer in a cycle of buffers.
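The queueing scheme just described can be pictured with the short sketch below. It is only a sketch under the stated assumptions (512-bit entries, a tail register the driver updates, and two buffers used in rotation); the structure and function names are invented for illustration and are not defined by the patent.

```c
#include <stdint.h>
#include <string.h>

#define QUEUE_ENTRIES 64

/* One 512-bit command queue entry (input data for one GP-stage thread). */
typedef struct { uint32_t words[16]; } cmd512_t;

/* A command queue placed in video memory, plus the tail index that
 * mirrors the hardware register the driver updates after appending.    */
typedef struct {
    cmd512_t entries[QUEUE_ENTRIES];
    uint32_t tail;
} cmd_queue_t;

/* Two queue buffers used in rotation, as described above. */
static cmd_queue_t queues[2];
static int current;

/* Append one entry; switch to the other buffer when the current fills. */
static void enqueue(const cmd512_t *cmd)
{
    cmd_queue_t *q = &queues[current];
    if (q->tail == QUEUE_ENTRIES) {      /* buffer exhausted              */
        current = (current + 1) % 2;     /* cycle to the next buffer      */
        q = &queues[current];
        q->tail = 0;
    }
    memcpy(&q->entries[q->tail], cmd, sizeof *cmd);
    q->tail++;                           /* "update the corresponding register" */
}

int main(void)
{
    cmd512_t cmd = { {0} };
    for (int i = 0; i < 3 * QUEUE_ENTRIES; i++)  /* forces two switches   */
        enqueue(&cmd);
    return 0;
}
```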

The GP1 shader stage 506 generally performs IDCT and motion compensation on a single slice of data. In particular, the GP1 stage 506 reads the output of the GP0 stage 504 from frame buffer memory; in other embodiments, the GP1 stage 506 also reads reference data output by the GP2 stage 508. The GP1 stage 506 decodes the MC and DCT data streams and, according to the MC prediction data, produces unfiltered YUV basic image data. To perform this function, the GP1 stage 506 uses the programmable EU cores in addition to the texture pipeline. The result produced by the GP1 stage 506 is decoded slice data, which is stored in other buffers as a frame. When a frame contains several slices, multiple threads of the GP1 stage 506 can be used to decode the frame; because these threads decode the same frame, the decoded results are written into the same output buffer.

The GP2 shader stage 508 performs in-loop de-blocking filtering (IDF) on a frame or a field. The input data of the GP2 stage 508 is the output data of the GP1 stage 506. A single thread processes a single frame. After the unfiltered YUV basic image data has undergone in-loop de-blocking filtering, the final YUV image data is produced. The GP2 stage 508 uses only a single programmable EU core, and the output of the GP2 stage 508 is frequently looped back to the GP1 stage 506.

The GP3 shader stage 510 performs general post-processing functions, including film grain technology (FGT), de-interlacing, and other functions that improve image quality. Post-processing is normally applied to the complete frame corresponding to an activated thread. It should be understood that the output of the GP2 stage 508 is not fed back into the decoding loop for this purpose. Before the VLD stage (the GP0 stage 504) executes, the GP3 stage 510 can also perform an advanced encryption system (AES) operation in the command stream processor (CSP). In particular, this step copies the encrypted bit stream data from Peripheral Component Interconnect Express (PCIE) memory to the frame buffer and decrypts the bit stream during the copy.
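To make the slice-per-thread arrangement concrete, the sketch below shows how several GP1 threads of one frame might each consume one slice and write into the shared output buffer, with GP2 then filtering the whole frame. This is a minimal single-process model of the data flow only, with invented names; it does not model the EU hardware or the actual MC/IDCT math.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {                 /* illustrative only                      */
    const uint8_t *mv_residual;  /* GP0 output: MV + residual + MB control */
    size_t         len;
} slice_in_t;

typedef struct {
    uint8_t *yuv;                /* unfiltered YUV frame, shared by slices */
    int      width, height;
} frame_buf_t;

/* One GP1 "thread": decode exactly one slice into its rows of the shared
 * frame; all threads of a frame target the same output buffer.           */
static void gp1_decode_slice(const slice_in_t *in, frame_buf_t *out,
                             int first_row, int rows)
{
    (void)in;   /* IDCT + motion compensation would run here */
    for (int r = first_row; r < first_row + rows; r++)
        for (int x = 0; x < out->width; x++)
            out->yuv[(size_t)r * out->width + x] = 0;  /* placeholder */
}

/* GP2 "thread": one thread filters one whole frame in place. */
static void gp2_deblock_frame(frame_buf_t *f)
{
    (void)f;    /* in-loop de-blocking filter (IDF) would run here */
}

static void decode_frame(const slice_in_t slices[], int n_slices,
                         frame_buf_t *f)
{
    int rows = f->height / n_slices;
    for (int s = 0; s < n_slices; s++)      /* conceptually parallel      */
        gp1_decode_slice(&slices[s], f, s * rows, rows);
    gp2_deblock_frame(f);                   /* after all slices are done  */
}

int main(void)
{
    static uint8_t yuv[64 * 32];
    frame_buf_t f = { yuv, 64, 32 };
    slice_in_t s[2] = { {0, 0}, {0, 0} };
    decode_frame(s, 2, &f);
    return 0;
}
```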

Protected content on a playback device uses this process. Both the GP3 stage 510 and the GP2 stage 508 use a single programmable EU core. When a decryption key is used on the VLD data stream, the AES key must be generated before any of the four shader stages above runs. The GP3 stage 510 also scales the YUV image data, as a texture source, for drawing a 3D rectangular surface; scaling and RGB conversion can likewise be accomplished at this stage.

When the PCIE bus carries video data, the protected video playback path keeps high-value video content encrypted, and the playback device decrypts the content as the data is written into video memory. Under advanced scheduling, if the video content is transferred to system memory, it is encrypted again. Counter-mode AES and block-chained AES (BG-AES) support these two encryption passes. Counter-mode AES is normally used to send portions of the decoded video stream and to supply data to, or fetch data from, system memory. When encrypting data, block-chained scheduling is generally used for all decoded video data; block-chained scheduling reduces the CPU load. Encryption and decryption use a driver that supplies the key to the hardware. To prevent improper access to the key, the key itself is also encrypted when the bus transports it. In particular, a "session" key is used to decrypt a "content" key; the session key is used to encrypt the video data and is sent along with each video data packet. In another embodiment, one session key is sent along with several packets.

The shader stages above reflect only one of many possible ways of partitioning video data across stages. It should therefore be understood that other architectures can also process the video data, and other alternatives, modifications, and equivalent structures fall within the scope of this disclosure. Moreover, although the embodiments above use H.264 as the example, other data formats such as VC-1, WMV9 (Windows Media Video 9), and MPEG-2 can also be used. It should further be understood that, besides video playback and encoding, other post-processing functions can be supplied to general-purpose computation (such as GPGPU, general-purpose computation on the GPU); none of this is intended to limit the invention.

With each of the shader stages described above in place, the synchronization relationships among the shader stages are now explained. Referring to FIG. 5, the system further includes a scheduler 526. The scheduler 526 controls the progress of the different shader stages 514, 516, 518, and 520 described above. The system further includes counter registers 528, described in detail later. Because of the interdependence of the shader stages, the start of each shader stage must be synchronized by some mechanism. Before the synchronization is explained, the interdependencies among the stages are described to show why synchronization is necessary at each of them.

In general, the following events occur for a video frame to be decoded successfully. First, while the computer is running, an AES decryption key is produced to unlock the incoming video stream, which lets GP0 (the VLD stage) decode the decrypted slice data; before GP0 decodes the decrypted slice data, an AES key is generated and the incoming video stream is unlocked first. The macroblock stream buffer must also have a free storage slot in which to accumulate the incoming decoded slice data.

Before executing, GP1 (the MC/IDCT stage) requires a valid free slot in the VLD output stream control. In addition, for every B/P slice type, GP1 generally requires that in-loop de-blocking (IDF) has been performed on the reference frames. GP2 (the IDF stage) requires that all slices in a particular frame have first undergone the motion compensation and inverse discrete cosine transform (IDCT) operations. GP3 (the post-processing stage) likewise requires that IDF has been performed on a particular frame, or on all fields in a particular group.

In general, the different GP stages above may or may not be linked together. When GP stages are linked together, the output of one GP stage serves as the input of another GP stage; for example, the output of GP0 may be fed to GP1. In some embodiments, however, the output of more than one stage is needed before processing can begin. For example, motion compensation needs the macroblock data output by the VLD stage (GP0), and usually also the reference frame data output by the IDF stage (GP2). In addition, the output buffer being written should be a valid buffer that may be read by another downstream stage. In particular embodiments there may be several output buffers, so to ensure that another stage does not read the same output buffer, the output buffer must be validated before data is written. In one possible embodiment, the motion compensation shader stage (GP1) should confirm that the output buffer is to be written, and confirm that the IDF shader (GP2) following motion compensation will not read that output buffer; this disclosure, however, is not intended to limit the invention.

Therefore, to synchronize the many programmable shader stages, a number of instructions and registers must be provided. In one possible embodiment, a fence/wait synchronization design supplies the level of synchronization needed by the shader stages and their corresponding work. The fence/wait synchronization facility has 16 counter registers, each 16 bits wide, controlled by the execution unit pool (EUP). The instructions executed by the shader stages operate on these counter registers. These new instructions are added to the shader instruction set architecture (ISA) for synchronization, as detailed next.

To support synchronization among the shader stages (GP0 through GP3), the following instructions are added to the shader instruction set architecture. The instruction STREG performs a register store; the instruction CHKCTR checks a counter. STREG is broadly equivalent to a fence instruction and can write data to a counter register. CHKCTR is broadly equivalent to a wait instruction and can read the counter registers. In particular, CHKCTR takes two parameters, a counter parameter and a wait parameter; the wait parameter is compared against the count value of the specified counter. CHKCTR compares the wait parameter with the current contents of the counter register: if the wait parameter is less than or equal to the current count, the shader operation continues; otherwise the thread is put to sleep until the count value of the counter register equals the preset wait parameter. In general, when several count values must be checked, several instructions are required.
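The fence/wait scheme just described can be sketched in software as follows. This is a minimal host-side model of the stated semantics (16 counters of 16 bits, a fence-like store and a wait-like check), assuming a simple increment on the fence side; the function names are invented and do not correspond to the actual ISA encodings.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_CTRS 16

/* The 16 16-bit counter registers maintained by the EUP (model only). */
static uint16_t eup_counter[NUM_CTRS];

/* STREG-like fence: a stage advances a counter when its work (e.g. one
 * decoded slice or frame) becomes visible to consuming stages.         */
static void fence_signal(int ctr)
{
    eup_counter[ctr]++;       /* hardware would broadcast this to all EUs */
}

/* CHKCTR-like wait: the thread may proceed when wait_value <= count;
 * otherwise it would be put to sleep until the counter catches up.     */
static bool check_counter(int ctr, uint16_t wait_value)
{
    return wait_value <= eup_counter[ctr];
}

/* Example pairing: GP0 signals a finished slice; GP1 consumes slice
 * `slice_no` only after GP0 has produced at least that many slices.    */
void gp0_slice_done(void)            { fence_signal(0); }
bool gp1_may_consume(uint16_t slice_no) { return check_counter(0, slice_no); }

int main(void)
{
    gp0_slice_done();                /* GP0 finished slice 0 */
    return gp1_may_consume(1) ? 0 : 1;
}
```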
A possible encoding of the instruction STREG places the opcode in the upper bits of the instruction word (the bit-level encoding table in the original gazette does not reproduce cleanly here). Its form is:

STREG Rd, Rs1 ; perform a register store operation.

IMM field bits 10 to 13 select the destination block that receives the stored register/command data: 0 denotes memory, 1 denotes the command stream processor (CSP), 2 denotes the EUP, 3 denotes the TCC, and 4 to 5 are reserved.

The instruction STREG stores 512 bits of data. The destination block of the store operation may be memory, the command stream processor (CSP), the EUP, or the TCC. When STREG designates memory or a command, the 164-bit content of the register is stored, beginning from the least significant bits (LSB) of the 512-bit data. The 164-bit payload is laid out as follows:

type, 2 bits [163:162]: 0 = REG, 1 = CMD
(unnamed field), 14 bits [161:148]
vmsk, 4 bits [147:144]: valid mask; for REG/CMD, bit [0] validates data[31:0], bit [1] validates data[63:32], bit [2] validates data[95:64], and bit [3] validates data[127:96]

For the REG type:

<R>, 2 bits [143:142]
blk id, 6 bits [141:136]
reg addr, 6 bits [135:130]
reg off, 2 bits [129:128]: REG address offset
data, 128 bits [127:0]

(The gazette also tabulates a corresponding CMD-type layout over bits [143:128], which does not reproduce cleanly here.)

If the data is to be stored to memory, it is sent to the memory access unit (MXU) through the execution unit pool memory channel, bypassing the L2 cache.
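A host-side sketch of packing the REG-type payload described above might look like the following. It assumes the field layout given in the list (type, vmsk, <R>, blk id, reg addr, reg off, 128-bit data) and packs it into the low bits of a 512-bit message held as sixteen 32-bit words; the helper names are invented for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Write `width` bits of `value` at bit position `lo` of a 512-bit
 * message; bit numbering follows the document (data occupies [127:0]). */
static void set_bits(uint32_t msg[16], unsigned lo, unsigned width,
                     uint32_t value)
{
    for (unsigned i = 0; i < width; i++) {
        unsigned bit = lo + i;
        if ((value >> i) & 1u) msg[bit / 32] |=  (1u << (bit % 32));
        else                   msg[bit / 32] &= ~(1u << (bit % 32));
    }
}

/* Pack one REG-type 164-bit STREG payload per the layout above. */
void pack_streg_reg(uint32_t msg[16], uint32_t vmsk, uint32_t r,
                    uint32_t blk_id, uint32_t reg_addr, uint32_t reg_off,
                    const uint32_t data[4])
{
    memset(msg, 0, 16 * sizeof *msg);
    for (int i = 0; i < 4; i++)           /* data     [127:0]   */
        msg[i] = data[i];
    set_bits(msg, 128, 2, reg_off);       /* reg off  [129:128] */
    set_bits(msg, 130, 6, reg_addr);      /* reg addr [135:130] */
    set_bits(msg, 136, 6, blk_id);        /* blk id   [141:136] */
    set_bits(msg, 142, 2, r);             /* <R>      [143:142] */
    set_bits(msg, 144, 4, vmsk);          /* vmsk     [147:144] */
    set_bits(msg, 162, 2, 0 /* REG */);   /* type     [163:162] */
}

int main(void)
{
    uint32_t msg[16], data[4] = { 0xDEADBEEFu, 0, 0, 0 };
    pack_streg_reg(msg, 0x1, 0, 2, 5, 0, data);
    return (int)(msg[0] != 0xDEADBEEFu);
}
```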

The non-cacheable bit in the X-out bus is set, and the memory address is obtained from the register Rd.

If the data is to be stored to the command stream processor (CSP), it is sent to the memory access unit (MXU) through the execution unit pool memory channel (bypassing the L2 cache), and the non-cacheable bit and the CSP write bit in the X-out bus are set. A CSP write over the X-out bus from the EUP is set up to transfer the AES decryption key to the CSP.

If the data is to be stored to the EUP, it is sent to the EUP through the X-out vertex cache channel. This is used, with the instruction STREG, to flush or invalidate the L2 cache and, by setting the register, to update the EUP's GP shader counters.

If the data is to be stored to the TCC, it is sent through the X-out vertex cache channel to the EUP and then transferred to the TCC, where a TRIGGER command flushes or invalidates the texture cache.

A possible encoding of the instruction CHKCTR likewise places the opcode in the upper bits of the instruction word (the bit-level encoding table does not reproduce cleanly here). Its form is:

CHKCTR Rd, Rs1 ;
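The four destination cases decode directly from IMM bits 10 to 13, as described above; a minimal decode sketch follows (invented name, host-side model only).

```c
#include <stdint.h>

/* Map STREG's IMM field bits 10-13 to the destination block named above. */
const char *streg_destination(uint32_t imm)
{
    switch ((imm >> 10) & 0xFu) {
    case 0:  return "memory (EU-pool memory channel, L2 bypassed)";
    case 1:  return "command stream processor (CSP)";
    case 2:  return "execution unit pool (EUP)";
    case 3:  return "texture cache controller (TCC)";
    default: return "reserved (4-5)";
    }
}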

CHKCTR source 1 (Rs1) names a group of four registers holding sixteen 16-bit reference values, one per counter; the counters themselves are 16-bit up-counters. The reference value for any counter that is not used is set to 0, so that it is always less than or equal to the counter's value.

When the CRF register is at source 1, bits 0 to 15 hold the reference value for counter 0, bits 16 to 31 the reference for counter 1, and so on, with bits 112 to 127 holding the reference for counter 7. When the CRF register is at (source 1)+1, bits 0 to 15 hold the reference for counter 8, bits 16 to 31 the reference for counter 9, and so on, with bits 112 to 127 holding the reference for counter 15.

When Ref0 is less than or equal to Cntr0, and Ref1 is less than or equal to Cntr1, and so on through Ref15 less than or equal to Cntr15, the execution unit's comparison passes. If the result of the comparison is true, the thread's operation continues. If the result is false, the thread's operation is suspended until the comparison can pass; meanwhile the thread remains in the active state.

The EUP sends the counter values to all execution units over the corresponding bus, and the master counters are updated. Within each cycle, only one counter inside the EUP is updated.

With the drawings in view, the general sequence for operating the several shader stages, incorporating the synchronization architecture and the instructions STREG and CHKCTR above, is as follows. First, the input data in CRF (common register file) registers 0 and 1 is parsed; the input data is 512 bits, and the wait times are known from the count values. Next, the instruction CHKCTR is executed at least once to confirm that all input and output buffers are ready. If necessary, input data is read from one or more buffers; in general, the addresses of these buffers are contained in the 512-bit input data described above. The computations are then performed and the results written into buffers. A range within the execution unit L2 cache is then flushed and/or invalidated. If required, the instruction STREG can be used to maintain memory coherence, and, if required, STREG can also invalidate the texture cache. The instruction STREG is then used to update the EUP synchronization counters of the other shader stages, and an external fence is sent to the graphics driver to indicate how far the hardware has processed.

Because the stages serve different purposes, each general-purpose shader stage uses an independent fence address unless doing so would block the synchronization counters. The counter values increase, and each thread waits on the counter values before it starts; in addition, each thread updates the counter values before its operation ends.

For video decoding and video post-processing, AES decryption of protected video is also performed when the shader stage operations are tied to the synchronization architecture. The decode processing executed in the CSP can form part of a virtual page table (VPT) block. When video decryption is required, a region of PCIE system memory is copied, and decoded, into another region of video memory by reading the video frame buffer that feeds the VLD shader stage. Because video memory storage is limited, buffers are reused. External fence commands are used so that the driver can repeatedly overwrite a buffer, and to reuse the buffers in video memory the EUP counters use the EUP fence/wait architecture, namely the CSP's internal wait command together with the instruction STREG issued from the GP shader stages.

Since the video memory buffers are limited, they must be reused; after a buffer has been read, it is refilled with new data. Storing the commands in a DMA buffer and then executing reads of the commands inside it would incur a long delay. Therefore, to place as little burden as possible on the hardware driver, before issuing an AES copy command the driver inserts an internal wait command that waits until a counter's value reaches a preset value before the destination location is read and may be overwritten. After the internal wait command, a fence command can first be issued to ensure that the AES copy completes, and the counter is then updated to indicate that the input data of the GP0 shader stage is valid. Under the internal wait command, the CSP reads only the first four counters (0 to 3), but the CSP can update any of the 16 counter registers, each of which is 16 bits. A GP shader stage (such as GP0) reads the counters set by the CSP, and under the internal wait command a counter can be set by the instruction STREG within a shader stage (such as GP0).

It should be emphasized here that the multi-GP-stage architecture above provides a comparatively flexible programmable model, so video decoding performance can be tuned to the user's needs. The tunable aspects include thread granularity and cache hit rate. For each video decoding thread, the data processed may be a macroblock (MB), a slice (MC/IDF and so on), or a frame; moreover, threads running in parallel can process one or more frames. Different data granularities yield different decoding efficiency and drivers of different complexity.

To illustrate the invention, an example use of the multi-GP-stage architecture follows. In this example a pipelined architecture has GP0, GP1, and GP2, though this is not limiting. Assume further that the number of slices in a particular frame is known and that each phase of the decoding process has an appropriate number of kickoffs, where, in general, a kickoff represents one activation of a particular stage. For example, for a frame with two slices, GP0 initially has 2 kickoffs (that is, one per slice), GP1 has 2 kickoffs (one per slice), and GP2 then has 1 kickoff (the whole frame).

As described earlier, GP0 performs variable length decoding. The input data of GP0 contains the slice addresses and parameters related to the slice data, and it further includes the addresses of the output buffers. The EUP waits on the value of counter 0 to avoid over-supplying input data to the motion compensation stage (GP1); GP1 updates counter 0. As noted above, the local fence/wait synchronization architecture of the 16 counter registers synchronizes the several shader stages and their corresponding tasks; each of the 16 counter registers is 16 bits wide and is maintained under the control of the EUP (execution unit pool). The driver generally provides output buffers arranged as an array; when the maximum amount of slice data is output, the output buffers give the slice output sufficient capacity. The shader stage stores the motion compensation data into at least one buffer. The input data packet (for example, how many buffers were written) is written, by the GP0 stage, by the driver, or by some combination of the two, for the subsequent GP1 stage (which performs motion compensation). After the current stage completes, the EUP proceeds to the next decoding stage only after the count value reaches a preset value.

The AES decode operation in the CSP is updated after a GP0 thread completes, or only after the MC thread completes. In this embodiment it is quite possibly unnecessary to flush or invalidate the data stored in the L2 cache of the execution units, so in the GP0 stage the corresponding register control bit is set to 0; likewise, texture cache invalidation may be unnecessary, so the corresponding control bit is set to 0. Fence data is written to the fence address. When a GP0 thread starts, another GP0 thread can be started immediately; in one possible embodiment, the total number of GP0 threads does not exceed 2.

The GP1 stage performs transformation and motion compensation; in other embodiments, GP1 can additionally perform de-blocking. In general, one thread processes one complete slice of data. The input data packet includes the total number of motion compensation buffers (including the MBC, MV, and residual data), the output buffer address (the address of the decoded frame), a texture mapping table, or other data. The EUP waits on the results of counter 1 and counter 2: the result of counter 1 indicates that all reference frames have been decoded, and the result of counter 2 indicates whether the output buffer has been enabled for writing data. The shader reads the motion compensation buffers and produces the decoded frame. After the GP1 operation completes, counter 0 is updated, and the AES decode result is supplied into the VLD input buffer.

While the GP1 stage is operating, the following tasks are executed. External fence data is written to the fence address. The EUP's L2 cache is flushed, so that when the decoded frame is later read through the texture cache, the read result can serve as a reference value for decoding subsequent frames. The texture cache is generally invalidated. The GP1 thread is started once the GP0 thread has completed.
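The two-slice example above can be traced with the sketch below: GP0 and GP1 each kick off once per slice and GP2 once per frame, with counter 0 throttling GP0 against GP1 and counters 1 and 2 gating GP1. This is a minimal sequential model with invented names; real kickoffs would run as parallel EU threads that sleep inside CHKCTR.

```c
#include <stdio.h>
#include <stdint.h>

static uint16_t ctr[16];                     /* model of the EUP counters   */

static int may_proceed(int c, uint16_t ref)  /* CHKCTR rule: Ref <= Cntr    */
{
    return ref <= ctr[c];
}

int main(void)
{
    const int slices = 2;                    /* the two-slice frame above   */
    ctr[1] = 1;                              /* model: reference frames done */
    ctr[2] = 1;                              /* model: output buffer enabled */

    for (int s = 0; s < slices; s++)         /* GP0: 2 kickoffs (VLD)       */
        printf("GP0 kickoff %d: CHKCTR(ctr0 >= %d) %s, then decode slice\n",
               s, s,
               may_proceed(0, (uint16_t)s) ? "passes" : "would sleep");

    for (int s = 0; s < slices; s++) {       /* GP1: 2 kickoffs (MC/IDCT)   */
        if (may_proceed(1, 1) && may_proceed(2, 1)) {
            printf("GP1 kickoff %d: MC/IDCT on slice\n", s);
            ctr[0]++;                        /* GP1 updates counter 0       */
        }
    }
    printf("GP2 kickoff: IDF on the whole frame\n");  /* 1 kickoff per frame */
    return 0;
}
```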
的位址、、輸出緩衝器位址以及其它驅動器定義資料。EUP 等待一段時間,以確保輸出緩衝器可以被寫入,而不會過 度讀取資料。在完成GP2著色階段後,更新對應計數器, 寫出外部圍籬以及空出EUP的L2快取記憶體。根據接下 來的階段,讀取GP2階段的輸出。舉例而言,若下一階段 係進行額外的後處理(在紋理操作下,寫入資料),EUP的 L2快取記憶體的相對應位址的資料會被清除。在其它實施 例中,若下一階段讀取資料,並作為紋理顯示(藉由顯示介 面單元或DIU)時,快取記憶體也會被清除。如果輸出緩衝 器覆蓋先前透過紋理快取記憶體所讀取的資料,則紋理快 取記憶體會被無效化,以避免讀取到舊的資料。如上所述, GP3用以進行通用後處理,如解交錯、縮放、顏色空間轉 換…等。 第6圖係為利用CSP内的多著色架構進行影像處理的 一可能流程圖。方塊610將影像播放所需的功能映射到許 多著色器。在一些實施例中,可使用第3圖所顯示的映射 架構。在方塊620中,擷取每一著色器的一等待值。這些 等待值與相對應的著色器的執行時間有關。在方塊630 S3U07-0009!00-TW/0608D-A42795-TW/Final 28 201216200 中,根據擷取到的等待值,令該等著色器並列進行。一般 而言,方塊620及630係指向先前所述的同步架構。另外, 如上所述,同步架構使用第5圖所示的計數暫存器528。 第7圖係為高級加密系統(AES)資訊的一可能複製實施 例。在進行第一著色階段(如GP0)前,方塊710開始複製 高級加密系統資訊。特別來說,這個步驟包含,將週邊元 件連接介面(PCIE)記憶體的加密位元流資料複製到圖框緩 衝器。在方塊720中,進行複製處理時,對位元流進行解 密。接著,將解密後的位元流複製到一圖框緩衝器。這個 處理係用於播放被保護的内容。如上所述,在進行上述4 個著色階段任一者的工作前,需先產生AE S金鐘,用以解 密位元流。解密金鑰亦可用於VLD資料流。 雖然本發明已以較佳實施例揭露如上,然其並非用以 限定本發明,任何所屬技術領域中具有通常知識者,在不 脫離本發明之精神和範圍内,當可作些許之更動與潤飾, 因此本發明之保護範圍當視後附之申請專利範圍所界定者 為準。 【圖式簡單說明】 第1圖係為多管線處理系統之一可能實施例。 弟2圖為弟1圖的繪圖處理糸統之可程式元件。 第3圖為第1圖的繪圖處理單元的功能方塊示意圖。 第4圖為第1圖的繪圖處理單元的一可能實施例。 第5圖係為將影像播放功能映射到多著色器結構的一 可能實施例,其使用第2圖的繪圖處理單元。 第6圖為進行影像播放的一可能實施例,其中該影像 S3U07-0009!00-TW/0608D-A42795-TW/Final 29 201216200 播放使用多著色器結構。 第7圖為複製高級加密系統(AES)資訊的一可能實施 例。 【主要元件符號說明】 100 :電腦系統; 104 : 系統記憶體; 105 :圖元資料; 106 : 紋理資料; 102 :中央處理器; 108 : 系統介面; 110 :繪圖處理單元; 112 : 前端處理器; 113 :光栅波形掃描器; 114 : 紋理快取系統; 118 :紋理濾波器; 119 : 後端處理器; 120 :圖框緩衝器; 130 : 顯示裝置; 200 :繪圖管線; 250 : 記憶體; 252 :命令流處理器; 254 : 頂點著色器; 256 :幾何著色器; 257 : 三角設定階段 258 :線段及磚塊產生器 , 259 :屬性設定階段; 260 : 畫素著色器; 261 :隱藏表面移動器 262 : 圖框緩衝器; 304 :執行單元群控制及快取次系統; 306 :可程式執行單元群 ;308 : 頂點著色器; 310 :幾何著色器; 312 : 晝素著色器; 314 :三角設定單元; 316 : 屬性設定單元; 318 :線段及磚塊產生器 ;424 : 繪圖處理管線; 426 :快取系統; 430 : 頂點著色器; 432 :幾何著色器; 434 : 光柵波形掃描器 436 :畫素著色器; 440 : 頂點流快取記憶 S3U07-0009!00-TW/0608D-A42795-TW/Final 30 201216200 高級加密系統模組;514 :可變長度解碼; 反離散餘弦轉換函數/移動壓縮; 迴圈内去方塊濾波(IDF) ; 520 :後處理功能 442 : L1快取記憶體; 446 : Z快取記憶體; 504 : GP0 階段; 508 : GP2 階段; 524 : 516 : 518 : 526 :排程器; 444 : L2快取記憶體; 448 :紋理快取記憶體; 506 : GP1 階段; 510 : GP3 階段; 528 :計數暫存器。 S3U07-0009!00-TW/0608D-A42795-TW/FinalCHKCTR Source] (RS1) describes a group consisting of 4 registers, which is a 16-bit up counter that counts to &. The value of the counter that is not used is set to 〇, so the leaf value of the counter will always be less than or equal to a wait parameter. When the CRF register is located at the source port, the bits 〇 15 15 indicate the counter value of the counter 、, and the bits 16 to 31 indicate the count value of the counter i..... The bits 112 127 to 127 represent the counter 7 Value. When the CRF register is located at (source 1)+1, the bits 〇15-15 indicate the count value of the counter 8, and the bit forces 16~31 indicate the count value of the counter 9.....bits 112~127 indicate The count value of the counter 15. When RefO is less than or equal to Cntr 〇, and Ref 丨 is less than or equal to ~ stove i, and ..., and Ref 15 is less than or equal to the comparison operation of the Cntr row unit. If the result of the comparison is true, the operation of the thread will continue. If the result of the comparison is false, the thread's operation will be suspended until it can be confirmed. At this point, the thread's operation will remain in the boot state. _ EUP sends the count value to all execution units through the phase bus, and the main counter is updated. Only one of the counters in the EUP will be updated during each cycle. 
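The pass/fail rule just described can be condensed into a short sketch. This is a software model under the stated packing, not the execution unit's actual logic; the type and function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Sixteen 16-bit up-counters; packed in hardware as two 128-bit CRF
 * words (counters 0-7 in Source 1, counters 8-15 in (Source 1)+1),
 * modeled here as a plain array for clarity. */
typedef struct { uint16_t cntr[16]; } eup_counters_t;

/* CHKCTR pass test: Ref_i <= Cntr_i must hold for every i; unused
 * reference values are 0 and therefore always pass. */
static bool chkctr_passes(const eup_counters_t *c, const uint16_t ref[16])
{
    for (int i = 0; i < 16; i++)
        if (ref[i] > c->cntr[i])
            return false;   /* thread is suspended, but stays active */
    return true;            /* thread continues */
}
```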
By way of illustration, the general sequence for operating the various shader stages, including the synchronization architecture and the instructions STREG and CHKCTR described above, is as follows. First, the input data held in CRF (general-purpose register) registers 0 and 1 is parsed; the input data is 512 bits wide, and the wait times are determined from the counter values. Next, the instruction CHKCTR is executed at least once to confirm that all input and output buffers are ready. If necessary, input data is read from one or more buffers; in general, the addresses of these buffers are carried in the 512-bit input data. A number of computations are then performed and the results are written to buffers. Afterwards, a range within the execution-unit L2 cache is flushed and/or invalidated. If needed, the instruction STREG is used to maintain memory coherence, and STREG can also be used to invalidate the texture cache. The EUP synchronization counters of the other shader stages are then updated with STREG, and finally an external fence is sent to the graphics driver to indicate how far the hardware has progressed.

Because the stages serve different purposes, each general-purpose shader stage uses an independent fence address unless doing so would stall the synchronization counters. The counter values increase monotonically; each thread waits on a counter value before it starts, and the counter is updated before the thread's operation ends.
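A compact sketch of this per-thread sequence follows, assuming hypothetical names (chkctr_wait, eup_counter, and the buffer arguments) in place of the CHKCTR and STREG instructions and the EU-pool buffer traffic; it is a reading aid, not the patent's microcode.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static volatile uint16_t eup_counter[16];        /* EUP sync counters     */

static void chkctr_wait(const uint16_t ref[16])  /* CHKCTR in a loop      */
{
    for (int i = 0; i < 16; i++)
        while (eup_counter[i] < ref[i]) { }      /* suspended, yet active */
}

/* One GP-stage thread: parse the 512-bit CRF input, wait on the sync
 * counters, do the work, then release the downstream stage. */
static void gp_stage_thread(const uint16_t crf_in[32],      /* CRF r0-r1 */
                            const uint8_t *in_buf, uint8_t *out_buf,
                            size_t n, int my_counter)
{
    uint16_t ref[16];
    memcpy(ref, crf_in, sizeof ref);  /* wait refs travel with the input  */
    chkctr_wait(ref);                 /* are all in/out buffers ready?    */
    memcpy(out_buf, in_buf, n);       /* stand-in for the real computing  */
    /* STREG would flush/invalidate the EU L2 range here and, when a
     * later reader needs it, invalidate the texture cache as well.       */
    eup_counter[my_counter]++;        /* release the downstream stage     */
    /* an external fence would finally report progress to the driver     */
}
```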
For video decoding and video post-processing, when the operation of the shader stages is tied to the synchronization architecture, AES decryption is also applied to protected video. The decryption performed in the CSP can run as part of a virtual page table (VPT) block. When video decryption is required, a region of PCIE system memory is copied and decrypted into video memory by reading the video frame buffer that feeds the VLD shader stage. Because the storage space of video memory is limited, its buffers are reused: external fence commands let the driver rewrite a buffer safely, and the EUP counters rely on the EUP fence/wait architecture, that is, the CSP's internal wait command together with the instruction STREG issued from the GP shader stages.

Since the video memory buffers are limited, they must be reused, and a buffer is refilled with new data after it has been read. Storing the commands in a DMA buffer and then executing the reads held in that buffer would incur a long delay. Therefore, to stall the hardware as little as possible, the driver places an internal wait command before the AES copy command; the wait lasts until a counter reaches a preset value, and only then is the destination read and overwritten. After the internal wait command, a fence command can be issued to ensure that the AES copy has completed, and the counter is then updated to indicate that the input data of the GP0 shader stage is valid.

Under the internal wait command, the CSP reads only the first four counters (0 to 3), but the CSP can update any of the 16 count registers, each of which is 16 bits wide. A GP shader stage (such as GP0) reads the counters set by the CSP, and under the internal wait command a counter can be set by the instruction STREG inside a shader stage (such as GP0).

It should be emphasized that the multi-GP-stage architecture described above provides a comparatively flexible programmable model, so video decoding performance can be tuned to user requirements. The tunable aspects include thread granularity and the cache hit rate. For each video-decoding thread, the unit of data processed can be a macroblock (MB), a slice (MC/IDF, and so on), or a frame, and threads running in parallel can process one or more frames. Different data granularities yield different decoding efficiencies and drivers of different complexity.

To illustrate, an example use of the multi-GP-stage architecture follows. In this example a pipeline consists of GP0, GP1, and GP2, although this is not meant to limit the invention. It is also assumed that the number of slices in a particular frame is known and that each stage of the decoding process has an appropriate number of kickoffs, where a kickoff generally represents one run of a particular stage. For example, for a frame with two slices, GP0 starts with two kickoffs (one per slice), GP1 has two kickoffs (one per slice), and GP2 then has one kickoff (for the whole frame).

As described earlier, GP0 performs variable-length decoding. GP0's input data comprises the slice addresses and the parameters associated with the slice data, and further includes the addresses of the output buffers. The EUP waits on the value of counter 0 to avoid over-supplying input data to the motion-compensation stage (GP1), and GP1 updates counter 0. As noted above, multiple shader stages and their corresponding tasks are synchronized through the local fence/wait synchronization architecture built on the 16 count registers, each 16 bits wide and maintained by the EUP (execution unit pool). The driver generally provides output buffers arranged as an array, sized so that even the largest slice output can be driven adequately. The shader stage stores the motion-compensation data in at least one of these buffers. The input data packet (for example, how many buffers have been written) is written to the subsequent GP1 stage (which performs the motion compensation) by the GP0 stage, by the driver, or by some other combination. After the current stage completes, the EUP continues to the next decoding stage only once the counter value reaches a preset value.

The AES decryption state in the CSP is updated after a GP0 thread completes, or only after the MC thread completes. In this example it is quite possibly unnecessary to flush or invalidate the data held in the execution units' L2 cache, so in the GP0 stage the corresponding register control bits are set to 0. Likewise, it may be unnecessary to invalidate the texture cache, in which case the corresponding control bit is also set to 0. The fence data is written to the fence address. Once a GP0 thread has started, another GP0 thread can be launched immediately; in one possible embodiment the total number of GP0 threads does not exceed two.
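In miniature, the counter-0 handshake between GP0 (the producer) and GP1 (the consumer) described above might look as follows; NBUF, the spin loops, and all of the names are assumptions made for illustration.

```c
#include <stdint.h>

#define NBUF 4u                      /* hypothetical ring of slice buffers */
static volatile uint16_t counter0;   /* bumped by GP1 per retired slice    */

/* GP0 (VLD) side: never run more than NBUF slices ahead of GP1. */
static void gp0_produce_slice(uint16_t slices_issued)
{
    /* CHKCTR-style wait on the issued-minus-retired occupancy. */
    while ((uint16_t)(slices_issued - counter0) >= NBUF) { /* spin */ }
    /* ... variable-length decode one slice into the next ring buffer ... */
}

/* GP1 (MC) side: consume one slice, then release its buffer. */
static void gp1_retire_slice(void)
{
    /* ... transform + motion compensation for one slice ... */
    counter0++;                      /* the STREG update of counter 0 */
}
```

The 16-bit wraparound subtraction matches the width of the up-counters, so the occupancy test stays correct even after a counter overflows.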
The GP1 stage performs the transform and the motion compensation; in other embodiments GP1 can also perform de-blocking. In general, one thread processes one complete slice. The input data packet includes the total number of motion-compensation buffers (covering the MBC, MV, and residual data), the output buffer address (the address of the decoded frame), a texture mapping table, and other data. The EUP waits on the counting results of counter 1 and counter 2: counter 1 indicates that all reference frames have been decoded, and counter 2 indicates whether the output buffer has been freed for writing. The shader reads the motion-compensation buffers and produces the decoded frame. After the GP1 operation completes, counter 0 is updated. The AES decryption results are supplied to the VLD input buffer.

While the GP1 stage operates, the following tasks are performed. The external fence data is written to the fence address. The EUP's L2 cache is flushed, so that when the decoded frame is later read through the texture cache, the result can serve as a reference for decoding subsequent frames. The texture cache is normally invalidated. A GP1 thread is launched after a GP0 thread completes.

The GP2 stage performs the in-loop deblocking filter (IDF) operation on a frame or a field, along with other operations such as de-interlacing. In general, one thread processes a whole frame of data. In the GP2 stage, the input data includes the address of the decoded frame, the output buffer address, and other driver-defined data. The EUP waits for a while to ensure that the output buffer can be written without its data being over-read. After the GP2 shader stage completes, the corresponding counter is updated, the external fence is written out, and the EUP's L2 cache is flushed. The output of the GP2 stage is read according to what the following stage does. For example, if the next stage performs additional post-processing (writing data under texture operations), the data at the corresponding addresses of the EUP's L2 cache is flushed. In other embodiments, if the next stage reads the data for display as a texture (through the display interface unit, or DIU), the cache is likewise flushed. If the output buffer overwrites data previously read through the texture cache, the texture cache is invalidated to avoid reading stale data. As noted above, GP3 performs general post-processing such as de-interlacing, scaling, and color-space conversion.
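Before turning to the flowcharts, the GP1 entry conditions above can be condensed into a sketch; the counter roles follow the text, while the array and argument names are assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

static volatile uint16_t counters[16];   /* EUP-maintained, 16-bit each */

/* GP1 entry test: counter 1 says the reference frames are decoded,
 * counter 2 says the output frame buffer is free to overwrite. */
static bool gp1_may_start(uint16_t refs_needed, uint16_t outbuf_needed)
{
    return counters[1] >= refs_needed && counters[2] >= outbuf_needed;
}
```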
FIG. 6 is a possible flowchart of video processing that uses the multi-shader architecture together with the CSP. Block 610 maps the functions required for video playback onto a number of shaders; in some embodiments, the mapping architecture shown in FIG. 3 can be used. In block 620, a wait value is retrieved for each shader; these wait values are related to the execution times of the corresponding shaders. In block 630, the shaders are run in parallel according to the retrieved wait values. In general, blocks 620 and 630 correspond to the synchronization architecture described earlier, which uses the count registers 528 shown in FIG. 5.

FIG. 7 shows a possible embodiment of copying advanced encryption standard (AES) information. Before the first shader stage (such as GP0) runs, block 710 begins copying the AES information; in particular, this step copies the encrypted bitstream data from peripheral component interconnect express (PCIE) memory to the frame buffer. In block 720, the bitstream is decrypted while the copy is performed, and the decrypted bitstream is then copied into a frame buffer. This processing is used to play back protected content. As described above, before any of the four shader stages does its work, the AES key must first be generated so that the bitstream can be decrypted; the decryption key can also be used for the VLD data stream.
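A minimal sketch of the copy-with-decryption of blocks 710 and 720 is given below, assuming a hypothetical aes_decrypt_block() in place of the CSP's AES engine; the real key schedule and cipher mode are omitted.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the CSP's AES engine; key schedule and block-cipher
 * mode are deliberately omitted. */
static void aes_decrypt_block(uint8_t blk[16]) { (void)blk; }

/* Copy the encrypted bitstream from PCIE system memory to the frame
 * buffer, decrypting it in flight (blocks 710/720). */
static void copy_and_decrypt(const uint8_t *pcie_src, uint8_t *frame_dst,
                             size_t nbytes)
{
    for (size_t off = 0; off + 16 <= nbytes; off += 16) {
        uint8_t blk[16];
        memcpy(blk, pcie_src + off, 16);
        aes_decrypt_block(blk);           /* decrypt during the copy */
        memcpy(frame_dst + off, blk, 16);
    }
    /* a fence plus a counter update then marks GP0's input as valid */
}
```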
Although the invention has been disclosed above by way of preferred embodiments, they are not intended to limit the invention. Anyone having ordinary skill in the art may make various changes and refinements without departing from the spirit and scope of the invention; the scope of protection of the invention is therefore defined by the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a possible embodiment of a multi-pipeline processing system.
FIG. 2 shows the programmable elements of the graphics processing system of FIG. 1.
FIG. 3 is a functional block diagram of the graphics processing unit of FIG. 1.
FIG. 4 is a possible embodiment of the graphics processing unit of FIG. 1.
FIG. 5 is a possible embodiment of mapping the video-playback functions onto a multi-shader structure, using the graphics processing unit of FIG. 2.
FIG. 6 is a possible embodiment of video playback in which the playback uses the multi-shader structure.
FIG. 7 is a possible embodiment of copying advanced encryption standard (AES) information.

[Description of main component symbols]
100: computer system; 102: central processing unit; 104: system memory; 105: primitive data; 106: texture data; 108: system interface; 110: graphics processing unit; 112: front-end processor; 113: rasterizer; 114: texture cache system; 118: texture filter; 119: back-end processor; 120: frame buffer; 130: display device; 200: graphics pipeline; 250: memory; 252: command stream processor; 254: vertex shader; 256: geometry shader; 257: triangle setup stage; 258: span and tile generator; 259: attribute setup stage; 260: pixel shader; 261: hidden-surface removal unit; 262: frame buffer; 304: execution unit pool control and cache subsystem; 306: programmable execution unit pool; 308: vertex shader; 310: geometry shader; 312: pixel shader; 314: triangle setup unit; 316: attribute setup unit; 318: span and tile generator; 424: graphics processing pipeline; 426: cache system; 430: vertex shader; 432: geometry shader; 434: rasterizer; 436: pixel shader; 440: vertex stream cache; 442: L1 cache; 444: L2 cache; 446: Z cache; 448: texture cache; 504: GP0 stage; 506: GP1 stage; 508: GP2 stage; 510: GP3 stage; 514: variable-length decoding; 516: inverse discrete cosine transform function/motion compensation; 518: in-loop deblocking filter (IDF); 520: post-processing functions; 524: advanced encryption standard (AES) module; 526: scheduler; 528: count registers.

Claims (1)

VII. Claims:

1. A multi-shader system for processing video data in a programmable graphics processing unit, the multi-shader system comprising:
a first shader stage that receives slice data from a frame buffer and performs variable-length decoding, wherein the first shader stage outputs data to a first buffer within the frame buffer;
a second shader stage that receives the output data from the first shader stage and performs transform and motion compensation on the slice data, wherein the second shader stage outputs decoded slice data to a second buffer within the frame buffer;
a third shader stage that receives the decoded slice data and performs in-loop deblocking filtering in the frame buffer;
a fourth shader stage that performs post-processing in the frame buffer; and
a scheduler that schedules the shader stages, the scheduler comprising a plurality of count registers,
wherein the shader stages are synchronized by means of the count registers.

2. The multi-shader system of claim 1, further comprising an advanced encryption standard (AES) module for copying encrypted data from a memory into the frame buffer, wherein the AES module decrypts the encrypted data while the copy is performed, and wherein the AES module decrypts the data in a command stream processor.

3. The multi-shader system of claim 1, further comprising a command stream processor, wherein the command stream processor comprises threads corresponding to the shader stages.

4. The multi-shader system of claim 1, wherein each of the count registers is associated with a corresponding shader stage and holds a wait value, and wherein the count registers are updated after each shader stage completes.

5. The multi-shader system of claim 4, wherein the wait values represent the execution time of each shader stage.

6. The multi-shader system of claim 1, wherein the shader stages are arranged in a pipeline so that the shader stages can run in parallel.

7. The multi-shader system of claim 1, wherein the output data of the first shader stage comprises motion-vector results, residual data, and macroblock control structures.

8. The multi-shader system of claim 1, wherein the second shader stage outputs unfiltered YUV base image data.

9. The multi-shader system of claim 8, wherein the third shader stage performs in-loop deblocking filtering on a frame of the unfiltered YUV base image data to produce final YUV base image data.

10. The multi-shader system of claim 9, wherein the final YUV base image data is returned to the second shader stage when motion compensation is performed.

11. The multi-shader system of claim 1, wherein the post-processing performed by the fourth shader stage comprises film-grain technology and de-interlacing, and wherein the fourth shader stage further processes a complete frame in one complete thread.

12. The multi-shader system of claim 1, wherein, after the video data has been processed, the shader stages can be reconfigured as a vertex shader, a geometry shader, and a pixel shader in a 3D mode.

13. The multi-shader system of claim 1, wherein the video data comprises one of H.264 data, VC-1 data, MPEG-4 data, and MPEG-2 data.

14. A processing method for processing video data with a multi-shader architecture, the processing method comprising:
mapping a plurality of functions required for video playback onto a plurality of shaders;
retrieving wait values of the shaders, wherein each wait value represents the execution time of the corresponding shader; and
running a first shader, a second shader, and a third shader in parallel in a command stream processor, wherein the first shader performs variable-length decoding, the second shader performs transform and motion compensation, and the third shader performs in-loop deblocking filtering.

15. The processing method of claim 14, further comprising running a fourth shader that performs a post-processing, the post-processing comprising at least film-grain technology, de-interlacing, scaling, and color-space conversion.

16. The processing method of claim 14, further comprising: updating the associated count register after each shader completes.

17. The processing method of claim 14, wherein the variable-length decoding performed by the first shader comprises reading slice data from a frame buffer and outputting motion-vector results, residual data, and macroblock control structures to a first buffer within the frame buffer;
wherein the steps performed by the second shader comprise reading the data in the first buffer, performing transform operations and motion compensation on a single slice of data, and outputting data to a second buffer of the frame buffer; and
wherein the third shader re-fetches the data in the second buffer and performs in-loop deblocking filtering on a frame or a field.

18. The processing method of claim 14, further comprising scheduling the shaders according to the wait values of the shaders.

19. A graphics processing system for processing video data, the graphics processing system comprising:
a plurality of shaders arranged in a command stream processor, the shaders comprising:
a first shader that performs variable-length decoding, wherein the first shader outputs data to a first buffer within a frame buffer;
a second shader that receives the output data from the first shader and performs transform and motion compensation on slice data, wherein the second shader outputs decoded slice data to a second buffer within the frame buffer;
a third shader that receives the decoded slice data and performs in-loop deblocking filtering in the frame buffer; and
a fourth shader stage that performs post-processing in the frame buffer;
a plurality of count registers that store the wait values of the corresponding shaders; and
a scheduler that schedules the shaders according to the wait values.

20. The graphics processing system of claim 19, wherein the shaders check the wait values of the other shaders before starting.
TW100117717A 2010-10-15 2011-05-20 Multi-shader system and processing method thereof TWI451355B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US12/905,743 US8681162B2 (en) 2010-10-15 2010-10-15 Systems and methods for video processing

Publications (2)

Publication Number Publication Date
TW201216200A true TW201216200A (en) 2012-04-16
TWI451355B TWI451355B (en) 2014-09-01

Family

ID=44981541

Family Applications (1)

Application Number Title Priority Date Filing Date
TW100117717A TWI451355B (en) 2010-10-15 2011-05-20 Multi-shader system and processing method thereof

Country Status (3)

Country Link
US (1) US8681162B2 (en)
CN (1) CN102254297B (en)
TW (1) TWI451355B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9183609B2 (en) 2012-12-20 2015-11-10 Nvidia Corporation Programmable blending in multi-threaded processing units
US9430810B2 (en) 2013-03-07 2016-08-30 Huawei Technologies Co., Ltd. Drawing method, apparatus, and terminal
TWI571826B (en) * 2015-06-29 2017-02-21 上海兆芯集成電路有限公司 A computer system, graphics processing unit, and graphics processing method thereof
TWI587213B (en) * 2015-04-02 2017-06-11 奇異航空系統有限公司 Avionics display system
TWI602151B (en) * 2015-08-19 2017-10-11 上海兆芯集成電路有限公司 Methods for programmable primitive setup in 3d graphics pipeline and apparatuses using the same
TWI731871B (en) * 2015-09-25 2021-07-01 美商英特爾股份有限公司 Method, apparatus, and non-transitory computer readable media for optimizing clipping operations in position only shading tile deferred renderers

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9378560B2 (en) * 2011-06-17 2016-06-28 Advanced Micro Devices, Inc. Real time on-chip texture decompression using shader processors
US20130188732A1 (en) * 2012-01-20 2013-07-25 Qualcomm Incorporated Multi-Threaded Texture Decoding
JP6080405B2 (en) * 2012-06-29 2017-02-15 キヤノン株式会社 Image encoding device, image encoding method and program, image decoding device, image decoding method and program
CN105446704B (en) * 2014-06-10 2018-10-19 北京畅游天下网络技术有限公司 A kind of analysis method and device of tinter
KR102275712B1 (en) * 2014-10-31 2021-07-09 삼성전자주식회사 Rendering method and apparatus, and electronic apparatus
EP3029940B1 (en) * 2014-12-04 2017-03-15 Axis AB Method and device for post processing of a video stream
GB2540382B (en) * 2015-07-15 2020-03-04 Advanced Risc Mach Ltd Data processing systems
US10575007B2 (en) 2016-04-12 2020-02-25 Microsoft Technology Licensing, Llc Efficient decoding and rendering of blocks in a graphics pipeline
US10157480B2 (en) 2016-06-24 2018-12-18 Microsoft Technology Licensing, Llc Efficient decoding and rendering of inter-coded blocks in a graphics pipeline
US10460513B2 (en) 2016-09-22 2019-10-29 Advanced Micro Devices, Inc. Combined world-space pipeline shader stages
US11197010B2 (en) 2016-10-07 2021-12-07 Microsoft Technology Licensing, Llc Browser-based video decoder using multiple CPU threads
US20180308450A1 (en) * 2017-04-21 2018-10-25 Intel Corporation Color mapping for better compression ratio
CN108022269B (en) * 2017-11-24 2021-09-14 中国航空工业集团公司西安航空计算技术研究所 Modeling system for GPU (graphics processing Unit) compression texture storage Cache
US10467724B1 (en) * 2018-02-14 2019-11-05 Apple Inc. Fast determination of workgroup batches from multi-dimensional kernels
CN112258411B (en) * 2020-10-22 2021-07-13 浙江大学 Shader automatic filtering method, device and system based on function approximation of definition domain and value domain
CN113010293B (en) * 2021-03-19 2023-08-22 广州万协通信息技术有限公司 Multithread concurrent data encryption and decryption processing method, device and storage medium
CN113407325A (en) * 2021-06-30 2021-09-17 深圳市斯博科技有限公司 Video rendering method and device, computer equipment and storage medium

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7929610B2 (en) * 2001-03-26 2011-04-19 Sharp Kabushiki Kaisha Methods and systems for reducing blocking artifacts with reduced complexity for spatially-scalable video coding
US8817029B2 (en) * 2005-10-26 2014-08-26 Via Technologies, Inc. GPU pipeline synchronization and control system and method
CN101072349B (en) * 2006-06-08 2012-10-10 威盛电子股份有限公司 Decoding system and method of context adaptive variable length codes
TWI444047B (en) * 2006-06-16 2014-07-01 Via Tech Inc Deblockings filter for video decoding , video decoders and graphic processing units
US8233527B2 (en) * 2007-05-11 2012-07-31 Advanced Micro Devices, Inc. Software video transcoder with GPU acceleration
US9648325B2 (en) * 2007-06-30 2017-05-09 Microsoft Technology Licensing, Llc Video decoding implementations for a graphics processing unit
US8243086B1 (en) * 2007-12-13 2012-08-14 Nvidia Corporation Variable length data compression using a geometry shading unit
CN101216932B (en) 2008-01-03 2010-08-18 威盛电子股份有限公司 Methods of graphic processing arrangement, unit and execution triangle arrangement and attribute arrangement
US8769207B2 (en) * 2008-01-16 2014-07-01 Via Technologies, Inc. Caching method and apparatus for a vertex shader and geometry shader
US9214007B2 (en) * 2008-01-25 2015-12-15 Via Technologies, Inc. Graphics processor having unified cache system
US8736626B2 (en) * 2008-08-26 2014-05-27 Matrox Graphics Inc. Method and system for cryptographically securing a graphics system

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9183609B2 (en) 2012-12-20 2015-11-10 Nvidia Corporation Programmable blending in multi-threaded processing units
US9430810B2 (en) 2013-03-07 2016-08-30 Huawei Technologies Co., Ltd. Drawing method, apparatus, and terminal
TWI587213B (en) * 2015-04-02 2017-06-11 奇異航空系統有限公司 Avionics display system
US9892551B2 (en) 2015-04-02 2018-02-13 Ge Aviation Systems Limited Avionics display system
TWI571826B (en) * 2015-06-29 2017-02-21 上海兆芯集成電路有限公司 A computer system, graphics processing unit, and graphics processing method thereof
US10037590B2 (en) 2015-06-29 2018-07-31 Via Alliance Semiconductor Co., Ltd. Low-power graphics processing using fixed-function unit in graphics processing unit
TWI602151B (en) * 2015-08-19 2017-10-11 上海兆芯集成電路有限公司 Methods for programmable primitive setup in 3d graphics pipeline and apparatuses using the same
US9892541B2 (en) 2015-08-19 2018-02-13 Via Alliance Semiconductor Co., Ltd. Methods for a programmable primitive setup in a 3D graphics pipeline and apparatuses using the same
TWI731871B (en) * 2015-09-25 2021-07-01 美商英特爾股份有限公司 Method, apparatus, and non-transitory computer readable media for optimizing clipping operations in position only shading tile deferred renderers

Also Published As

Publication number Publication date
TWI451355B (en) 2014-09-01
CN102254297A (en) 2011-11-23
US8681162B2 (en) 2014-03-25
CN102254297B (en) 2013-11-27
US20120092353A1 (en) 2012-04-19

Similar Documents

Publication Publication Date Title
TW201216200A (en) Multi-shader system and processing method thereof
US9058685B2 (en) Method and system for controlling a 3D processor using a control list in memory
US8619085B2 (en) Method and system for compressing tile lists used for 3D rendering
US8854384B2 (en) Method and system for processing pixels utilizing scoreboarding
WO2022048097A1 (en) Single-frame picture real-time rendering method based on multiple graphics cards
US20110148901A1 (en) Method and System For Tile Mode Renderer With Coordinate Shader
US9576340B2 (en) Render-assisted compression for remote graphics
CN108206937B (en) Method and device for improving intelligent analysis performance
US9990690B2 (en) Efficient display processing with pre-fetching
US20150228106A1 (en) Low latency video texture mapping via tight integration of codec engine with 3d graphics engine
US20110227920A1 (en) Method and System For a Shader Processor With Closely-Coupled Peripherals
US20080291208A1 (en) Method and system for processing data via a 3d pipeline coupled to a generic video processing unit
US20110249744A1 (en) Method and System for Video Processing Utilizing N Scalar Cores and a Single Vector Core
US8624896B2 (en) Information processing apparatus, information processing method and computer program
TW201719571A (en) Position only shader context submission through a render command streamer
US6967659B1 (en) Circuitry and systems for performing two-dimensional motion compensation using a three-dimensional pipeline and methods of operating the same
CN114567784B (en) VPU video decoding output method and system for Feiteng display card
US8797325B2 (en) Method and system for decomposing complex shapes into curvy RHTs for rasterization
Wu et al. Image autoregressive interpolation model using GPU-parallel optimization
US10237563B2 (en) System and method for controlling video encoding using content information
KR100917067B1 (en) Video processing
JP3688618B2 (en) Data processing system, data processing method, computer program, and recording medium
TW201137786A (en) System and method for improving throughput of a graphics processing unit
CN114245137A (en) Video frame processing method performed by GPU and video frame processing apparatus including GPU
CN115715464A (en) Method and apparatus for occlusion handling techniques