TWI297468B

TWI297468B - Graphics processor, graphics system, embedded processor, method of performing a graphics processing operation, method of operating a graphics pipeline, method of performing a register write, and method of monitoring a graphics processor

Info

Publication number: TWI297468B
Application number: TW094115854A
Authority: TW
Inventors: Edward A Hutchins; Brian K Angell; Paul Kim
Original assignee: Nvidia Corp
Priority date: 2004-05-14
Filing date: 2005-05-16
Publication date: 2008-06-01
Also published as: EP1759380B1; WO2005114646A2; EP1759380A2; KR20070028368A; KR100865811B1; WO2005114646A3; EP1759380A4; ATE534114T1; JP4914829B2; TW200609842A; JP2007538319A

Abstract

A graphics processor has a programmable Arithmetic Logic Unit (ALU) capable of scalar arithmetic operations for processing pixel packets, with pixel packets formatted in an S1.8 format to improve dynamic range or in a different data format (see Figure 3). The graphics processor may be implemented as a configurable graphics pipeline, in which distributors couple elements of the graphics pipeline to permit the process flow of pixel packets through the pipeline to be reconfigured in response to a command from a host, and a data packet triggers an element of the graphics pipeline to discover an identifier. A configurable test point selector may be used to monitor a selected subset of tap points of the graphics pipeline and count statistics for at least one condition associated with each tap point of the subset of tap points. Pixels may be assigned as even pixels or odd pixels, and the pixel packets of odd and even pixels then interleaved to account for ALU latency.

Description

IX. Description of the Invention:

[Technical Field of the Invention]
The present invention relates generally to programmable processors. More particularly, the present invention is directed to a low power programmable processor for graphics applications.

[Prior Art]
The generation of three-dimensional graphical images is of interest in a variety of electronic games and other applications. Conventionally, some of the steps used to create a three-dimensional image of a scene include generating a three-dimensional model of the objects to be displayed. Geometrical primitives (e.g., triangles) are formed and mapped to a two-dimensional projection along with depth information. Rendering (drawing) the primitives includes interpolating parameters, such as depth and color, over each two-dimensional projection of a primitive.

Graphics processing units (GPUs) are commonly used in graphics systems to generate three-dimensional images in response to instructions from a central processing unit. Modern GPUs typically use a graphics pipeline to process data. FIG. 1 is a prior art diagram of a traditional pipeline architecture, which is a "deep" pipeline having stages dedicated to performing specific functions. A transform stage 105 performs geometrical calculations on primitives and may also perform a clipping operation. A setup/raster stage 110 rasterizes the primitives. A texture address stage 115 and a texture fetch stage 120 are used for texture mapping. A fog stage 130 implements a fog algorithm. An alpha test stage 135 performs an alpha test. A depth test stage 140 performs a depth test for culling occluded pixels. An alpha blend stage 145 performs an alpha blend color combination algorithm.
A memory write stage 150 writes the output of the pipeline. The traditional GPU pipeline architecture illustrated in FIG. 1 is typically optimized for fast texturing using the OpenGL® graphics language. A benefit of the deep pipeline architecture is that it permits fast, high quality rendering of even complex scenes.

There is increasing interest in using three-dimensional graphics in wireless phones, personal digital assistants (PDAs), and other devices where cost and power consumption are important design requirements. However, the traditional deep pipeline architecture requires a significant chip area, resulting in greater cost than desired. Additionally, a deep pipeline consumes significant power even when the stages are performing comparatively little processing. This is because many of the stages consume about the same amount of power regardless of whether they are processing pixels.

As a result of cost and power considerations, the conventional deep pipeline architecture illustrated in FIG. 1 is unsuitable for many graphics applications, such as implementing three-dimensional games on wireless phones and PDAs.

Therefore, what is desired is a processor architecture suitable for graphics processing applications but with reduced power and size requirements.

[Summary of the Invention]
A graphics processor includes a programmable Arithmetic Logic Unit (ALU) stage for processing pixel packets. Scalar arithmetic operations are performed on pixel packets in the ALU stage to implement a graphics function.
One embodiment of a method of performing a graphics processing operation on a pixel includes: identifying a sequence of scalar arithmetic operations to be performed on pixel packets to perform a graphics function; generating a plurality of pixel packets for the pixel, each pixel packet including a subset of pixel attributes to be processed as operands in the sequence of scalar arithmetic operations; reading operands from pixel packets in at least one ALU; and performing scalar arithmetic operations according to a sequence of instructions to carry out the sequence of scalar arithmetic operations.

One embodiment of a graphics processor includes a programmable ALU stage having at least one ALU for processing pixel packets, each ALU programmed with a set of at least one possible scalar arithmetic operation that is performed on an incoming pixel packet having a current instruction, wherein a sequence of arithmetic operations is performed on pixel packets to perform a graphics processing function.

A graphics processor includes an arithmetic logic unit (ALU) for performing scalar arithmetic operations on pixel packets. For selected scalar arithmetic operations, the operands in a pixel packet may be formatted to improve dynamic range. For at least one other scalar arithmetic operation, the pixel packet is formatted in another data format.

In one embodiment of a method, scalar arithmetic operations are identified that can be performed on pixel packets to implement a graphics function. At least one row of pixel packets is generated for each pixel to be processed, each pixel packet including at least one field for a subset of pixel attributes to be processed as operands, the at least one row having an associated instruction sequence. Assigned operands are read in each of a plurality of ALUs, at least one of the operands corresponding to an operand read from a pixel packet within a row of pixel packets.
In each ALU, a scalar arithmetic calculation is performed on the assigned operands according to the instruction sequence. For selected scalar arithmetic operations requiring a result in the range [0, 1], the corresponding operands of the pixel packet are formatted in an S1.8 format, which corresponds to a base-2 representation of operands in the range [-2, +2] with an 8-bit fractional component, and the result of the selected scalar arithmetic operation is clamped to the range [0, 1]. For at least one other scalar arithmetic operation, the pixel packet is formatted in a different data format.

A graphics processor has elements of a graphics pipeline coupled by distributors. The distributors permit the process flow of pixel packets through the pipeline to be reconfigured in response to a command from a host.

In one embodiment of an apparatus, a graphics pipeline includes a plurality of stages. A first distributor is coupled to respective inputs of the plurality of stages. A second distributor is coupled to respective outputs of the plurality of stages. The first distributor and the second distributor are adapted to reconfigure the process flow of pixel packets through the plurality of stages in response to a command from a host.

In one embodiment of a method, a command is received from a software host to reconfigure the process flow of pixel packets through elements of a graphics pipeline. In response, at least one distributor is adjusted to reconfigure the pipeline from a first process flow to a second process flow.

A graphics processor includes an arithmetic logic unit (ALU) stage for processing pixel packets. A pixel may require more than one row of pixel packets to be processed. Pixel packets of different pixels, such as odd pixels and even pixels, are interleaved to account for ALU latency.
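The S1.8 format mentioned in the summary above can be illustrated with a small sketch. It assumes a two's complement encoding with one sign bit, one integer bit, and eight fraction bits, so the largest raw value sits one step below +2; the text only gives the range and the 8-bit fraction, so the encoding details and all function names here are assumptions for illustration.

```python
FRAC_BITS = 8             # 8-bit fractional component, per the text
SCALE = 1 << FRAC_BITS    # 256 steps per unit
RAW_MIN = -2 * SCALE      # -512 encodes -2.0
RAW_MAX = 2 * SCALE - 1   # 511, one step below +2.0 (assumed two's complement)


def to_s1_8(value):
    """Quantize a real number to a raw S1.8 integer, saturating at the range ends."""
    raw = round(value * SCALE)
    return max(RAW_MIN, min(RAW_MAX, raw))


def from_s1_8(raw):
    """Convert a raw S1.8 integer back to a real number."""
    return raw / SCALE


def mul_s1_8(a, b):
    """Multiply two raw S1.8 values and saturate the rescaled product."""
    product = (a * b) >> FRAC_BITS   # rescale back to 8 fraction bits
    return max(RAW_MIN, min(RAW_MAX, product))


def clamp_unit(raw):
    """Clamp a raw S1.8 value to the range [0, 1]."""
    return max(0, min(SCALE, raw))
```

In this sketch an over-bright factor such as 1.5 can be carried through a multiplication without losing range, and the clamp is applied only once, on the final result that has to lie in [0, 1].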

In one embodiment of a method, a sequence of scalar arithmetic operations is identified that can be performed on pixel packets to implement a graphics function on a plurality of pixels. Pixels are assigned as either even pixels or odd pixels. At least two rows of pixel packets are generated for each pixel, each pixel packet including at least one field for a subset of pixel attributes to be processed as operands in the scalar arithmetic operations. The at least two rows have an associated instruction sequence and an identifier indicating whether a pixel packet is for an odd pixel or an even pixel. Rows of pixel packets for even pixels and odd pixels are interleaved in a group of pixel packet rows, with each row of the group assigned for processing in successive clock cycles. In the ALU stage, a current row of pixel packets is received each clock cycle. A scalar arithmetic calculation is performed, according to the instruction sequence, on at least one operand read from the current row, so that the processing of pixel packets is interleaved in the ALU stage.

A graphics pipeline has a configurable process flow of pixel packets through elements of the graphics pipeline. A data packet triggers an element of the graphics pipeline to discover an identifier.
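The even/odd interleaving described above can be sketched as a simple scheduling step. This is a minimal illustration only: the `Row` fields and helper names are hypothetical, and it assumes one row is issued per clock cycle, so that a row of one pixel is followed by a row of the other pixel before the first pixel's next row arrives.

```python
from dataclasses import dataclass
from itertools import zip_longest


@dataclass(frozen=True)
class Row:
    pixel_id: int    # which pixel the row belongs to (hypothetical field)
    row_index: int   # position of the row within that pixel's group of rows
    is_odd: bool     # even/odd identifier carried with the row


def make_rows(pixel_id, is_odd, n_rows):
    """Build the group of rows generated for one pixel."""
    return [Row(pixel_id, i, is_odd) for i in range(n_rows)]


def interleave(even_rows, odd_rows):
    """Alternate the rows of an even pixel and an odd pixel, one row per clock cycle."""
    schedule = []
    for even, odd in zip_longest(even_rows, odd_rows):
        if even is not None:
            schedule.append(even)
        if odd is not None:
            schedule.append(odd)
    return schedule
```

With two rows per pixel, the resulting order is even row 0, odd row 0, even row 1, odd row 1, which leaves one cycle between consecutive rows of the same pixel.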

In one embodiment of a method, a received data packet triggers elements of the graphics pipeline to discover, for each element, an identifier indicating the element's position within the process flow. Each element writes the identifier into a register indicating its relative position within the process flow. In one embodiment, an element reads the current value of the identifier from the data packet, writes that value to a configuration register, increments the identifier, and forwards the data packet with the incremented identifier to the next element in the process flow.

A graphics processor includes a graphics pipeline. Tap points are associated with elements of the graphics pipeline. A configurable test point selector monitors a selected subset of the tap points and counts statistics for at least one condition associated with each tap point of the subset.
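The identifier discovery step described above (read the identifier, store it, increment it, forward the packet) can be sketched as follows. The `Element` class, the dictionary used as the data packet, and the starting value of 0 are assumptions made for illustration and are not taken from the source.

```python
class Element:
    """One element of the pipeline with a configuration register (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.config_register = None   # holds the discovered identifier

    def on_discovery_packet(self, packet):
        """Read the identifier, store it, increment it, and pass the packet on."""
        self.config_register = packet["id"]
        packet["id"] += 1
        return packet


def discover(process_flow, start_id=0):
    """Send one discovery packet through the elements in process-flow order."""
    packet = {"id": start_id}          # assumed data-packet layout
    for element in process_flow:
        packet = element.on_discovery_packet(packet)
    return packet
```

Because each element only sees the packet and its own register, running the same discovery again after the process flow has been reordered gives every element its new relative position without any element knowing the overall layout.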
In one embodiment, the configurable test point selector operates in response to commands from a software host and collects statistics for the software host.

[Embodiments]
FIG. 2 is a block diagram of one embodiment of the present invention. A programmable graphics processor 205 is coupled to a register interface 210, a host interface 220, and a memory interface, such as a direct memory access (DMA) engine 230, for memory read/write operations with a graphics memory (not shown), such as a frame buffer. Host interface 220 permits programmable graphics processor 205 to receive commands from a host for generating graphical images. For example, the host may send vertex data, commands, and program instructions to programmable graphics processor 205. A memory interface, such as DMA engine 230, permits read/write operations to be performed with the graphics memory (not shown). Register interface 210 provides an interface to the registers of programmable graphics processor 205.

Programmable graphics processor 205 may be implemented as part of a system 290 that includes at least one other central processing unit 260 executing a software application 270, the central processing unit acting as the host for programmable graphics processor 205. An exemplary system 290 may, for example, comprise a handheld unit, such as a mobile phone or a personal digital assistant (PDA). For example, software application 270 may include a graphics application 275 for generating graphical images on a display 295. Additionally, as described below in more detail, in some embodiments software application 270 may include a graphics processor management software application 280 for performing management functions associated with programmable graphics processor 205, such as pipeline reconfiguration, register configuration, and testing.
In one embodiment, programmable graphics processor 205, register interface 210, host interface 220, and DMA engine 230 are part of an embedded graphics processing core 250 formed on a single integrated circuit 200 that includes a host, such as a system-on-chip integrated circuit 200 including a central processing unit 260 having software 270 resident in memory. Alternatively, graphics processing core 250 may be disposed on a first integrated circuit and CPU 260 disposed on a second integrated circuit.

FIG. 3 illustrates in more detail a programmable graphics [...] programs instructions for stage 310. Raster stage 310 processes each pixel of a given triangle and determines the parameters that need to be calculated for the pixel as part of rendering, such as color, texture, alpha test, alpha blend, Z depth test, and fog parameters. In one embodiment, raster stage 310 calculates barycentric coefficients for pixel packets. In a barycentric coordinate system, distances in a triangle are measured with respect to its vertices. The use of barycentric coefficients reduces the required dynamic range, which permits the use of fixed point calculations that require less power than floating point calculations.

Raster stage 310 generates at least one pixel packet for each pixel of a triangle that is to be processed. Each pixel packet includes fields for a payload of the pixel attributes required for processing (e.g., color, texture, depth, fog, (x, y) location). Additionally, each pixel packet has associated sideband information that includes an instruction sequence of operations to be performed on the pixel packet. An instruction area (not shown) in raster stage 310 assigns instructions to pixel packets.

FIG. 4 illustrates exemplary pixel packets 430 and 460 for one pixel.
In one embodiment, raster stage 310 partitions pixel attributes into two or more different types of pixel packets, with each type of pixel packet requiring fields only for the pixel attributes operated on by a particular type of instruction. Partitioning the pixel data into smaller units of work reduces bandwidth requirements, and it also reduces processing requirements if, for example, only a subset of the attributes of a pixel needs to be operated on for a particular processing operation.

Each pixel packet has associated sideband information 410 and payload information 420. Exemplary sideband information includes a valid field 412, a kill field 414, a tag field, and an instruction field 416 that includes the current instruction. Exemplary pixel packet 430 includes fields for a first set of (s, t) texture coordinates 422 and 424 along with a fog field 426. Exemplary pixel packet 460 includes a color field 462 and a second set of (s, t) texture coordinates 464 and 466. In one embodiment, each pixel packet represents payload information 420 in fixed point representation. Examples of pixel attributes that may be included in a pixel packet (with a pixel packet size for pixel attributes of 20 bits) include: one sixteen-bit Z depth value (Z16); one 16-bit S/T texture coordinate with a 4-bit level of detail; a pair of color values, each with 8-bit precision; or packed 5555 ARGB color, with five bits in each of the ARGB variables.

The sideband information of a pixel packet may include the (x, y) location of the pixel. However, in one embodiment, raster stage 310 generates a start span command at the (x, y) origin where it begins to walk across a triangle along a scan line. The use of a start span command permits the (x, y) location to be omitted from pixel packets. The start span command informs other entities (e.g., data write stage 355 and data fetch stage 330) of the initial (x, y) location at the start of the scan line.
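The pixel packet layout described above, with sideband fields and a 20-bit fixed point payload, can be sketched as a small data structure. The field order inside the 20-bit payload, the instruction encoding, and all names are assumptions made for illustration; the source gives only the field widths.

```python
from dataclasses import dataclass

PAYLOAD_BITS = 20   # payload size for pixel attributes, per the text


@dataclass
class Sideband:
    valid: bool
    kill: bool
    tag: int
    instruction: int   # current instruction (encoding is hypothetical)


@dataclass
class PixelPacket:
    sideband: Sideband
    payload: int       # one 20-bit fixed point payload value


def pack_argb5555(a, r, g, b):
    """Pack four 5-bit channels into 20 bits (A in the top bits is an assumed order)."""
    for channel in (a, r, g, b):
        if not 0 <= channel < 32:
            raise ValueError("each channel must fit in 5 bits")
    return (a << 15) | (r << 10) | (g << 5) | b


def pack_texcoord(coord16, lod4):
    """Pack a 16-bit texture coordinate and a 4-bit level of detail into 20 bits."""
    if not (0 <= coord16 < (1 << 16) and 0 <= lod4 < (1 << 4)):
        raise ValueError("field out of range")
    return (coord16 << 4) | lod4
```

Both example encodings fill exactly 20 bits (4 x 5 and 16 + 4), which is consistent with the packet size quoted in the text.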
The (x, y) location of other pixels along the scan line can be inferred from the number of pixels by which a given pixel is removed from the origin. In one embodiment, data write stage 355 and data fetch stage 330 include local caches adapted to increment a local counter and to update the (x, y) location based on a count of the number of pixels encountered after a start span command.

Referring to FIG. 5, in one embodiment raster stage 310 generates at least one row 510 of pixel packets for each pixel to be processed. In some embodiments, each row 510 has common sideband information 410 that defines an instruction sequence for the row 510. If a pixel requires more than one row 510, the rows 510 are organized as a group of rows that are processed in succession with each new clock cycle. In one embodiment, 80 bits of pixel data are partitioned into four 20-bit pixel attribute register values, with the four pixel register values defining a "row" 510 of pixel packets (R0, R1, R2, and R3) for a pixel.

Iterator registers (not shown) of raster stage 310 have corresponding registers to support the rows 510 of pixel packets. In one embodiment, raster stage 310 includes a register pool supporting up to four rows of pixel packets. Some types of pixel packet attributes, such as texture, may require high precision. Conversely, some types of pixel packet attributes, such as color, may require lower precision. The register pool can be configured to support high precision and low precision values for each pixel packet in a row 510. In one embodiment, the register pool includes 4 high precision and 4 low precision perspective-corrected iterated values per row, plus a Z depth value. This permits, for example, software to assign the precision of the iterators used for processing a particular pixel packet attribute.
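The start span mechanism described earlier, in which the (x, y) location is left out of pixel packets and inferred from a pixel count, can be sketched as a small counter. The `SpanTracker` class is hypothetical, and the sketch assumes that the raster stage walks the scan line in the direction of increasing x.

```python
class SpanTracker:
    """Tracks (x, y) the way a stage's local counter might (hypothetical class)."""

    def __init__(self):
        self.origin = None
        self.count = 0

    def start_span(self, x, y):
        """Record the origin announced by a start span command and reset the count."""
        self.origin = (x, y)
        self.count = 0

    def next_pixel(self):
        """Return the inferred (x, y) of the next pixel packet on the scan line."""
        if self.origin is None:
            raise RuntimeError("no start span command received")
        x0, y0 = self.origin
        position = (x0 + self.count, y0)
        self.count += 1
        return position
```

A stage such as the data fetch stage or data write stage would keep one such counter and call `next_pixel` for each pixel packet it sees after a start span command.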
In one embodiment, raster stage 310 includes a register pool adapted to track the integer portion of a texture, permitting the fractional bits of the texture to be sent as a data packet.

Raster stage 310 may, for example, receive instructions from the host that require an operation to be performed on a pixel. In response, raster stage 310 generates one or more rows 510 of pixel packets having an associated instruction sequence, with the rows of pixel packets and the instructions arranged to perform the desired processing operation. As described below in more detail, in one embodiment ALU stage 340 permits scalar arithmetic operations to be performed in which the operands include a pre-selected subset of pixel attributes within a row 510 of pixel packets, constant values, and temporarily stored results of previous calculations on pixel packets.

A variety of graphics operations can be formulated as one or more scalar arithmetic operations. Additionally, a variety of vector graphics operations can be formulated as a plurality of scalar arithmetic operations. Thus, it will be understood that the programmable graphics processor 205 of the present invention may be programmed to perform, on pixels, any graphics operation that can be expressed as a sequence of scalar arithmetic operations, such as a fog operation, color (alpha) blending, texture combining, an alpha test, or a depth test, such as the operations described in the OpenGL® Graphics System: A Specification (Version 1.2).

The contents of the OpenGL® Graphics System: A Specification (Version 1.2) are incorporated herein by reference. For example, in response to raster stage 310 detecting a desired graphics processing function to be performed on a pixel (e.g., a fog operation), raster stage 310 may use a programmable mapping table or mapping algorithm to determine the assignment of pixel packets and of the associated instructions for performing the scalar arithmetic operations required to implement the graphics function on the pixel. The mapping may, for example, be programmed by graphics processor management software application 280.

Referring again to FIG. 3, as each pixel of a triangle is walked by raster stage 310, raster stage 310 generates pixel packets for further processing, which are received by gatekeeper stage 320. Gatekeeper stage 320 performs a data flow control function. In one embodiment, gatekeeper stage 320 has an associated scoreboard 325 for scheduling, load balancing, resource allocation, and hazard avoidance of pixel packets. Scoreboard 325 tracks the entry and retirement of pixels. Pixel packets entering gatekeeper stage 320 set the scoreboard, and the scoreboard is reset as pixel packets drain out of programmable graphics processor 205 after completion of processing. As an illustrative example, if a compact display 295 has an area of 128x32 pixels, scoreboard 325 may maintain a table for each pixel of the display in order to monitor pixels.

Scoreboard 325 provides several benefits. For example, scoreboard 325 prevents a hazard when one pixel in a triangle lies on top of another pixel that is being processed and is in flight. In one embodiment, scoreboard 325 monitors idle conditions and uses the scoreboarding information to clock off idle units. For example, if there are no valid pixels, scoreboard 325 may shut off the ALUs to save power. As described below in more detail, scoreboard 325 tracks pixel packets that are capable of being processed by ALUs 350 along with pixel packets that have the kill bit set, which flow through ALUs 350 without being actively processed. In one embodiment, scoreboard 325 tracks the (x, y) positions of recirculated pixel packets. If a pixel packet is recirculated, scoreboard 325 increments the instruction sequence in the pixel packet to the next instruction for the pixel on the subsequent pass; for example, if the instruction is for a fog operation on pass number 1, the instruction is iterated to an alpha blend operation on pass number 2.

Data fetch stage 330 fetches data for the pixel packets passed on by gatekeeper 320. This may include, for example, fetching color, depth, and texture data by performing appropriate color, depth, or texture data reads for each row of pixel packets. Data fetch stage 330 may, for example, fetch pixel or texel data by requesting a read from a memory interface (e.g., reading a frame buffer (not shown) using DMA engine 230). In one embodiment, data fetch stage 330 may also manage local caches, such as a texture/fog cache 332, a color/depth cache 334, and a Z cache for depth data (not shown). The fetched data is placed onto the corresponding pixel packet field before the pixel packet is sent on to the next stage. In one embodiment, data fetch stage 330 includes instructions for accessing the data required for the pixel packet attribute fields.
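The scoreboard behaviour described above (set on entry, reset on retirement, hazard avoidance for pixels at the same location, and idle detection) can be sketched as a per-pixel table. This is a simplified model under stated assumptions: the class and method names are hypothetical, and a real scoreboard would also track recirculation passes, which are left out here.

```python
class Scoreboard:
    """Per-pixel in-flight table for a small display (hypothetical sketch)."""

    def __init__(self, width, height):
        self.width = width
        self.in_flight = [False] * (width * height)

    def try_enter(self, x, y):
        """Admit a pixel unless one at the same (x, y) is still in flight."""
        index = y * self.width + x
        if self.in_flight[index]:
            return False              # hazard: the caller must stall this pixel
        self.in_flight[index] = True
        return True

    def retire(self, x, y):
        """Reset the entry when the pixel drains out of the processor."""
        self.in_flight[y * self.width + x] = False

    def idle(self):
        """True when nothing is in flight, so the ALUs could be clocked off."""
        return not any(self.in_flight)
```

For the 128x32 display used as the illustrative example in the text, the table holds 4096 one-bit entries.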
These instructions are held in an instruction random access memory (RAM). In some embodiments, data fetch stage 330 also performs a Z depth test. In this embodiment, data fetch stage 330 compares the Z depth value of a pixel packet with stored Z values using one or more depth comparison tests. If the Z depth value of the pixel indicates that the pixel is occluded, the kill bit is set.

The rows of pixel packets enter an arithmetic logic unit (ALU) stage 340 for processing. ALU stage 340 has a set of ALUs 350 including at least one ALU 350, such as ALUs 350-0, 350-1, 350-2, and 350-3. Although four ALUs 350 are illustrated, more or fewer ALUs 350 may be used in ALU stage 340, depending upon the application. An individual ALU 350 reads the current instruction for at least one row 510 of pixel packets and implements any instruction to perform a scalar arithmetic operation that the ALU is programmed to support. Instructions are included in each ALU 350 and may, for example, be stored in a local instruction RAM (not shown in FIG. 3).

Each ALU 350 includes instructions for performing at least one arithmetic operation on a first product of operands (a*b) and a second product of operands (c*d), where a, b, c, and d are operands and * is multiplication. Some or all of the operands may correspond, for example, to register value attributes within a row 510 of pixel packets. An ALU 350 may also have one or more operand values that are constants or are software loadable. In some embodiments, an ALU may support the use of temporarily stored results of previous operations on pixel packets.

In one embodiment, each ALU 350 is programmable. A crossbar (not shown) or other programmable selector may be included within ALU 350 to permit the operands and the destination of the result to be selected in response to instructions from software (e.g., software application 270).
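The operand selection and two-product arithmetic described above can be sketched as a single function. Several things here are assumptions for illustration: the instruction is a plain dictionary, operand sources are (bank, index) pairs naming row registers "R", temporaries "T", or constants "C", and the sum of the two products is used as one plausible operation on (a*b) and (c*d).

```python
def alu_execute(instruction, row, temps, constants):
    """Select four operands, combine the two products, and route the result."""
    banks = {"R": list(row), "T": list(temps), "C": list(constants)}
    # Crossbar-style select: each source names a bank and an index within it.
    a, b, c, d = (banks[bank][index] for bank, index in instruction["sources"])
    result = (a * b) + (c * d)
    dest_bank, dest_index = instruction["dest"]
    if dest_bank == "C":
        raise ValueError("constants are not a valid destination")
    banks[dest_bank][dest_index] = result
    return banks["R"], banks["T"]
```

As a usage example, a blend of the form source*alpha + destination*(1 - alpha) maps onto one instruction: a and c come from row registers, b and d come from constants holding alpha and its complement, and the result is stored in a temporary.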
For example, in one embodiment an opcode instruction can be used to select the source of each operand (a, b, c, d) as any one of the registers of pixel packet row 510, a temporary value, or a constant value. In this embodiment, the opcode instruction also tells ALU 350 where to send the result of the arithmetic operation: updating the pixel packet with the result, storing the result as a temporary value, or both. Thus, for example, a programmed ALU can read specified attributes in a pixel packet as operands and apply the scalar arithmetic operation specified by the current instruction. The opcode can also include commands to complement an operand (for example, compute 1 - x, where x is the value read), to negate an operand (for example, compute -x, where x is the value read), or to clamp an operand or a result. Other opcode commands may include, for example, a command to select a data format.

One example of an arithmetic operation performed by ALU 350 is a scalar arithmetic operation of the form (a*b)+(c*d) on at least one variable within a pixel packet, where a, b, c, and d are operands and * denotes multiplication. Preferably, each ALU 350 can also be programmed to perform other mathematical operations, such as complementing and negating operands. Additionally, in some embodiments each ALU 350 can compute the minimum and maximum of (a*b, c*d) and perform logical comparisons (for example, a logical result indicating whether a*b is equal to, not equal to, less than, or less than or equal to c*d).

In some embodiments, each ALU 350 can also include instructions for determining whether to generate a kill bit in kill field 414 based on a test, such as a comparison of a*b with c*d (for example, kill if a*b is not equal to c*d, kill if a*b is equal to c*d, kill if a*b is less than c*d, or kill if a*b is greater than or equal to c*d).
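The scalar operation and its operand modifiers can be illustrated with a minimal sketch. This is not the hardware design: the `Operand` class, the `alu_op` and `kill_test` functions, the use of floating-point values, and the order in which complement and negate are applied are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Operand:
    """One ALU operand plus the optional modifiers named in the text."""
    value: float
    complement: bool = False  # use 1 - x instead of x
    negate: bool = False      # use -x instead of x

    def resolve(self) -> float:
        # Assumption: complement is applied before negate when both are set.
        x = 1.0 - self.value if self.complement else self.value
        return -x if self.negate else x


def alu_op(a, b, c, d, clamp=False):
    """Compute (a*b)+(c*d), optionally clamping the result to [0, 1]."""
    result = a.resolve() * b.resolve() + c.resolve() * d.resolve()
    return max(0.0, min(1.0, result)) if clamp else result


def kill_test(a, b, c, d, condition):
    """Return True when the comparison of a*b with c*d says to kill."""
    left = a.resolve() * b.resolve()
    right = c.resolve() * d.resolve()
    outcomes = {"ne": left != right, "eq": left == right,
                "lt": left < right, "ge": left >= right}
    return outcomes[condition]


# Blend example: src*alpha + dst*(1 - alpha) as a single (a*b)+(c*d) operation.
src, dst, alpha = 0.5, 1.0, 0.25
blended = alu_op(Operand(src), Operand(alpha),
                 Operand(dst), Operand(alpha, complement=True))
print(blended)  # 0.875
```

The blend shows why the complement modifier is useful: a linear interpolation fits in one (a*b)+(c*d) operation without a separate subtraction step.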
Examples of ALU operations that can generate a kill bit include an alpha test, in which a color value is compared with a test color value, as in the expression IF (alpha > alpha reference) then kill the pixel, where alpha is the color value and alpha reference is the reference color value. Another example of an ALU operation that can generate a kill bit is a Z depth test, in which the Z value of a pixel is compared with at least one Z value of a previous pixel at the same position, and the pixel is killed if the depth test indicates that it is occluded.

In one embodiment, if the kill bit is set in a pixel packet, an individual ALU 350 is disabled with respect to processing that pixel packet. In one embodiment, a clock gating mechanism is used to disable ALU 350 when a kill bit is detected in the sideband information. As a result, once a kill bit has been generated for a pixel packet, ALUs 350 do not waste power on that pixel packet as it propagates through ALU stage 340. Note, however, that a pixel packet with the kill bit set still propagates forward, which allows it to be accounted for by data write stage 355 and scoreboard 325. In this way all pixel packets are accounted for by scoreboard 325, even those marked by a kill bit as requiring no further ALU processing.

In one embodiment, if any row 510 of a pixel is marked by a kill bit, the other rows 510 of the same pixel are also killed. This can be accomplished, for example, by forwarding kill information between stages or by having one or more stages keep track of the pixels in which a row 510 is marked by a kill bit. In some embodiments, once a kill bit is set, only the sideband information 410 of the pixel packet row 510 (which includes the kill bit) is propagated to the next stage.

The output of ALU stage 340 goes to data write stage 355.
Data write stage 355 converts the processed pixel packets into pixel data and writes the result to a memory interface (e.g., via DMA engine 230). In one embodiment, the write values for a pixel are accumulated in write buffer 352, and the accumulated writes for the pixel are written to memory in a batch. Examples of functions that data write stage 355 can perform include color and depth write-back and format conversion. In some embodiments, data write stage 355 can also identify pixels to be killed and set the kill bit.

Recirculation path 360 is included to recirculate pixel packets back to gatekeeper 320. Recirculation path 360 allows, for example, processing that requires more than one pass through ALU stage 340 to perform a sequence of arithmetic operations. Data write stage 355 indicates retired writes to gatekeeper stage 320 for scoreboarding.

FIG. 6 is a block diagram of an exemplary individual ALU 350. ALU 350 has an input bus 605 with data buses for receiving a pixel packet row 510 in corresponding registers R0, R1, R2, and R3. Instruction RAM 610 is included for ALU instructions. An exemplary instruction set is illustrated in block 620. In one embodiment, ALU 350 can be programmed to read any of the four 20-bit register values from row 510 and to select a set of operands from row 510. In addition, ALU 350 can be programmed to select temporary values from registers (T) 630 as operands, such as two 20-bit temporary values per ALU 350, which can be temporarily stored from a previous result as indicated by path 640. ALU 350 can also select as an operand a constant value (not shown), which can likewise be programmed by software. In one embodiment, a first stage of multiplexers (MUXs) 645 selects the operands from the pixel packet row, any temporary values 630, and any constant values (not shown).
Format conversion modules 650 can be included to convert the operands into a desired data format suited to the computational precision of ALU 350 in arithmetic computation unit 670. ALU 350 includes elements that allow each operand or its complement to be selected in a second stage of MUXs 660. The resulting four operands are input to scalar arithmetic computation unit 670, which can perform two multiplications and one addition. The resulting value can optionally be clamped to a desired range (e.g., 0 to 1.0) using clamper 680. The pixel packet row 510 exits on bus 690.

In one embodiment, selected pixel packet attributes can be in a sign 1.8 (S1.8) format. The S1.8 format is a base-2 number with an 8-bit fraction, in the range [-2, +2]. The S1.8 format allows a higher dynamic range for calculations. For example, in lighting calculations the S1.8 format allows an increased dynamic range, resulting in improved realism. If the result of a scalar arithmetic operation performed in S1.8 must lie in the range [0, 1], the result can be clamped to force it into the range [0, 1]. As an illustrative example, a shading calculation on color data can be performed in S1.8 format and the result then clamped. Note that in embodiments of the present invention, different types of pixel packets can have data attributes represented in different formats. For example, color data can be represented in a first type of pixel packet in S1.8 format, whereas (s, t) texture data can be represented in a second type of pixel packet in a high-precision 16-bit format. In some embodiments, the pixel packet bit size is set by the bit size required by the highest-precision pixel attribute. For example, because texture attributes generally require greater precision than color, the pixel packet size can be chosen to represent texture data with high precision, such as 16-bit texture data.
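A minimal sketch of an S1.8-style fixed-point value follows. The text does not state the bit-level encoding, so this sketch assumes a 10-bit two's-complement layout (sign bit, one integer bit, eight fraction bits) with saturation on conversion; the function names are hypothetical. Under that assumption a value occupies ten bits, so two of them fit in one of the 20-bit registers mentioned above.

```python
FRACTION_BITS = 8
WIDTH = 10                      # assumed: sign + 1 integer bit + 8 fraction bits
MASK = (1 << WIDTH) - 1
RAW_MIN = -(1 << (WIDTH - 1))   # -512, i.e. -2.0
RAW_MAX = (1 << (WIDTH - 1)) - 1  # 511, i.e. just under +2.0


def to_s1_8(x: float) -> int:
    """Encode a float as a 10-bit two's-complement field, saturating."""
    raw = round(x * (1 << FRACTION_BITS))
    raw = max(RAW_MIN, min(RAW_MAX, raw))
    return raw & MASK


def from_s1_8(bits: int) -> float:
    """Decode a 10-bit field back to a float."""
    raw = bits - (1 << WIDTH) if bits & (1 << (WIDTH - 1)) else bits
    return raw / (1 << FRACTION_BITS)


def clamp_unit(bits: int) -> int:
    """Clamp an encoded value to the range [0, 1]."""
    return to_s1_8(max(0.0, min(1.0, from_s1_8(bits))))


def pack_pair(low: int, high: int) -> int:
    """Pack two 10-bit fields into one 20-bit word."""
    return (high << WIDTH) | low


def unpack_pair(word: int):
    """Split a 20-bit word back into its two 10-bit fields."""
    return word & MASK, (word >> WIDTH) & MASK


overbright = to_s1_8(1.5)                 # representable before clamping
print(from_s1_8(overbright))              # 1.5
print(from_s1_8(clamp_unit(overbright)))  # 1.0
word = pack_pair(to_s1_8(0.25), to_s1_8(-0.5))
print([from_s1_8(f) for f in unpack_pair(word)])  # [0.25, -0.5]
```

The intermediate value 1.5 survives until the explicit clamp, which is the dynamic-range benefit described above: an intermediate result may leave [0, 1] without losing information.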
The improved dynamic range of the S1.8 format allows, for example, data for more than one color component to be packed efficiently into a 20-bit pixel packet size, a size selected for texture data that requires higher precision (for example, 16-bit texture data and a 4-bit level of detail (LOD)). For example, because each S1.8 color component requires 10 bits, two color components can be packed into a 20-bit pixel packet.

FIG. 7 illustrates an exemplary ALU stage 340 that includes more than one ALU 350 configured as a pipeline, in which two or more ALUs 350 are chained together. As previously described, an individual ALU 350 can be programmed to read one or more operands from a pixel packet, generate the result of an arithmetic operation, and use the result to update the pixel packet or a temporary register. Each ALU can be assigned to read operands, generate an arithmetic result, and update one or more pixel packets or temporary values before passing the pixel packet row on to the next ALU. The data flow between ALUs 350 in ALU stage 340 can be configured in a variety of ways, depending on the processing operations to be performed, ALU latency, and efficiency considerations.

As previously described, the present invention allows each ALU to be programmed to read selected operands within a pixel packet row and to update a selected pixel packet register with the result. In one embodiment, ALU stage 340 includes at least one ALU 350 for each color channel (e.g., red, green, blue, and alpha). This allows, for example, load balancing in which the ALUs are configured to operate in parallel on a pixel packet row 510 (albeit at different points in time because of pipelining) to perform similar or different processing tasks. As one example of how ALUs 350 can be programmed, a first ALU 350-0 can be programmed to perform calculations for a first color component, a second ALU 350-1 for a second color component, a third ALU 350-2 for a third color component, and a fourth ALU 350-3 to perform a fog operation. Thus, in some embodiments each ALU 350 can be assigned a different processing task for a pixel packet row 510. Additionally, as described in more detail below, in some embodiments software can configure ALUs 350 to select the data flow of ALUs 350 within ALU stage 340, including the order of execution of ALUs 350. Because the data flow is configurable, it will be understood that in some embodiments the data flow along a chain of ALUs can be arranged so that the result of one ALU 350-0 updates one or more pixel packet registers that are then read as operands by a subsequent ALU 350-1.
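A chain in which one ALU updates a register that the next ALU reads can be sketched as follows. The `make_alu` and `run_chain` helpers, the register indices, and the sample values are hypothetical; each "ALU" here is just a function that computes (a*b)+(c*d) on four registers of a row and writes one of them back.

```python
def make_alu(dest, src_a, src_b, src_c, src_d):
    """Build an ALU that writes row[a]*row[b] + row[c]*row[d] into row[dest]."""
    def alu(row):
        row = list(row)  # leave the caller's row untouched
        row[dest] = row[src_a] * row[src_b] + row[src_c] * row[src_d]
        return row
    return alu


def run_chain(row, chain):
    """Pass a row of registers (R0..R3) through the ALUs in order."""
    for alu in chain:
        row = alu(row)
    return row


alu0 = make_alu(dest=3, src_a=0, src_b=1, src_c=3, src_d=3)  # R3 = R0*R1 + R3*R3
alu1 = make_alu(dest=0, src_a=3, src_b=2, src_c=1, src_d=1)  # R0 = R3*R2 + R1*R1

row = [0.5, 0.25, 1.0, 0.0]
print(run_chain(row, [alu0, alu1]))  # [0.1875, 0.25, 1.0, 0.125]
print(run_chain(row, [alu1, alu0]))  # [0.0625, 0.25, 1.0, 0.015625]
```

The two print lines differ only in execution order, which is why the order of the ALUs is itself part of the configuration.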

FIG. 8 is a block diagram of an embodiment of a portion of a programmable graphics processor 205 having a reconfigurable pipeline, in which the pixel processing flow is configurable in response to software commands, so that the process flow of pixel packets through the stages can be reconfigured. A synchronization technique is therefore preferably used to coordinate the flow of pixel packets that are in flight during a change from one configuration to another; that is, synchronization is performed so that in-flight pixel packets intended to be processed in a first configuration complete their processing before the configuration changes to a second configuration. In one embodiment, data fetch stage 830, data write stage 855, and the individual ALUs 850 each have an input connected to first distributor 890 and an output connected to second distributor 895. Each of distributors 890 and 895 can, for example, comprise a switch, crossbar, router, or MUX circuit to select the distribution flow of incoming pixel packets to data fetch stage 830, ALUs 850, and data write stage 855.
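The following is a simplified software model of such a pipeline, not the distributor hardware. The `ReconfigurablePipeline` class and its methods are hypothetical; the stage order stands in for the routing chosen by the distributors, and a requested change is deferred until in-flight packets have drained, which is one possible way to realize the synchronization described above.

```python
from collections import deque


class ReconfigurablePipeline:
    """Routes packets through named stages in a software-selected order."""

    def __init__(self, stages, order):
        self.stages = stages          # stage name -> function(packet) -> packet
        self.order = list(order)      # current routing
        self.pending_order = None     # routing requested while packets are in flight
        self.in_flight = deque()

    def submit(self, packet):
        self.in_flight.append(packet)

    def reconfigure(self, order):
        # Defer the change if packets intended for the old configuration remain.
        if self.in_flight:
            self.pending_order = list(order)
        else:
            self.order = list(order)

    def drain(self):
        done = []
        while self.in_flight:
            packet = self.in_flight.popleft()
            for name in self.order:
                packet = self.stages[name](packet)
            done.append(packet)
        if self.pending_order is not None:
            self.order, self.pending_order = self.pending_order, None
        return done


def tag(name):
    """Stage stub that records its name so the route taken is visible."""
    return lambda packet: packet + [name]


stages = {name: tag(name) for name in ("fetch", "alu0", "write")}
pipe = ReconfigurablePipeline(stages, ["fetch", "alu0", "write"])

pipe.submit([])
pipe.reconfigure(["alu0", "fetch", "write"])  # requested while a packet is in flight
first = pipe.drain()
pipe.submit([])
second = pipe.drain()
print(first)   # [['fetch', 'alu0', 'write']]
print(second)  # [['alu0', 'fetch', 'write']]
```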
Distributors 890 and 895 determine the data path of incoming pixel packets 810 through data fetch stage 830, data write stage 855, and the individual ALUs 850. Signal inputs 892 and 894 allow distributors 890 and 895 to receive software commands (e.g., from a software application executing on a CPU) to reconfigure the distribution of pixel packets among data fetch stage 830, data write stage 855, and ALUs 850. One example of reconfiguration is assigning the order of execution of ALUs 850. Another example is bypassing data fetch stage 830 when it is determined that the data fetch stage is not needed for a processing task at a particular time. As yet another example, it may be desirable to change the order in which data fetch stage 830 is coupled to the ALUs. As another example, it may be desirable to reorder data write stage 855. As an illustrative example, there can be cases in which it is more efficient to operate on texture coordinates before the data fetch, in which case the data flow is arranged so that data fetch stage 830 receives pixel packets after ALUs 850 have performed the texture operations. Thus, one benefit of a reconfigurable pipeline is that a software application can reconfigure programmable graphics processor 205 to increase efficiency.

Referring again to FIG. 5, as previously discussed, raster stage 310 generates rows 510 of pixel packets for processing. Rows 510 can be further arranged into a group 520 of rows, such as a sequence of four rows 510, which are passed on for processing in successive clock cycles. However, some operations that can be performed on a pixel packet row 510 may require the result of an arithmetic operation on another pixel packet row. Therefore, in one embodiment raster stage 310 arranges pixel packets within the rows of group 520 to account for data dependencies. As an illustrative example, if a texture operation on one pixel packet requires the result of another pixel packet in a row, group 520 is arranged so that the pixel packet with the dependent texture operation is placed in a later row.

Referring to FIG. 9, in one embodiment pixels are alternately assigned as odd or even by raster stage 310. The corresponding registers (R0, R1, R2, and R3) of each row of a pixel are correspondingly assigned as even or odd. The even rows 905 of pixel packets of even pixels and the odd rows 910 of odd pixels are then interleaved, using one or more rules, to avoid data dependencies.
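The effect of interleaving on the spacing between dependent rows can be shown with a minimal sketch. It assumes, purely for illustration, that one row issues per clock cycle and that ALU latency is two cycles; the `interleave` and `issue_gap` helpers and the row labels are hypothetical.

```python
ALU_LATENCY = 2  # assumed latency in clock cycles, for illustration only


def interleave(even_rows, odd_rows):
    """Alternate rows of an even pixel with rows of an odd pixel."""
    schedule = []
    for even, odd in zip(even_rows, odd_rows):
        schedule.append(even)
        schedule.append(odd)
    return schedule


def issue_gap(schedule, producer, consumer):
    """Number of issue slots between a row and a later row that depends on it."""
    return schedule.index(consumer) - schedule.index(producer)


even_rows = [("even", 0), ("even", 1)]  # row 1 depends on the result of row 0
odd_rows = [("odd", 0), ("odd", 1)]

back_to_back = even_rows + odd_rows
interleaved = interleave(even_rows, odd_rows)

print(issue_gap(back_to_back, ("even", 0), ("even", 1)))  # 1
print(issue_gap(interleaved, ("even", 0), ("even", 1)))   # 2
```

With the rows issued back to back the dependent row arrives one slot after its producer, which is less than the assumed latency; after interleaving the gap equals the latency, so the result is available without a stall.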
Interleaving every other row provides additional clock cycles to account for ALU latency. Thus, if row 0 of an even pixel requires two clock cycles to produce a needed result, the interleaved row 0 of the odd pixel provides the additional clock cycle of time required by the ALU latency. As an illustrative example, consider a multi-texture operation in which row 0 of the even pixel is a blend operation and row 1 of the same pixel corresponds to a blend with a second texture that requires the result of the first blend operation. If the ALU latency of the first operation is two clock cycles, interleaving allows the result of the blend operation to be available for the texture blend that uses it.

In an interleaved embodiment, sideband information is preferably included to coordinate the interleaving. For example, in one embodiment the sideband information in each pixel packet includes an even/odd field to distinguish even rows from odd rows. Each ALU 350 includes two sets of temporary registers, corresponding to even pixels and odd pixels, to provide the appropriate temporary values for even and odd pixel packets. The even/odd field is used to select the appropriate set of temporary registers, for example, the even set for an even pixel and the odd set for an odd pixel. In one embodiment, even and odd pixels share the constant registers, reducing the total storage needed for constant values. In one embodiment, the software host can set a temporary register to a constant value for an extended period to emulate a constant register. Although interleaving two pixels is one embodiment, it will be understood that the interleaving can be extended to more than two pixels if, for example, the ALU latency corresponds to more than two clock cycles. One benefit of having raster stage 310 interleave pixel packets is that the hardware accounts for ALU latency, reducing the burden on software of dealing with the ALU latency that would otherwise arise if, for example, raster stage 310 did not interleave pixels.

As previously discussed, in a configurable pipeline the data flow among ALUs 350 can be configured. For example, in hardware each ALU 350 can be substantially identical. However, a particular ALU can be configured to have more than one possible position in the data flow, for example, a different order of execution.
Therefore, each ALU 350 needs an identifier that indicates its position within the data flow. The identifier could be provided to each ALU 350, for example, by a direct register write to each ALU 350. However, this approach has the disadvantage of requiring significant software overhead. Consequently, in one embodiment a packet technique is used to trigger the elements that need configuration information to discover their relative position within the process flow and to write a corresponding identifier into a local register.

Referring to FIG. 10, in one embodiment the register address space of ALUs 350 is software configurable using a packet initialization technique, in which a data packet communicates an identification (ID) to each ALU 350. Each ALU 350 can, for example, include conventional network modules for receiving and forwarding data packets. In one embodiment, ID packet 1010 is initiated by a software application. ID packet 1010 contains an initial ID code, such as a number. ID packet 1010 is injected into the graphics pipeline at a point before the elements that need an ID code and is then passed to the subsequent elements of the process flow defined by the current pipeline configuration. In one embodiment, configuration register 1020 in the first ALU 350 receives the ID packet and the current value of the ID code is written into the configuration register; the ID code of the ID packet is then incremented before the ID packet is passed on to the next ALU. The process continues, with each subsequent ALU 350 writing the current value of the ID code into its configuration register and then passing the ID packet, with an incremented ID code, to the next ALU. It will be understood that other stages along the data flow path can have configuration registers that are set in a similar manner.
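The ID packet mechanism can be sketched in a few lines. The `Element` class and the `propagate_id_packet` function are hypothetical names; the sketch assumes the ID code is a small integer that each element stores and then increments before forwarding.

```python
class Element:
    """A pipeline element (an ALU or another stage) with a configuration register."""

    def __init__(self, name):
        self.name = name
        self.config_register = None

    def receive(self, id_code):
        # Store the current ID code, then forward an incremented code.
        self.config_register = id_code
        return id_code + 1


def propagate_id_packet(flow, initial_id=0):
    """Pass an ID packet through the elements in their current flow order."""
    id_code = initial_id
    for element in flow:
        id_code = element.receive(id_code)
    return id_code


alus = [Element("ALU-0"), Element("ALU-1"), Element("ALU-2")]

propagate_id_packet(alus)
print([a.config_register for a in alus])  # [0, 1, 2]

# After software reorders the data flow, the same packet assigns new positions.
propagate_id_packet([alus[2], alus[0], alus[1]])
print([a.config_register for a in alus])  # [1, 2, 0]
```

Software only has to inject one packet; each element learns its own position from whatever order the current configuration routes the packet through.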
For example, the elements in the configuration flow can also include a data fetch stage or a data write stage that likewise has a configuration register set by reading the ID packet, and that increments the ID code before passing the ID packet, with its incremented ID, to the next element in the configuration flow. One benefit of this form of register configuration is that it requires no hardware differences between ALU 350 units, which allows the data flow through the pipeline to be reconfigured by software. Thus, for example, in one embodiment graphics processor management application 280 need only generate the initial ID packet 1010, such as by issuing, via host interface 220, a command to generate ID packet 1010, the command being received by ID packet generator 1030.

In an alternative embodiment, the ID code is written into the configuration registers using a broadcast packet technique that triggers the elements needing a configuration register write to discover their IDs. In this embodiment, the elements (e.g., ALUs 350) can use a network protocol to discover their IDs. A broadcast packet technique can be used, for example, in embodiments in which the pipeline is branched to allow branches of the pipeline to process pixels in parallel.

FIG. 11 illustrates an embodiment that includes diagnostic monitoring capability. In one embodiment, there is a series of taps along the elements of graphics processor 205, such as taps associated with each ALU 350 and with data fetch stage 330. Taps can also be included at other stages. Configurable test point selector 1105 is adapted to allow selected taps, such as the two taps 1120 and 1130, to be monitored in response to a software command, such as a command from graphics processor management application 280. Configurable test point selector 1105 can, for example, be implemented using multiplexers.
In one embodiment, at least one counter 1110 is included for collecting statistics for each selected test point. In one embodiment, an instrumentation packet generated by software provides information about the taps to be monitored and enables counting for the selected test points. In addition, statistics collection can be conditioned on the operating mode of the pipeline (for example, instrumentation registers can be provided that allow software to enable counting for a particular type of graphics operation, such as enabling statistics counting when an alpha blend operation occurs). One benefit of configurable test point selector 1105 is that it allows software, such as a graphics processor management application, to have statistics collected only for the test points of interest, which reduces hardware complexity and cost while still allowing software to analyze the behavior of any part of programmable processor 205. Test points of interest can be selected, for example, to collect statistics associated with the ALUs 350 that process a particular kind of data, such as the ALUs 350 that process texture data. In addition, statistics collection can be enabled for particular graphics operations, such as alpha blending.

In one embodiment, configurable test point selector 1105 uses a three-wire protocol. Each element that has payload data, such as ALU 350-0, generates a valid signal, which can, for example, flow downstream to the next element (e.g., ALU 350-1). An element that is ready to receive a payload generates a ready signal, which can, for example, flow upstream to the previous element. If, however, an element is not ready to receive a payload, the element generates a not-ready signal, which can, for example, correspond to the ready signal not being asserted.
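A minimal sketch of statistics collection from two selected taps follows. It assumes that each tap exposes per-cycle traces of its valid and ready signals, that a cycle counts as a transfer when valid and ready are both asserted, and that it counts as a wait when valid is asserted without ready; the `select_taps` and `count_states` helpers and the trace data are hypothetical.

```python
def select_taps(taps, upstream, downstream):
    """Pair the valid trace of one tap with the ready trace of the next element."""
    return list(zip(taps[upstream]["valid"], taps[downstream]["ready"]))


def count_states(samples):
    """Count clock cycles spent transferring and waiting."""
    counts = {"transfer": 0, "wait": 0}
    for valid, ready in samples:
        if valid and ready:
            counts["transfer"] += 1   # payload offered and accepted
        elif valid:
            counts["wait"] += 1       # payload offered but blocked downstream
    return counts


# Made-up per-cycle traces (1 = asserted) for two monitored elements.
taps = {
    "alu0": {"valid": [1, 1, 1, 0, 1], "ready": [1, 1, 1, 1, 1]},
    "alu1": {"valid": [0, 1, 0, 0, 1], "ready": [1, 0, 0, 1, 1]},
}

counts = count_states(select_taps(taps, "alu0", "alu1"))
print(counts)  # {'transfer': 2, 'wait': 2}
```

Only the two selected taps feed the counters, which mirrors the point that statistics are gathered for the test points of interest rather than for every element.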
An enable signal corresponds to an element being enabled for monitoring, such as by software control through a pipelined register write to a monitor-enable control bit stored adjacent to the monitored point. The signals can be tapped directly from the element that generates them or from the element that receives them. The valid, ready, and not-ready signals at the selected tap points can be used to determine the operating state. A transfer state corresponds to a clock tick that has a valid payload for data flowing downstream (that is, the valid bit is set) and a ready signal from the downstream block to receive the data (for example, a valid signal from ALU-0 at tap point 1120 and a ready signal from ALU-1 at tap point 1130). A wait state corresponds to a clock tick that has a valid payload which is blocked because the block below is not ready to receive data (for example, a valid signal from ALU-0 at tap point 1120 and a not-ready signal from ALU-1 at tap point 1130). In this embodiment, statistics can be collected at the selected tap points, such as a count of the number of clock cycles in which the transfer state and the wait state are detected.

Embodiments of the present invention provide a variety of benefits that are useful in embedded graphics processor core 250. In a compact, low-power handheld system 290, power, space, and CPU capability can be quite limited. In one embodiment, ALUs 350 are clock gated when processing is not required (for example, upon detecting a kill bit), reducing processing power requirements. In addition, raster stage 310 only needs to generate pixel packets for the subset of pixel data that is processed, which also reduces power requirements. Programmable ALU stage 340 requires less chip area than a conventional pipeline that has dedicated stages for performing dedicated graphics functions, reducing cost.
The programmable graphics processor 205 can be implemented with software-configurable blocks to provide improved efficiency. Test monitoring can be configured to test a subset of the test points, thereby reducing the bandwidth and analysis requirements of the software. These and other previously described features make the graphics processor 205 attractive for use in an embedded graphics processor core.

The above description, for purposes of explanation, is intended to provide a thorough understanding of the invention. However, it will be apparent to those skilled in the art that specific details are not required in order to practice the invention. Accordingly, the above description of specific embodiments of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Obviously, many modifications and variations are possible in light of the above teachings. The embodiments were chosen and described in order to best explain the invention, with such modifications as are suited to the particular use contemplated. The following claims and their equivalents are intended to define the scope of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a prior art diagram of a three-dimensional graphics pipeline;

FIG. 2 is a block diagram of an integrated circuit including a programmable graphics processor in accordance with one embodiment of the present invention;

FIG. 3 is a block diagram of a programmable graphics processor in accordance with one embodiment of the present invention;

FIG. 4 illustrates exemplary pixel packets in accordance with one embodiment of the present invention;

FIG. 5 illustrates an exemplary arrangement of pixel packets into rows and groups of pixel packets in accordance with one embodiment of the present invention;

FIG. 6 is a block diagram of a single arithmetic logic unit in accordance with one embodiment of the present invention;

FIG. 7 is a block diagram of a sequence of two arithmetic logic units in accordance with one embodiment of the present invention;

FIG. 8 is a block diagram of a configurable programmable graphics processor in accordance with one embodiment of the present invention;

FIG. 9 illustrates interleaving of rows of pixel packets in accordance with one embodiment of the present invention;

FIG. 10 is a block diagram illustrating an arithmetic logic unit having configuration registers in accordance with one embodiment of the present invention; and

FIG. 11 is a block diagram illustrating a configurable test point selector in accordance with one embodiment of the present invention.

Like reference numerals refer to corresponding parts throughout the several views of the drawings.

DESCRIPTION OF MAIN REFERENCE NUMERALS

105  Transform stage
110  Setup/raster stage
115  Texture address stage
120  Texture fetch stage
130  Fog stage
135  Alpha test stage
140  Depth test stage
145  Alpha blend stage
150  Memory write stage
200  Integrated circuit
205  Graphics processor
210  Register interface
220  Host interface
230  Direct memory access (DMA) engine
250  Graphics processing core
260  Central processing unit
270  Software application
275  Graphics application
280  Graphics processor management software application
290  System
295  Display
305  Setup stage
308  Vertex buffer
310  Raster stage
320  Gateway manager stage
325  Scoreboard
330  Data fetch stage
331  Texture/fog cache
334  Color/depth cache
340  Arithmetic logic unit (ALU) stage
350  Arithmetic logic unit (ALU)
350-0, 350-1, 350-2, 350-3  ALU
355  Data write stage
360  Recirculation path
410  Sideband information
412  Valid field
414  Cancel field
416  Instruction field
420  Payload information
422, 424  First set of (s, t) texture coordinates
426  Fog field
430, 460  Pixel packet
462  Color field
464, 466  Second set of (s, t) texture coordinates
510  Pixel packet row
520  Group
605  Input bus
610  Instruction RAM
620  Block
630  Register (T)
640  Path
645  First multiplexer (MUX) stage
650  Format conversion module
660  Second MUX stage
670  Arithmetic computation unit
680  Clamper
690  Bus
810  Pixel packet
830  Data fetch stage
850  ALU
855  Data write stage
890, 895  Distributor
892  Signal input
905  Even pixel packet row
910  Odd row
1010  ID packet
1020  Configuration register
1030  ID packet generator
1105  Configurable test point selector
1110  Counter
1120, 1130  Tap / tap point
R0, R1, R2, R3  Pixel packet


Claims

1. A graphics processor, comprising: a raster stage that receives data regarding a primitive to be rasterized, the raster stage generating a plurality of pixel packets for each pixel to be processed, each pixel packet including payload information identifying at least one pixel attribute to be processed and having associated sideband information identifying at least one instruction to be executed on the pixel packet; and a programmable arithmetic logic unit (ALU) stage for processing the pixel packets, the ALU stage comprising at least one ALU, each ALU being programmed to have a set of at least one possible scalar arithmetic operation to be performed on an incoming pixel packet having a corresponding current instruction; wherein a sequence of arithmetic operations is performed on the plurality of pixel packets to perform a graphics processing function.

2. The graphics processor of claim 1, wherein the sideband information includes a cancel field, and each ALU is clock gated in response to detecting a cancel bit in the cancel field, to reduce the power consumption of the ALUs.

3. The graphics processor of claim 2, wherein the at least one ALU sets the cancel bit in response to detecting a true value of a logical comparison.

4. The graphics processor of claim 1, wherein the arithmetic operation has the form a*b + c*d, where a, b, c, and d are operands and * is a multiplication operation.

5. The graphics processor of claim 1, wherein the ALU stage is adapted to perform a network function.

6. The graphics processor of claim 1, wherein the ALU stage performs at least one of a fog operation, texture mapping, alpha blending, a z test, or an alpha test.
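As an informal illustration of the scalar operation form a*b + c*d and of skipping arithmetic for cancelled pixel packets, a minimal software sketch might look as follows. It is not the claimed hardware: `PixelPacket`, `scalar_op`, `alu_step`, and the payload field names are hypothetical, and clock gating is modeled simply as returning early without doing any arithmetic.

```python
from dataclasses import dataclass, field


@dataclass
class PixelPacket:
    """Hypothetical packet: a cancel bit (sideband) plus named payload fields."""
    cancel: bool = False
    payload: dict = field(default_factory=dict)


def scalar_op(a, b, c, d):
    """The scalar operation form a*b + c*d."""
    return a * b + c * d


def alu_step(packet, operands, dest):
    """Apply one scalar operation unless the packet is marked as cancelled."""
    if packet.cancel:
        # Stand-in for clock gating: no arithmetic is performed at all.
        return packet
    a, b, c, d = (packet.payload[name] for name in operands)
    packet.payload[dest] = scalar_op(a, b, c, d)
    return packet


# Alpha blending expressed in the a*b + c*d form: src*alpha + dst*(1 - alpha).
fields = {"src": 0.5, "alpha": 0.5, "dst": 0.25, "inv_alpha": 0.5}
live = PixelPacket(payload=dict(fields))
dead = PixelPacket(cancel=True, payload=dict(fields))

alu_step(live, ("src", "alpha", "dst", "inv_alpha"), "out")
alu_step(dead, ("src", "alpha", "dst", "inv_alpha"), "out")

print(live.payload["out"])    # 0.375
print("out" in dead.payload)  # False
```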

7. The graphics processor of claim 1, wherein the raster stage generates at least one row of pixel packets for each of the pixels, each row having a plurality of pixel packets that are transmitted to the ALU stage in a particular clock cycle.

8. The graphics processor of claim 1, further comprising a gateway manager stage having a scoreboard for tracking pixel packets received from the raster stage.

9. The graphics processor of claim 8, further comprising: a data fetch stage for fetching data for the pixel packets; and a data write stage for performing a memory write of pixel data of processed pixel packets received from the ALU stage.

10. The graphics processor of claim 9, further comprising: a recirculation path from the data write stage to the gateway manager stage for recirculating pixel packets for additional passes through the ALU stage.

11. The graphics processor of claim 1, wherein information about pixel packets marked as cancelled in the ALU stage is provided to the scoreboard to account for the cancelled pixel packets.

12. A graphics processor, comprising: at least one stage for transforming and setting up vertices of a primitive to be rasterized; a raster stage that receives information regarding the primitive to be rasterized, the raster stage generating at least one row of pixel packets for each pixel to be processed using a graphics operation, the graphics operation being expressible as a sequence of scalar arithmetic operations; a gateway manager comprising a scoreboard for tracking the processing of the pixel packets; a data fetch stage for fetching data for each row of pixel packets; an ALU stage comprising a plurality of programmable arithmetic logic units (ALUs) for processing each row of pixel packets, each ALU receiving an input row of pixel packets and outputting an output row of pixel packets, each ALU reading at least one operand from the input row, using the at least one operand to perform a scalar arithmetic operation, generating a result, and performing at least one of writing the result to a temporary value or using the result to update a pixel attribute register; and a data write stage for performing a memory write of pixel data of processed pixel packets received from the plurality of ALUs; wherein a sequence of arithmetic operations is performed on the plurality of pixel packets to perform the graphics processing function.

13. The graphics processor of claim 12, wherein the plurality of ALUs are arranged as a pipeline.

14. The graphics processor of claim 12, wherein each pixel packet includes payload information identifying at least one pixel attribute to be processed, and each row has associated sideband information identifying at least one instruction to be executed on each pixel packet of the row.

15. The graphics processor of claim 12, further comprising a recirculation path for recirculating pixel packets from the data write stage to the gateway manager, whereby the ALU stage can be used one or more times to process the pixel packets.

16. The graphics processor of claim 12, wherein each ALU is configurable to allow a software host to select an operand from at least one of the row, a constant value, and a temporary value.

17. The graphics processor of claim 12, wherein arithmetic processing in an ALU is deactivated in response to detecting that a pixel packet is marked as cancelled.

18. The graphics processor of claim 12, wherein the scalar arithmetic operation has the form a*b + c*d, where a, b, c, and d are operands and * is a multiplication operation.

19. The graphics processor of claim 12, wherein the ALU stage is adapted to perform a network function.

20. The graphics processor of claim 12, wherein the ALU stage performs at least one of a fog operation, texture mapping, alpha blending, a z test, or an alpha test.

21. The graphics processor of claim 12, wherein the raster stage and the plurality of ALUs are adapted to convert a vector arithmetic operation into a sequence of scalar arithmetic operations.

22. A graphics system, comprising: a central processing unit having a graphics software module; and a programmable graphics processor that receives, from the graphics software module, vertex information and instructions for programming stages of the programmable graphics processor, the programmable graphics processor comprising: a raster stage that generates a plurality of pixel packets for each pixel to be processed in response to an instruction of the graphics software module, each pixel packet including payload information identifying at least one pixel attribute to be processed and having associated sideband information identifying at least one instruction to be executed on each of the pixel packets; and a programmable arithmetic logic unit (ALU) stage comprising a plurality of ALUs configured to process the pixel packets, each ALU being programmed by the graphics software module to read selected operands from a received pixel packet, perform a scalar arithmetic operation in response to the current instruction to generate a result, and perform at least one of using the result to update a pixel attribute register and storing the result as a temporary value; wherein a sequence of scalar arithmetic operations is performed on the plurality of pixel packets to perform a graphics processing function on each of the pixels.

23. The graphics system of claim 22, wherein the sideband information includes a cancel field, and each ALU is clock gated in response to detecting a cancel bit in the cancel field, to reduce the power consumption of the plurality of ALUs.

24. The graphics system of claim 22, wherein the arithmetic operation has the form a*b + c*d, where a, b, c, and d are operands and * is a multiplication operation.

25. The graphics system of claim 22, wherein the ALU stage is adapted to perform a network function.

26. The graphics system of claim 22, wherein the ALU stage performs at least one of a fog operation, texture mapping, alpha blending, a z test, or an alpha test.

27. The graphics system of claim 22, wherein the raster stage generates at least one row of pixel packets for each of the pixels, each row having a plurality of pixel packets that are transmitted to the ALU stage in a particular clock cycle.

28. The graphics system of claim 22, further comprising a gateway manager stage having a scoreboard for tracking pixel packets received from the raster stage.

29. The graphics system of claim 22, further comprising: a data fetch stage for fetching data for the pixel packets; a data write stage for performing a memory write of pixel data of processed pixel packets received from the ALU stage; and a recirculation path from the data write stage to the gateway manager stage for recirculating pixel packets for additional passes through the ALU stage.

30. The graphics system of claim 29, wherein information about pixel packets marked as cancelled in the ALU stage is provided to the scoreboard to account for the cancelled pixel packets.
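The conversion of a vector arithmetic operation into a sequence of scalar arithmetic operations, as recited above, can be pictured with a four-component dot product built only from operations of the form a*b + c*d. This is a sketch under assumptions: the particular three-step decomposition, the function names, and the use of a stored temporary are illustrative choices, not a sequence taken from this document.

```python
def scalar_op(a, b, c, d):
    """One scalar ALU operation of the form a*b + c*d."""
    return a * b + c * d


def dot4(u, v):
    """Four-component dot product as a sequence of three scalar operations."""
    t = scalar_op(u[0], v[0], u[1], v[1])  # first pair, kept as a temporary value
    r = scalar_op(u[2], v[2], u[3], v[3])  # second pair
    return scalar_op(t, 1, r, 1)           # combine: t*1 + r*1


print(dot4((1, 2, 3, 4), (5, 6, 7, 8)))  # 70
```

Each call stands for one instruction in the sequence; in a pipeline of ALUs the temporary `t` would correspond to a stored temporary value read as an operand by a later operation.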
31. An embedded processor, comprising: a register interface for a host to program registers of a graphics core; a host interface for the host to communicate with the graphics core; a memory interface for the graphics core to read and write data; and a programmable graphics processor disposed in the graphics core, the programmable graphics processor comprising: at least one stage for setting up and transforming vertices of a primitive to be rasterized; a raster stage that receives information regarding the primitive to be rasterized, the raster stage generating at least one row of pixel packets for each pixel to be processed using a graphics operation, the graphics operation being expressible as a sequence of scalar arithmetic operations, each pixel packet including payload information identifying at least one pixel attribute to be processed, and each row having associated sideband information identifying at least one instruction to be executed on each pixel packet of the row; a gateway manager comprising a scoreboard for tracking the processing of the pixel packets; a data fetch stage for fetching data for each row of pixel packets; an ALU stage comprising a plurality of programmable arithmetic logic units (ALUs) for processing each row of pixel packets, each ALU receiving an input row of pixel packets and outputting an output row of pixel packets, each ALU reading at least one operand from the input row, using the at least one operand to perform a scalar arithmetic operation, generating a result, and performing at least one of writing the result to a temporary value or using the result to update a pixel packet of the output row; and a data write stage for performing a memory write of pixel data of processed pixel packets received from the plurality of ALUs; wherein a sequence of arithmetic operations is performed on the plurality of pixel packets to perform the graphics processing function.

32. The embedded processor of claim 31, wherein the plurality of ALUs are arranged in a pipeline.

33. The embedded processor of claim 31, further comprising a recirculation path for recirculating pixel packets from the data write stage to the gateway manager stage, whereby the ALU stage can be used one or more times to process the pixel packets.

34. The embedded processor of claim 31, wherein each ALU is configurable to allow an operand to be selected from at least one of the row, a constant value, and a temporary value.

35. The embedded processor of claim 31, wherein at least one stage is configured to detect a cancel condition and mark a pixel packet as cancelled, and arithmetic processing is deactivated in response to detecting that a pixel packet is marked as cancelled.

36. The embedded processor of claim 31, further comprising a central processing unit that executes an application to operate the graphics core, wherein the graphics core and the central processing unit are disposed on one chip.

37. A method of performing a graphics processing operation on a pixel, comprising: for at least one graphics function to be performed on a pixel, identifying a sequence of scalar arithmetic operations to be performed on pixel packets to implement the at least one graphics function; generating a plurality of pixel packets for the pixel, each pixel packet including a subset of pixel attributes to be processed as operands in the sequence of scalar arithmetic operations, the plurality of pixel packets having an associated sequence of instructions; in at least one arithmetic logic unit (ALU), reading operands from the pixel packets; and in the at least one ALU, performing scalar arithmetic operations according to the sequence of instructions to execute the sequence of scalar arithmetic operations that implements the at least one graphics function.

38. The method of claim 37, further comprising: determining that a pixel does not require further processing; assigning a cancel state to at least one pixel packet; and deactivating arithmetic calculations in each subsequent ALU encountered by the at least one pixel packet, to save power.
39. The method of claim 37, wherein the graphics function includes at least one of a texture combination, [illegible], an alpha blend, an alpha test, or a fog operation.

40. The method of claim 37, wherein the scalar arithmetic operation has the form a*b + c*d, where a, b, c, and d are operands and * is a multiplication operation.

41. A method of performing a graphics processing operation on a pixel, comprising: for a graphics function to be performed on a pixel, identifying a sequence of scalar arithmetic operations that can be performed on pixel packets to implement the graphics function; generating, for each pixel to be processed, at least one row of pixel packets to be processed in successive clock cycles, each pixel packet including at least one field for a subset of pixel attributes to be processed as operands in the sequence of scalar arithmetic operations, the at least one row having an associated sequence of instructions; in each of a plurality of arithmetic logic units (ALUs), reading assigned operands, at least one of the operands corresponding to an operand read from one of the pixel packets in a row of pixel packets; and in each ALU, performing a scalar arithmetic calculation on the assigned operands according to the sequence of instructions, to execute the sequence of scalar arithmetic operations for performing the graphics function.

42. The method of claim 41, wherein each ALU performs at least one of: updating a pixel packet using a result of the scalar arithmetic operation; and storing a result of the scalar arithmetic operation for use as an operand in an arithmetic operation performed in a later clock cycle.

43. The method of claim 41, further comprising: identifying a pixel that does not require further processing and, in response, marking at least one pixel packet of the pixel as cancelled; and in each ALU, deactivating arithmetic calculations on pixel packets of the pixel that have been marked as cancelled.

44. The method of claim 43, wherein the identifying is performed in a data fetch stage.

45. The method of claim 43, wherein the identifying is performed in a data write stage.

46. The method of claim 41, further comprising: assigning [illegible] to be read and a corresponding scalar arithmetic operation for a current instruction.

47. The method of claim 41, further comprising: fetching data for the pixel packets.

48. The method of claim 41, further comprising: writing to memory pixel data that has been processed by the graphics function.

49. The method of claim 41, further comprising: [illegible] in the ALUs, after a first pass of processing, the processed pixel packets.

50. The method of claim [illegible], further comprising performing a first graphics function on a first row of pixel packets and a second graphics function on a second row of pixel packets, wherein a first group of ALUs performs the first graphics function on a first group of pixel packets, and a second group of ALUs performs the second graphics function on a second group of pixel packets.

51. A method of performing a graphics processing operation, comprising: programming a plurality of arithmetic logic units (ALUs) to read selected operands from a row of pixel packets and to perform a selected scalar arithmetic operation in response to a current instruction associated with the row of pixel packets; for at least one graphics operation to be performed on a pixel, identifying at least one corresponding scalar arithmetic operation to be performed on a subset of attributes of the pixel; generating a row of pixel packets for the pixel, each pixel packet including at least one field for an attribute of the pixel to be processed as at least one operand, the row of pixel packets having an associated instruction indicating the scalar arithmetic operation to be executed; and in the ALUs, reading the selected operands from the row of pixel packets and executing the selected scalar arithmetic operation corresponding to the associated current instruction.

52. The method of claim 51, further comprising: programming a data fetch stage to fetch data for the pixel packets.

53. The method of claim 51, further comprising: programming the raster stage to map the graphics operation to a pixel packet assignment and associated instructions.

54. The method of claim 51, further comprising: programming at least one ALU to perform a scalar comparison test and, if a pixel packet fails the test, to mark the pixel packet as cancelled.
55. A method of performing a graphics processing operation on a pixel, comprising: for at least one graphics function to be performed on a pixel, identifying scalar arithmetic operations executable on pixel packets to implement the at least one graphics function; generating at least one row of pixel packets for each pixel to be processed, each pixel packet including at least one field for a subset of pixel attributes to be processed as operands, the at least one row having an associated sequence of instructions; in each of a plurality of arithmetic logic units (ALUs), reading assigned operands, at least one of the operands corresponding to an operand read from one of the pixel packets in a row of pixel packets; in each ALU, performing a scalar arithmetic calculation on the assigned operands according to the sequence of instructions; for a selected scalar arithmetic operation requiring a result in the range [0, 1], formatting the corresponding operands of the pixel packet in an S1.8 format, the S1.8 format corresponding to a base-2 representation of operands in the range [-2, +2] having an 8-bit fractional component, and clamping the result of the selected scalar arithmetic operation to the range [0, 1]; and for at least one other scalar arithmetic operation, formatting the corresponding pixel packet in a different data format.

56. The method of claim 55, wherein the selected scalar arithmetic operation requiring a result in the range [0, 1] is a [illegible] calculation performed in the S1.8 format.

57. The method of claim 56, wherein the at least one other scalar arithmetic operation is an operation on a texture represented in a format different from S1.8.

58. The method of claim 57, wherein each pixel packet has a fixed bit size, a pixel packet for texture including a high-precision (s, t) texture coordinate, and a pixel packet for color including at least two color components.

59. A method of performing a graphics processing operation on a pixel, comprising: for a first graphics function to be performed on a color component of a pixel, identifying a first sequence of scalar arithmetic operations to implement the first graphics function, the first graphics function requiring scalar arithmetic operations having results in the range [0, 1]; for a second graphics function to be performed on a texture associated with the pixel, identifying a second sequence of scalar arithmetic operations to implement the second graphics function; generating at least one row of pixel packets for each pixel, each pixel packet having a fixed bit size and including at least one field for a subset of pixel attributes to be processed as operands, the at least one row having an associated sequence of instructions; for each pixel packet associated with the first graphics function, packing at least two color components into the pixel packet in an S1.8 format, the S1.8 format corresponding to a base-2 representation of operands in the range [-2, +2] having an 8-bit fractional component; for each pixel packet associated with the second graphics function, packing a single high-precision texture requiring more than 8 bits; and in each of a plurality of arithmetic logic units (ALUs), reading assigned operands and performing a scalar arithmetic calculation on the assigned operands according to the sequence of instructions; wherein for the first graphics function, color components in the S1.8 format are selected as operands and a result is clamped to the range [0, 1], and for the second graphics function, the texture is selected as an operand in a format having a precision greater than 8 bits.

60.
A graphics processor, comprising: at least one stage for setting up and transforming vertices of a primitive to be rasterized; a raster stage that receives information regarding the primitive to be rasterized, the raster stage generating at least one row of pixel packets for each pixel to be processed using a graphics operation, the graphics operation being expressible as a sequence of scalar arithmetic operations; a gateway manager comprising a scoreboard for tracking the processing of the pixel packets; a data fetch stage for fetching data for each row of pixel packets; an ALU stage comprising a plurality of programmable arithmetic logic units (ALUs) for processing each row of pixel packets, each ALU receiving an input row of pixel packets and outputting an output row of pixel packets, each ALU reading at least one operand from the input row, using the at least one operand to perform a scalar arithmetic operation, generating a result, and performing at least one of writing the result to a temporary value or using the result to update a pixel packet of the output row; and a data write stage for performing a memory write of pixel data of processed pixel packets received from the plurality of ALUs; wherein a sequence of arithmetic operations is performed on the plurality of pixel packets to perform the graphics processing function; wherein the raster stage formats, in an S1.8 format, pixel packets for a first type of scalar arithmetic operation, the S1.8 format corresponding to a base-2 representation of operands in the range [-2, +2] having an 8-bit fractional component, and each ALU processing the first type of scalar operation clamps a result to the range [0, 1]; and wherein the raster stage formats pixel packets in a second format for a second type of scalar arithmetic operation requiring a precision greater than 8 bits.
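The S1.8 format and the clamp to [0, 1] can be illustrated with a small fixed-point sketch. The encoding below is an assumption made for the example: a 10-bit two's-complement value (sign, one integer bit, eight fraction bits), which represents [-2, 2 - 1/256] rather than exactly [-2, +2]. The helper names and the saturating conversion are likewise hypothetical.

```python
FRACTION_BITS = 8
TOTAL_BITS = 10                         # assumed: sign + 1 integer bit + 8 fraction bits
SCALE = 1 << FRACTION_BITS              # 256 raw steps per 1.0
MIN_RAW = -(1 << (TOTAL_BITS - 1))      # -512 represents -2.0
MAX_RAW = (1 << (TOTAL_BITS - 1)) - 1   # 511 represents 1.99609375


def to_s1_8(x: float) -> int:
    """Convert a real value to a raw S1.8 integer, saturating at the format limits."""
    raw = round(x * SCALE)
    return max(MIN_RAW, min(MAX_RAW, raw))


def from_s1_8(raw: int) -> float:
    """Convert a raw S1.8 integer back to a real value."""
    return raw / SCALE


def mul_s1_8(a: int, b: int) -> int:
    """Multiply two raw S1.8 values; the product has 16 fraction bits, so shift back."""
    return (a * b) >> FRACTION_BITS


def clamp01(raw: int) -> int:
    """Clamp a raw S1.8 result to the range [0, 1]."""
    return max(0, min(SCALE, raw))


color = to_s1_8(0.75)            # 192
gain = to_s1_8(1.5)              # 384: values above 1.0 fit in the format's headroom
product = mul_s1_8(color, gain)  # 288, i.e. 1.125 before clamping
print(from_s1_8(product))            # 1.125
print(from_s1_8(clamp01(product)))   # 1.0
```

The example shows why the extra range helps: an intermediate value such as 1.125 survives the arithmetic and is only clamped to [0, 1] at the end.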
61. The graphics processor of claim 60, wherein the first type of scalar operation corresponds to an operation on a color component, at least two color components being included in a pixel packet generated for the first type of scalar operation, and the second type of scalar operation corresponds to an operation on an attribute having a size larger than one half of the size of the pixel packet.

62. The graphics processor of claim 60, wherein the second type of scalar operation corresponds to an operation on a texture.

63. The graphics processor of claim 62, wherein each pixel packet has a size of at least 20 bits, a texture attribute has at least 16 bits and an associated 4-bit [illegible], and each S1.8 operand [illegible], such that each pixel packet can contain two S1.8 operands for the first type of operation, or one set of texture attributes for the second type of operation.

64. A graphics processor, comprising: a raster stage that receives data regarding a primitive to be rasterized, the raster stage generating a plurality of pixel packets for each pixel to be processed, each pixel packet including payload information identifying at least one pixel attribute to be processed and having associated sideband information identifying at least one instruction to be executed on each of the pixel packets; a programmable arithmetic logic unit (ALU) stage for processing the pixel packets, the ALU stage including a plurality of ALUs, each ALU having a set of at least one possible arithmetic operation to be performed on an incoming pixel packet having a corresponding current instruction; a data fetch stage for fetching data for the pixel packets; a data write stage for performing a memory write of pixel data of processed pixel packets received from the ALU stage; a first distributor coupled to individual inputs of the ALU stage, the data fetch stage, and the data write stage; and a second distributor coupled to individual outputs of the ALU stage, the data fetch stage, and the data write stage; the first distributor and the second distributor being adapted to reconfigure the processing flow of pixel packets through the data fetch stage, the ALU stage, and the data write stage in response to a command from a host.

65. The graphics processor of claim 64, wherein the first distributor and the second distributor are adapted to allow at least a portion of the graphics processor to be bypassed in response to a software command.

66. The graphics processor of claim 65, wherein the first distributor and the second distributor are adapted to allow the data fetch stage to be bypassed in response to a software command.

67. The graphics processor of claim 64, wherein the first distributor and the second distributor are coupled to each ALU of the ALU stage to allow an ALU execution order to be assigned in response to a software command.

68. The graphics processor of claim 64, wherein the graphics processor is adapted to allow the host to reconfigure the processing flow to have a data fetch after a scalar arithmetic operation.

69. The graphics processor of claim 64, wherein the graphics processor is adapted to allow the host to reconfigure the processing flow to have a data fetch prior to a scalar arithmetic operation.

70.
A method for operating a graphics pipeline, the graphics pipeline having a rasterizer eight for generating a pixel packet; a data extraction phase for extracting data of the pixel packet; and an ALU phase having at least one alu For performing a scalar arithmetic operation on a pixel packet; a data writing phase for writing pixel data; and a distributor coupled to the data extraction phase, the data writing phase, and the ALU phase; The method includes: responding to the first command, stylizing the allocators to define a pixel packet via 101822.doc -17- 1297468, the first processing flow by the data extraction phase, the ALU phase, and the data writing phase; and the response a second command 'programming the allocators to define a pixel packet via the data extraction phase, the ALU phase, and the second phase of the data writing phase; wherein the software host can have any of a plurality of processing flows To configure the pipeline. The method of claim 70, wherein the first processing flow includes the data extraction, the ALU phase, and the data writing phase, and the second processing flow bypasses the data extraction phase. The method of claim 7, wherein the first processing flow includes a first execution order of the one of the alus, and the second processing flow includes a second execution order of the ALUs. 73. The method of claim 7, wherein the first processing flow comprises data extraction prior to a scalar arithmetic operation, and the second processing flow includes a data extraction subsequent to a scalar arithmetic operation. 74. 
A method of operating a graphics pipeline, the graphics pipeline having: a rasterizer for generating a pixel packet; a data extraction phase for extracting data of the pixel packet; and an ALU phase having at least one ALU For performing a quantization arithmetic operation on a pixel packet; a data writing phase for writing pixel data; and a distributor coupled to the data extraction phase, the data writing phase, and the ALU phase; The method includes: receiving a command from a software host to reconfigure the pipeline from the pixel packet via the data extraction phase, the ALU phase, and the data writing phase 101822.doc -18- 1297468 Up to the second processing flow of the pixel packet via the data extraction phase, the ALU phase, and the data writing phase; and adjusting the distributors to reconfigure the pipeline from the first processing flow to the second processing Process. 75. The method of claim 74, wherein the first processing flow includes the data extraction phase, the ALU phase, and the data writing phase, and the second processing flow bypasses the data extraction phase. The method of claim 74, wherein the first processing flow includes a first execution order of the ones of the mus, and the second processing flow includes a second execution order of the ALUs. 77.:: method of claim 74 wherein the first processing flow includes data extraction prior to a scalar = operation, and the processing flow includes a data extraction after a purely different operation. 78. A graphics processor, comprising: a plurality of components for processing a pixel packet; a sub-distributor that engages to an individual input of the plurality of components, and a second distributor that is coupled to the plurality of components Individual outputs of the components; the first distributor and the adapter are adapted to reconfigure the pixel packets via the plurality of components - in response to the process - from;; host commands. 
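The reconfiguration recited in claims 64 to 78 can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the claimed hardware: the two distributors are collapsed into a single routing list, each stage is a placeholder that only records its name, and the names `Pipeline`, `configure`, `run`, and `make_stage` are hypothetical and do not appear in the specification.

```python
from typing import Callable, Dict, List

Packet = List[str]                  # hypothetical: a packet records the stages it visited
Stage = Callable[[Packet], Packet]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only logs its own name in the packet."""
    def stage(packet: Packet) -> Packet:
        return packet + [name]
    return stage


class Pipeline:
    """Hypothetical model; both distributors are reduced to one routing list."""

    def __init__(self, stages: Dict[str, Stage]) -> None:
        self.stages = stages
        self.flow: List[str] = list(stages)   # default flow: declaration order

    def configure(self, flow: List[str]) -> None:
        """Model a host command that reprograms the distributors."""
        unknown = [name for name in flow if name not in self.stages]
        if unknown:
            raise ValueError(f"unknown stage(s): {unknown}")
        self.flow = list(flow)

    def run(self, packet: Packet) -> Packet:
        """Route one packet through the stages in the configured order."""
        for name in self.flow:
            packet = self.stages[name](packet)
        return packet


pipeline = Pipeline({name: make_stage(name) for name in ("fetch", "alu", "write")})
first_flow = pipeline.run([])                   # fetch, then ALU, then write

pipeline.configure(["alu", "write"])            # second flow: bypass the fetch stage
bypass_flow = pipeline.run([])

pipeline.configure(["alu", "fetch", "write"])   # data fetch after the ALU operation
reordered_flow = pipeline.run([])
```

The three calls to `run` correspond to a first process flow, a flow that bypasses the data fetch stage, and a flow with the data fetch placed after the scalar arithmetic operation. Only the routing list changes between them; the stages themselves are untouched.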
79. The graphics processor of claim 78, wherein the elements are stages.

80. A method of operating a graphics pipeline, the graphics pipeline having a plurality of elements for processing pixel packets, the method comprising: in response to a first command, programming distributors coupled to the plurality of elements to define a first process flow of pixel packets; and in response to a second command, programming the distributors to define a second process flow of pixel packets; wherein a software host can configure the pipeline to have any one of a plurality of process flows.

81. The method of claim 80, wherein the elements are stages.

82. A method of performing a graphics processing operation, comprising: identifying a sequence of scalar arithmetic operations to be performed on a plurality of pixel packets to implement a graphics function; assigning a pixel as an even pixel or an odd pixel; generating at least two rows of pixel packets for the pixel, each row of pixel packets including at least one field for a subset of pixel attributes to be processed as operands in the sequence of scalar arithmetic operations, the at least two rows having an associated instruction sequence and an identifier identifying the pixel packets as being for an odd pixel or an even pixel; arranging an interleaved sequence of rows of pixel packets in a group of pixel packets, in which the rows of the group are processed on consecutive clock cycles with rows of even pixels and odd pixels interleaved; and, in each of a plurality of arithmetic logic units (ALUs) of an ALU stage, for a current clock cycle, receiving a row of pixel packets and performing, according to the instruction sequence, a scalar arithmetic calculation on at least one operand from the row of pixel packets; wherein the processing of pixel packets of even pixels and odd pixels is interleaved in the ALUs.

83. The method of claim 82, wherein a row of pixel packets requires a result from a previous row of pixel packets, and the interleaving is selected to account for ALU latency.

84. The method of claim 82, further comprising: storing in each ALU a set of constant values shared by both the even pixel and the odd pixel.

85. The method of claim 82, further comprising: storing in each ALU a first set of temporary values for the even pixel and a second set of temporary values for the odd pixel.

86. The method of claim 85, further comprising: using the identifier to select the first set of temporary values for pixel packets of the even pixel and to select the second set of temporary values for pixel packets of the odd pixel.

87. The method of claim 85, further comprising: storing the temporary values for a period of time sufficient to emulate a constant register.

88. A method of performing a register write of an identifier in elements of a configurable graphics pipeline, the configurable graphics pipeline having more than one possible process flow of pixel packets through the elements of the graphics pipeline, the method comprising: receiving a software command indicating a process flow; and generating a data packet that triggers the elements of the graphics pipeline to discover an identifier indicating the relative position of each element in the process flow and to write the identifier in a configuration register.

89. The method of claim 88, wherein a first element responds to the data packet.
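The even and odd interleaving of claims 82 to 87 can be illustrated with a short Python sketch. It assumes a toy instruction and a single scalar operand per row; `Row`, `interleave`, `Alu`, and the `scale` constant are hypothetical names chosen for illustration. The point it shows is that each pixel's second row depends on the result of its first row, and placing a row of the other pixel in between gives that result a clock cycle to become available.

```python
from itertools import chain, zip_longest
from typing import Dict, List, NamedTuple


class Row(NamedTuple):
    parity: str      # identifier carried with the row: "even" or "odd"
    operand: int     # one scalar operand per row, a simplification


def interleave(even_rows: List[Row], odd_rows: List[Row]) -> List[Row]:
    """Alternate even and odd rows; leftovers of the longer list follow in order."""
    paired = zip_longest(even_rows, odd_rows)
    return [row for row in chain.from_iterable(paired) if row is not None]


class Alu:
    """Hypothetical ALU with a shared constant and one temporary per parity."""

    def __init__(self, scale: int) -> None:
        self.scale = scale                                   # shared by both pixels
        self.temps: Dict[str, int] = {"even": 0, "odd": 0}   # selected by identifier

    def execute(self, row: Row) -> int:
        """Toy instruction: temp = temp * scale + operand, on the row's own temp."""
        result = self.temps[row.parity] * self.scale + row.operand
        self.temps[row.parity] = result
        return result


even_rows = [Row("even", 1), Row("even", 2)]
odd_rows = [Row("odd", 10), Row("odd", 20)]
group = interleave(even_rows, odd_rows)    # even, odd, even, odd

alu = Alu(scale=2)
results = [alu.execute(row) for row in group]
```

Because the identifier selects the temporary set, the odd pixel's rows never disturb the even pixel's intermediate value, while the constant `scale` is stored once and shared by both.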
90. The method of claim 88, wherein the data packet is injected into the pipeline at a location of an element requiring configuration information, and each successive element requiring configuration reads a current value of an identifier in the data packet, writes the current value to its configuration register, increments the identifier, and passes the data packet with the incremented identifier to the next element in the process flow.

91. The method of claim 88, wherein the elements comprise arithmetic logic units for processing pixel packets.

92. A graphics processor, comprising: a raster stage that receives data about a primitive to be rasterized, the raster stage generating a plurality of pixel packets for each pixel to be processed, each pixel packet including a payload identifying at least one pixel attribute to be processed and having associated sideband information identifying at least one instruction to be executed on each of the pixel packets; a programmable arithmetic logic unit (ALU) stage for processing the pixel packets, the ALU stage including a plurality of ALUs, each ALU having a set of at least one possible arithmetic operation, the arithmetic operation being performed on an incoming pixel packet having a corresponding current instruction; a data fetch stage for fetching data for the pixel packets; a data write stage for performing a memory write of pixel data of processed pixel packets received from the ALU stage; a first distributor coupled to individual inputs of the ALU stage, the data fetch stage, and the data write stage; and a second distributor coupled to individual outputs of the ALU stage, the data fetch stage, and the data write stage; the first distributor and the second distributor being adapted to reconfigure a process flow of pixel packets through the data fetch stage, the ALU stage, and the data write stage in response to a command from a host; wherein each ALU of the ALU stage is adapted to receive an identification packet initiated by software, each ALU writing a current value of an identifier of the identification packet into a configuration register, incrementing the identifier, and forwarding the identification packet to the next ALU.

93. A graphics processor, comprising: a graphics pipeline having a set of tap points associated with elements of the graphics pipeline; and a configurable test point selector receiving commands from a software host, the configurable test point selector being adapted to monitor a subset of tap points selected by a software command and to count statistics for at least one condition associated with each tap point of the subset of tap points; wherein the statistics for the subset of tap points are collected for the software host.

94. The graphics processor of claim 93, wherein the graphics pipeline includes a chain of arithmetic logic units (ALUs) for processing pixel packets.

95. The graphics processor of claim 94, wherein the subset of tap points consists of two tap points associated with successive elements in the graphics pipeline.

96. The graphics processor of claim 95, wherein a valid signal indicating payload data flowing from a first element to a second element and a ready signal indicating whether the second element is able to receive the payload data are monitored at the two tap points.

97. The graphics processor of claim 96, wherein a transfer state is counted for each clock cycle in which a valid condition and a ready condition are detected, and a wait state is counted for each clock cycle in which a valid condition is detected and a ready condition is not.

98. The graphics processor of claim 94, wherein a transfer state and a wait state are monitored at each of two selected tap points associated with the ALUs.
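The identifier write of claims 88 to 92 amounts to a self-enumeration pass along the configured process flow. The Python sketch below is a minimal model under stated assumptions: the identifier starts at 0, the data packet carries nothing but the identifier, and the names `Element`, `handle_id_packet`, and `send_id_packet` are hypothetical.

```python
from typing import List, Optional


class Element:
    """Hypothetical pipeline element (for example an ALU) with one configuration register."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.config_register: Optional[int] = None

    def handle_id_packet(self, identifier: int) -> int:
        """Latch the current identifier, then return the incremented value to forward."""
        self.config_register = identifier
        return identifier + 1


def send_id_packet(flow: List[Element], start: int = 0) -> int:
    """Pass the identifier through the elements in process-flow order."""
    identifier = start          # assumed starting value, not taken from the claims
    for element in flow:
        identifier = element.handle_id_packet(identifier)
    return identifier           # value leaving the last element


alu0, alu1, alu2 = Element("alu0"), Element("alu1"), Element("alu2")

send_id_packet([alu0, alu1, alu2])                     # first process flow
first_ids = [e.config_register for e in (alu0, alu1, alu2)]

send_id_packet([alu2, alu0, alu1])                     # host selects a new execution order
second_ids = [e.config_register for e in (alu0, alu1, alu2)]
```

After each pass every element holds its relative position in the current flow, so no element needs to be told its position directly when the host changes the execution order.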

99. The graphics processor of claim 98, wherein the transfer state corresponds to a downstream ALU in a ready state and an upstream ALU in a valid state.

100. The graphics processor of claim 93, further comprising a trace register by which the host enables the collection of statistics.

101. The graphics processor of claim 93, wherein the configurable test point selector includes at least one counter.

102. A method of monitoring a graphics processor, comprising: receiving a command to monitor two test points associated with a first element capable of passing a payload to a second element; monitoring the two test points; and collecting statistics for at least two conditions associated with the first element and the second element.

103. The method of claim 102, wherein a first condition is a valid signal by which the first element indicates a payload for the second element, and a second condition is a ready signal by which the second element indicates that it is ready to receive a payload.

104. The method of claim 103, wherein a transfer state is counted for each clock cycle in which a valid signal and a ready signal are present, and a wait state is counted for each clock cycle in which a valid signal is present but no ready signal is present.

105. The method of claim 102, wherein the collection of statistics is performed for a preselected mode of operation of the graphics processor.

106. The method of claim 102, further comprising: receiving a command enabling the collection of statistics for the two test points.
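The counting rule of claims 97 and 104 can be written down in a few lines of Python. This is a sketch under assumptions: the valid and ready signals are sampled once per clock cycle as booleans, the `enabled` flag stands in for the trace register, and `TapCounters`, `clock`, and `collect` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class TapCounters:
    """Hypothetical counters for one pair of tap points (upstream valid, downstream ready)."""
    enabled: bool = False    # stands in for the trace register enable
    transfer: int = 0        # cycles with valid and ready both present
    wait: int = 0            # cycles with valid present but ready absent

    def clock(self, valid: bool, ready: bool) -> None:
        """Sample both signals for one clock cycle and update the counters."""
        if not self.enabled:
            return
        if valid and ready:
            self.transfer += 1
        elif valid:
            self.wait += 1
        # Cycles without a valid signal are idle and are not counted.


def collect(samples: Iterable[Tuple[bool, bool]], enabled: bool = True) -> TapCounters:
    """Run the counters over a recorded sequence of (valid, ready) samples."""
    counters = TapCounters(enabled=enabled)
    for valid, ready in samples:
        counters.clock(valid, ready)
    return counters


samples = [(True, True), (True, False), (False, True),
           (True, True), (True, False), (False, False)]
stats = collect(samples)
disabled_stats = collect(samples, enabled=False)
```

A high wait count relative to the transfer count at a given pair of tap points suggests that the downstream element is the bottleneck, which is the kind of information the software host would read back.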

107. The method of claim [number illegible], wherein a mode of operation of the graphics processor is selected for collecting statistics associated with a graphics operation, and the collection of statistics is enabled.

108. The method of claim 102, wherein the software host enables the collection of statistics through a value in a trace register.
