TWI297468B - Graphics processor, graphics system, embedded processor, method of performing a graphics processing operation, method of operating a graphics pipeline, method of performing a register write, and method of monitoring a graphics processor - Google Patents
- Publication number
- TWI297468B
- Authority
- TW
- Taiwan
- Prior art keywords
- pixel
- packet
- graphics
- alu
- phase
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/005—General purpose rendering architectures
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09G—ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
- G09G5/00—Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators
- G09G5/36—Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators characterised by the display of a graphic pattern, e.g. using an all-points-addressable [APA] memory
- G09G5/37—Details of the operation on graphic patterns
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- General Engineering & Computer Science (AREA)
- Computer Graphics (AREA)
- Image Generation (AREA)
- Image Processing (AREA)
- Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)
- Advance Control (AREA)
Description
IX. Description of the Invention

[Technical Field of the Invention]

The present invention generally relates to programmable processors. More particularly, the present invention is directed to a low-power programmable processor for graphics applications.

[Prior Art]

The generation of three-dimensional graphical images is of interest in a variety of electronic games and other applications. Conventionally, some of the steps used to create a three-dimensional image of a scene include generating a three-dimensional model of the objects to be displayed. Geometrical primitives (e.g., triangles) are formed, which are mapped to a two-dimensional projection along with depth information. Rendering (drawing) the primitives includes interpolating parameters, such as depth and color, over each two-dimensional projection of a primitive.

Graphics processing units (GPUs) are commonly used in graphics systems to generate three-dimensional images in response to instructions from a central processing unit. Modern GPUs typically use a graphics pipeline to process data. FIG. 1 is a prior art diagram of a traditional pipeline architecture, which is a "deep" pipeline having stages dedicated to performing specific functions. A transform stage 105 performs geometrical calculations on primitives and may also perform a clipping operation. A setup/raster stage 110 rasterizes the primitives. A texture address stage 115 and a texture fetch stage 120 are used for texture mapping. A fog stage 130 implements a fog algorithm. An alpha test stage 135 performs an alpha test. A depth test stage 140 performs a depth test for culling occluded pixels. An alpha blend stage 145 performs an alpha blend color combination algorithm.
A memory write stage 150 writes the output of the pipeline.

The traditional GPU pipeline architecture illustrated in FIG. 1 is typically optimized for fast texturing using the OpenGL® graphics language. A benefit of a deep pipeline architecture is that it permits fast, high-quality rendering of even complex scenes.

There is increasing interest in using three-dimensional graphics in wireless phones, personal digital assistants (PDAs), and other devices where cost and power consumption are important design requirements. However, the traditional deep pipeline architecture requires a significant chip area, resulting in greater cost than desired. Additionally, a deep pipeline consumes significant power even when its stages are performing comparatively little processing, because many stages consume about the same amount of power regardless of whether they are processing pixels.

Because of cost and power considerations, the conventional deep pipeline architecture illustrated in FIG. 1 is unsuitable for many graphics applications, such as implementing three-dimensional games on wireless phones and PDAs.

Therefore, what is desired is a processor architecture that is suitable for graphics processing applications but has reduced power and size requirements.

[Summary of the Invention]

A graphics processor includes a programmable arithmetic logic unit (ALU) stage for processing pixel packets. Scalar arithmetic operations are performed on the pixel packets in the ALU stage to implement graphics functions.
One embodiment of a method of performing a graphics processing operation on a pixel includes: identifying a sequence of scalar arithmetic operations to be performed on pixel packets to perform a graphics function; generating a plurality of pixel packets for the pixel, each pixel packet including a subset of the pixel attributes to be processed as operands in the sequence of scalar arithmetic operations; reading operands from the pixel packets in at least one ALU; and performing scalar arithmetic operations according to an instruction sequence to execute the sequence of scalar arithmetic operations.

One embodiment of a graphics processor includes a programmable ALU stage having ALUs for processing pixel packets, each ALU being programmed to perform scalar arithmetic operations on incoming pixel packets that have a current instruction, wherein a sequence of arithmetic operations is performed on the pixel packets to perform a graphics processing function.

A graphics processor includes an arithmetic logic unit (ALU) for performing scalar arithmetic operations on pixel packets. For selected scalar arithmetic operations, the operands in the pixel packets may be formatted to improve dynamic range. For at least one other scalar arithmetic operation, the pixel packets are formatted in another data format.

In one embodiment of a method, scalar arithmetic operations are identified that can be performed on pixel packets to implement a graphics function. For each pixel to be processed, at least one row of pixel packets is generated, each pixel packet including at least one field for a subset of the pixel attributes to be processed as operands, the at least one row having an associated instruction sequence. Assigned operands are read in each of a plurality of ALUs, at least one of the operands corresponding to an operand read from a pixel packet within a row of pixel packets.
In each ALU, a scalar arithmetic calculation is performed on the assigned operands according to the instruction sequence. For selected scalar arithmetic operations requiring a result in the range [0, 1], the corresponding operands of the pixel packet are formatted in an S1.8 format, which corresponds to a base-2 representation of operands in the range [-2, +2] with an 8-bit fractional component, and the result of the selected scalar arithmetic function is clamped to the range [0, 1]. For at least one other scalar arithmetic operation, the pixel packets are formatted in a different data format.

A graphics processor has elements of a graphics pipeline coupled by distributors. The distributors permit the process flow of pixel packets through the pipeline to be reconfigured in response to a command from a host.

In one embodiment of an apparatus, a graphics pipeline includes a plurality of stages. A first distributor is coupled to respective inputs of the plurality of stages. A second distributor is coupled to respective outputs of the plurality of stages. The first distributor and the second distributor are adapted to reconfigure the process flow of pixel packets through the plurality of stages in response to a command from a host.

In one embodiment of a method, a command is received from a software host to reconfigure the process flow of pixel packets through elements of a graphics pipeline. In response, at least one distributor is adjusted to reconfigure the pipeline from a first process flow to a second process flow.

A graphics processor includes an arithmetic logic unit stage for processing pixel packets. A pixel may require more than one row of pixel packets to be processed. Pixel packets of different pixels, such as odd and even pixels, are interleaved to account for ALU latency.
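The S1.8 operand format and the clamp to [0, 1] mentioned above can be illustrated with a small fixed-point sketch. The text does not specify the bit-level encoding, so the 10-bit two's-complement layout assumed here (which tops out just under +2), the rounding behavior, and every function name are assumptions made purely for illustration, not the patented implementation.

```python
FRAC_BITS = 8                 # 8-bit fractional component, as stated in the text
ONE = 1 << FRAC_BITS          # raw code for 1.0 (256)
RAW_MIN = -2 * ONE            # raw code for -2.0 (assumed two's-complement lower bound)
RAW_MAX = 2 * ONE - 1         # largest assumed 10-bit code, just under +2.0


def to_s1_8(value: float) -> int:
    """Quantize a real value to the assumed S1.8 code, saturating at the range limits."""
    raw = round(value * ONE)
    return max(RAW_MIN, min(RAW_MAX, raw))


def from_s1_8(raw: int) -> float:
    """Convert an S1.8 code back to a real value."""
    return raw / ONE


def clamp_unit(raw: int) -> int:
    """Clamp an S1.8 code to the range [0, 1]."""
    return max(0, min(ONE, raw))


def mul_add_clamped(a: int, b: int, c: int, d: int) -> int:
    """Compute (a*b)+(c*d) on S1.8 codes and clamp the result to [0, 1]."""
    acc = a * b + c * d                   # each product carries 16 fractional bits
    return clamp_unit(acc >> FRAC_BITS)   # drop back to 8 fractional bits, then clamp
```

In this sketch an operand such as 1.5 is representable even though the final result must lie in [0, 1], which is one way to read the dynamic-range benefit described above: the overshoot survives until the final clamp instead of being lost at the operand.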
In one embodiment of a method, a sequence of scalar arithmetic operations is identified that can be performed on pixel packets to implement a graphics function on a plurality of pixels. Pixels are assigned as either even pixels or odd pixels. At least two rows of pixel packets are generated for each pixel, each pixel packet including at least one field for a subset of the pixel attributes to be processed as operands in the sequence of scalar arithmetic operations, the at least two rows having an associated instruction sequence and an identifier indicating whether the pixel packet is for an odd pixel or an even pixel. Rows of pixel packets for even pixels and odd pixels are interleaved in a group of rows of pixel packets, with each row of the group assigned for processing in successive clock cycles. In the ALU stage, the row for the current clock cycle is received. Scalar arithmetic calculations are performed, according to the instruction sequence, on at least one operand read from the row of pixel packets, so that the processing of pixel packets is interleaved in the ALU stage.

A configurable graphics pipeline has a configurable process flow of pixel packets through the elements of the graphics pipeline. A data packet triggers the elements of the graphics pipeline to discover an identifier.

In one embodiment of a method, a received data packet triggers the elements of the graphics pipeline to discover, for each element, an identifier indicating the position of the element within the process flow. Each element writes the identifier into a configuration register indicating its relative position within the process flow. In one embodiment, an element reads the current value of the identifier in the data packet, writes the current value to its configuration register, increments the identifier, and forwards the data packet with the incremented identifier to the next element in the process flow.

A graphics processor includes a graphics pipeline. Tap points are associated with elements of the graphics pipeline. A configurable test point selector monitors a selected subset of the tap points and counts statistics for at least one condition associated with each tap point of the subset.
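The identifier discovery scheme described above (read the identifier, store it, increment it, forward the packet) can be sketched as follows. The class names, the starting identifier of 0, and the example stage names are hypothetical and chosen only for illustration; the text does not specify them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DiscoveryPacket:
    """Hypothetical data packet carrying the running identifier."""
    identifier: int = 0   # assumed starting value


@dataclass
class PipelineElement:
    """Hypothetical pipeline element with a configuration register for its position."""
    name: str
    config_register: Optional[int] = None

    def on_discovery(self, packet: DiscoveryPacket) -> DiscoveryPacket:
        self.config_register = packet.identifier  # read current value and store it
        packet.identifier += 1                    # increment for the next element
        return packet                             # forward to the next element


def discover_positions(flow: List[PipelineElement]) -> None:
    """Send one discovery packet through the elements in process-flow order."""
    packet = DiscoveryPacket()
    for element in flow:
        packet = element.on_discovery(packet)
```

If the process flow is later reconfigured, running `discover_positions` again on the new ordering would let each element relearn its relative position without a host writing every register individually.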
In one embodiment, the configurable test point selector operates in response to a command from a software host and collects statistics for the software host.

[Embodiments]

FIG. 2 is a block diagram of one embodiment of the present invention. A programmable graphics processor 205 is coupled to a register interface 210, a host interface 220, and a memory interface, such as a direct memory access (DMA) engine 230, for memory read/write operations with a graphics memory (not shown), such as a frame buffer. The host interface 220 permits the programmable graphics processor 205 to receive commands from a host for generating graphical images. For example, the host may send vertex data, commands, and program instructions to the programmable graphics processor 205. A memory interface, such as the DMA engine 230, permits read/write operations to be performed with the graphics memory (not shown). The register interface 210 provides an interface to the registers of the programmable graphics processor 205.

The programmable graphics processor 205 may be implemented as part of a system 290 that includes at least one other central processing unit 260 executing a software application 270, the central processing unit acting as the host for the programmable graphics processor 205. An exemplary system 290 may, for example, comprise a handheld unit, such as a mobile phone or a personal digital assistant (PDA). For example, the software application 270 may include a graphics application 275 for generating graphical images on a display 295. Additionally, as described below in more detail, in some embodiments the software application 270 may include a graphics processor management software application 280 for performing management functions associated with the programmable graphics processor 205, such as pipeline reconfiguration, register configuration, and testing.
In one embodiment, the programmable graphics processor 205, the register interface 210, the host interface 220, and the DMA engine 230 are part of an embedded graphics processing core 250 formed on a single integrated circuit 200 that includes a host, such as an integrated circuit 200 formed on a chip including a central processing unit 260 having software 270 resident in a memory. Alternatively, the graphics processing core 250 may be disposed on a first integrated circuit and the CPU 260 disposed on a second integrated circuit.

FIG. 3 illustrates in more detail a programmable graphics [...] according to one embodiment of the present invention. [...] stage 310 program instructions. The raster stage 310 processes each pixel of a given triangle and determines the parameters that need to be calculated for the pixel as part of rendering, such as color, texture, alpha test, alpha blend, z-depth test, and fog parameters. In one embodiment, the raster stage 310 calculates barycentric coefficients for pixel packets. In a barycentric coordinate system, distances in a triangle are measured with respect to its vertices. The use of barycentric coefficients reduces the required dynamic range, which permits the use of fixed-point calculations that require less power than floating-point calculations.

The raster stage 310 generates at least one pixel packet for each pixel of a triangle to be processed. Each pixel packet includes fields for a payload of the pixel attributes required for processing (e.g., color, texture, depth, fog, (x, y) location). Additionally, each pixel packet has associated sideband information, which includes an instruction sequence of operations to be performed on the pixel packet. An instruction area (not shown) in the raster stage 310 assigns instructions to pixel packets.

FIG. 4 illustrates exemplary pixel packets 430 and 460 for one pixel.
In one embodiment, the raster stage 310 partitions pixel attributes into two or more different types of pixel packets 430, 460, with each type of pixel packet needing only the fields for the pixel attributes that are operated on by a particular type of instruction. Partitioning the pixel data into smaller units of work reduces bandwidth requirements, and it also reduces processing requirements if, for example, only a subset of the pixel attributes needs to be operated on for a particular processing operation.

Each pixel packet has associated sideband information 410 and payload information 420. Exemplary sideband information includes a valid field 412, a kill field 414, a tag field, and an instruction field 416 that includes a current instruction. Exemplary pixel packet 430 includes fields for a first set of (s, t) texture coordinates 422 and 424 and a fog field 426. Exemplary pixel packet 460 includes a color field 462 and a second set of (s, t) texture coordinates 464 and 466. In one embodiment, each pixel packet represents the payload information 420 in a fixed-point representation. Examples of pixel attributes that may be included in a pixel packet (where the pixel packet size for pixel attributes is 20 bits) include: one sixteen-bit Z depth value; one 16-bit S/T texture coordinate with a 4-bit level of detail; a pair of color values, each with 8 bits of precision; or a packed 5555 ARGB color, with five bits in each ARGB variable.

The sideband information of a pixel packet may include the (x, y) location of the pixel. However, in one embodiment, a start span command is generated by the raster stage 310 at the (x, y) origin where it begins to walk across a triangle along a scan line. The use of a start span command permits the (x, y) location to be omitted from pixel packets. The start span command informs other entities (e.g., the data write stage and the data fetch stage 330) of the initial (x, y) location at the start of the scan line.
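The packet layout described above (sideband fields plus a small payload, with attributes partitioned across packet types) can be sketched as below. The field names, the two packet-type groupings, and the bit order used for the packed 5555 ARGB value are assumptions for illustration only; the text gives the field list but not a concrete encoding.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sideband:
    """Sideband fields named in the text: valid, kill, tag, current instruction."""
    valid: bool = True
    kill: bool = False
    tag: int = 0
    instruction: int = 0


@dataclass
class PixelPacket:
    sideband: Sideband
    payload: Dict[str, int] = field(default_factory=dict)


# Hypothetical grouping of attributes into two packet types, loosely following
# the two exemplary packets (texture coordinates + fog, color + texture coordinates).
PACKET_TYPES = {
    "texture_fog": ("s0", "t0", "fog"),
    "color_texture": ("color", "s1", "t1"),
}


def make_packets(attributes: Dict[str, int], instruction: int) -> List[PixelPacket]:
    """Partition one pixel's attributes into one packet per packet type."""
    packets = []
    for fields in PACKET_TYPES.values():
        payload = {name: attributes[name] for name in fields}
        packets.append(PixelPacket(Sideband(instruction=instruction), payload))
    return packets


def pack_argb5555(a: int, r: int, g: int, b: int) -> int:
    """Pack four 5-bit channels into 20 bits (A in the top bits: an assumed order)."""
    for channel in (a, r, g, b):
        if not 0 <= channel < 32:
            raise ValueError("each channel must fit in 5 bits")
    return (a << 15) | (r << 10) | (g << 5) | b
```

An instruction that only needs texture coordinates and fog would then touch a single small packet rather than the full attribute set, which is the bandwidth argument made above.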
The (x, y) location of the other pixels along the scan line can be inferred from the number of pixels by which a given pixel is away from the origin. In one embodiment, the data write stage and the data fetch stage 330 include local caches adapted to increment local counters and to update the (x, y) location based on a count of the number of pixels encountered after the start span command.

Referring to FIG. 5, in one embodiment the raster stage 310 generates at least one row 510 of pixel packets for each pixel to be processed. In some embodiments, each row 510 has common sideband information that defines an instruction sequence for the row 510. If a pixel requires more than one row 510, the rows 510 are organized as a group of rows that are processed in succession with each new clock cycle. In one embodiment, 80 bits of pixel data are partitioned into four 20-bit pixel attribute register values, with the four pixel register values defining a "row" 510 of pixel registers (R0, R1, R2, and R3) for the pixel.

The iterator register pool (not shown) of the raster stage 310 has corresponding registers to support the rows 510 of pixel packets. In one embodiment, the raster stage 310 includes a register pool supporting up to four rows of pixel packets. Some types of pixel packet attributes, such as texture, may require high precision. Conversely, some types of pixel packet attributes, such as color, may require less precision. The register pool may be configured to support high-precision and low-precision values for each pixel packet in a row 510. In one embodiment, the register pool includes, per row, four high-precision and four low-precision perspective-corrected iterated values plus a Z depth value. This permits, for example, software to assign the precision of the iterator used for processing a particular pixel packet attribute.
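The start span scheme described above, in which later stages infer each pixel's (x, y) location by counting pixels from the span origin, can be sketched as a small counter. The class name and the assumption that pixels advance by +1 in x along the scan line are illustrative choices, not details given in the text.

```python
from typing import Optional, Tuple


class SpanPositionTracker:
    """Hypothetical local counter kept by a stage such as data fetch or data write."""

    def __init__(self) -> None:
        self.origin: Optional[Tuple[int, int]] = None
        self.count = 0

    def start_span(self, x: int, y: int) -> None:
        """Record the (x, y) origin announced by a start span command."""
        self.origin = (x, y)
        self.count = 0

    def next_pixel(self) -> Tuple[int, int]:
        """Return the inferred location of the next pixel packet in the span."""
        if self.origin is None:
            raise RuntimeError("no start span command received yet")
        x0, y = self.origin
        position = (x0 + self.count, y)   # assumed: one step in +x per pixel
        self.count += 1
        return position
```

Because every stage can rebuild the location locally, the pixel packets themselves do not need to carry (x, y) fields.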
In one embodiment, the raster stage 310 includes a register pool adapted to keep track of the integer portion of textures, permitting the fractional bits of the texture to be sent as a data packet.

The raster stage 310 may, for example, receive instructions from the host that require an operation to be performed on pixels. In response, the raster stage 310 generates one or more rows 510 of pixel packets having an associated instruction sequence, with the rows of pixel packets and the instructions arranged to perform the desired processing operation. As described below in more detail, in one embodiment the ALU stage 340 permits scalar arithmetic operations to be performed in which the operands include a pre-selected subset of the pixel attributes within a row 510 of pixel packets, constant values, and temporarily stored results of previous calculations on pixel packets.

A variety of graphics operations can be formulated as one or more scalar arithmetic operations. Additionally, a variety of vector graphics operations can be formulated as a plurality of scalar arithmetic operations. It will therefore be understood that the programmable graphics processor 205 of the present invention may be programmed to perform on pixels any graphics operation that can be expressed as a sequence of scalar arithmetic operations, such as a fog operation, color (alpha) blending, texture combining, an alpha test, or a depth test, such as the operations described in the OpenGL® Graphics System: A Specification (Version 1.2), the contents of which are incorporated herein by reference. For example, in response to the raster stage 310 detecting a desired graphics processing function to be performed on a pixel (e.g., a fog operation), the raster stage 310 may use a programmable mapping table or mapping algorithm to determine an assignment of pixel packets and associated instructions for performing the scalar arithmetic operations required to implement the graphics function on the pixel. The mapping may, for example, be programmed by the graphics processor management software application 280.

Referring again to FIG. 3, as each pixel of a triangle is walked by the raster stage 310, the raster stage 310 generates pixel packets for further processing, which are received by a gateway manager stage 320. The gateway manager stage 320 performs a data flow control function. In one embodiment, the gateway manager stage 320 has an associated scoreboard 325 for scheduling, load balancing, resource allocation, and hazard avoidance of pixel packets. The scoreboard 325 tracks the entry and retirement of pixels. Pixel packets entering the gateway manager stage 320 set the scoreboard, and the scoreboard is reset as the pixel packets drain out of the programmable graphics processor 205 after processing is completed. As an illustrative example, if a compact display 295 has an area of 128x32 pixels, the scoreboard 325 may maintain a table for each pixel of the display to monitor pixels.

The scoreboard 325 provides several benefits. For example, when one pixel in a triangle
is located on top of another pixel that is being processed and is in flight, the scoreboard 325 prevents a hazard. In one embodiment, the scoreboard 325 monitors idle conditions and uses scoreboarding information to record idle unit time. For example, if there are no valid pixels, the scoreboard 325 may turn off the ALUs to save power. As described below in more detail, the scoreboard 325 tracks pixel packets that are capable of being processed by the ALUs 350 along with pixel packets that have the kill bit set, such that those pixel packets flow through the ALUs 350 without being actively processed. In one embodiment, the scoreboard 325 tracks the (x, y) locations of recirculated pixel packets. If a pixel packet is recirculated, the scoreboard 325 increments the instruction sequence in the pixel packet to the next instruction for the pixel on the subsequent pass; for example, if the instruction on pass number 1 is for a fog operation, the instruction is iterated on pass number 2 to an alpha blend operation.

The data fetch stage 330 fetches data for the pixel packets passed on by the gateway manager 320. This may include, for example, fetching color, depth, and texture data by performing the appropriate color, depth, or texture data reads for each row of pixel packets. The data fetch stage 330 may, for example, fetch pixel or texel data by requesting a read from a memory interface (e.g., reading a frame buffer (not shown) using the DMA engine 230). In one embodiment, the data fetch stage 330 may also manage local caches, such as a texture/fog cache 332, a color/depth cache 334, and a Z cache for depth data (not shown). The fetched data is placed in the corresponding pixel packet fields before the pixel packet is sent on to the next stage. In one embodiment, the data fetch stage 330 includes an instruction random access memory (RAM) having instructions for accessing the data required by the pixel packet attribute fields. In some embodiments, the data fetch stage 330 also performs a Z-depth test. In this embodiment, the data fetch stage 330 compares the Z-depth value of a pixel packet with stored Z values using one or more depth comparison tests. If the Z-depth value of the pixel indicates that the pixel is occluded, the kill bit is set.

Rows of pixel packets enter an arithmetic logic unit (ALU) stage 340 for processing. The ALU stage 340 has a set of ALUs 350 including at least one ALU 350, such as ALUs 350-0, 350-1, 350-2, and 350-3. Although four ALUs 350 are illustrated, more or fewer ALUs 350 may be used in the ALU stage 340 depending on the application. An individual ALU 350 reads the current instruction for at least one row 510 of pixel packets and implements any instruction to perform a scalar arithmetic operation that the ALU is programmed to support. The instructions are included in each ALU 350 and may, for example, be stored in a local instruction RAM (not shown in FIG. 3).

Each ALU 350 includes instructions for performing at least one arithmetic operation on a first operand product (a*b) and a second operand product (c*d), where a, b, c, and d are operands and * denotes multiplication. Some or all of the operands may correspond, for example, to register value attributes within a row 510 of pixel packets. An ALU 350 may also have one or more operands that are constants or software-loadable values. In some embodiments, an ALU may support the use of temporarily stored results from previous operations on pixel packets.

In one embodiment, each ALU 350 is programmable. A crossbar (not shown) or other programmable selector may be included within the ALU 350 to permit the operands and the destination of the result to be selected in response to instructions from software (e.g., software application 270).
For example, in one embodiment, an operation command code can be used to select the source of each operand (a, b, c, d) from any of the attributes of pixel packet column 510, the temporary values, and the constant values. In this embodiment, the operation command also indicates where ALU 350 sends the result of the arithmetic operation, such as using the result to update the pixel packet, storing the result as a temporary value, or both updating the pixel packet and storing the result as a temporary value. Thus, for example, an ALU can be programmed to read specific attributes within a pixel packet as operands and to apply the scalar arithmetic operation indicated by the current instruction. The operation command code may also include commands to complement an operand (e.g., compute 1-x, where x is the value read), to negate an operand (e.g., compute -x, where x is the value read), or to clamp an operand or a result. Other examples of operation command codes include, for example, commands to select the data format.
An example of an arithmetic operation performed by ALU 350 is a scalar arithmetic operation of the form (a*b)+(c*d) on at least one variable within a pixel packet, where a, b, c, and d are operands and * is multiplication. Preferably, each ALU 350 can also be programmed to perform other mathematical operations, such as complementing and negating operands. In addition, in some embodiments, each ALU 350 can compute the minimum and maximum of (a*b, c*d) and perform logical comparisons (e.g., the logical result of whether a*b is equal to, not equal to, less than, or less than or equal to c*d).
In some embodiments, each ALU 350 may also include instructions for determining whether to generate a cancel bit in cancel field 414 based on a test, such as a comparison of a*b with c*d (e.g., cancel if a*b is not equal to c*d, cancel if a*b is equal to c*d, cancel if a*b is less than c*d, or cancel if a*b is greater than or equal to c*d).
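The programmable (a*b)+(c*d) operation, with its optional complement, negate, and clamp modifiers and its comparison-based cancel test, can be sketched in software as follows. This is only an illustrative Python model under stated assumptions: the names `Operand`, `alu_op`, and `cancel_bit` are hypothetical, floating-point values stand in for the hardware data formats, and the order in which complement and negate are applied when both are requested is an assumption.

```python
import operator
from dataclasses import dataclass


@dataclass
class Operand:
    """One ALU operand plus its optional modifiers (hypothetical model)."""
    value: float
    complement: bool = False  # use 1 - x instead of x
    negate: bool = False      # use -x instead of x

    def resolve(self) -> float:
        x = self.value
        if self.complement:
            x = 1.0 - x
        if self.negate:       # assumption: negate is applied after complement
            x = -x
        return x


def clamp(x: float, lo: float = 0.0, hi: float = 1.0) -> float:
    """Force x into the range [lo, hi]."""
    return max(lo, min(hi, x))


def alu_op(a: Operand, b: Operand, c: Operand, d: Operand,
           clamp_result: bool = False) -> float:
    """Scalar operation of the form (a*b)+(c*d), optionally clamped to [0, 1]."""
    result = a.resolve() * b.resolve() + c.resolve() * d.resolve()
    return clamp(result) if clamp_result else result


# The four comparisons the text lists as possible cancel tests.
CANCEL_TESTS = {
    "ne": operator.ne,  # cancel if a*b != c*d
    "eq": operator.eq,  # cancel if a*b == c*d
    "lt": operator.lt,  # cancel if a*b <  c*d
    "ge": operator.ge,  # cancel if a*b >= c*d
}


def cancel_bit(a: Operand, b: Operand, c: Operand, d: Operand, test: str) -> bool:
    """Return True when the selected comparison of a*b with c*d says 'cancel'."""
    return CANCEL_TESTS[test](a.resolve() * b.resolve(),
                              c.resolve() * d.resolve())
```

For instance, a blend of the form source*t + destination*(1-t) maps onto a single call by passing `t` as operand `b` and the same `t` with `complement=True` as operand `d`.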
Examples of ALU operations that can generate a cancel bit include a transparency test in which a color value is compared with a test color value, such as the expression IF (transparency > transparency reference) THEN cancel the pixel, where transparency is the color value and transparency reference is the reference color value. Another example of an ALU operation that can generate a cancel bit is a Z depth test, in which the Z value of a pixel is compared with at least one Z value of a previous pixel having the same position, and the pixel is canceled if the depth test indicates that the pixel is occluded.
In one embodiment, if the cancel bit is set in a pixel packet, an individual ALU 350 is disabled with respect to processing that pixel packet. In one embodiment, a clock gating mechanism is used to disable ALU 350 when a cancel bit is detected in the sideband information. As a result, once a cancel bit has been generated for a pixel packet, ALUs 350 do not waste power on that pixel packet as it propagates through ALU stage 340. Note, however, that a pixel packet with the cancel bit set still propagates onward, which allows it to be accounted for by data write stage 355 and scoreboard 325. In this way all pixel packets can be accounted for by scoreboard 325, even those pixel packets marked by the cancel bit as requiring no further ALU processing.
In one embodiment, if any column 510 of a pixel is marked by a cancel bit, the other columns 510 of the same pixel are also canceled. This can be accomplished, for example, by forwarding cancel information between stages, or by having one or more stages keep track of the pixels for which a column 510 has been marked by a cancel bit. In some embodiments, once the cancel bit is set, only the sideband information 410 of pixel packet column 510 (which includes the cancel bit) is propagated to the next stage.
The output of ALU stage 340 goes to data write stage 355.
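The cancel-bit behavior described above (canceled columns are not processed, sibling columns of the same pixel are canceled too, yet every packet still flows onward so that it can be accounted for) can be illustrated with a small software model. This is a sketch under assumptions, not the hardware mechanism: `PacketColumn`, `alu_stage`, and `data_write` are hypothetical names, the scoreboard is modeled as a plain set of in-flight pixel positions, and clock gating is modeled simply as skipping the operation.

```python
from dataclasses import dataclass, field


@dataclass
class PacketColumn:
    """One column of a pixel packet (hypothetical model)."""
    pixel_xy: tuple                 # (x, y) position of the pixel
    cancel: bool = False            # sideband cancel bit
    payload: dict = field(default_factory=dict)


def alu_stage(columns, op):
    """Apply op to every column that is not canceled.

    Canceled columns are skipped (standing in for clock gating) but are
    still returned, so later stages can account for them.
    """
    canceled = {col.pixel_xy for col in columns if col.cancel}
    active_cycles = 0
    for col in columns:
        if col.pixel_xy in canceled:
            col.cancel = True       # propagate cancel to sibling columns
            continue                # no ALU work spent on this column
        op(col)
        active_cycles += 1
        if col.cancel:              # op itself may have set the cancel bit
            canceled.add(col.pixel_xy)
    return columns, active_cycles


def data_write(columns, scoreboard, framebuffer):
    """Write surviving pixels and retire every pixel from the scoreboard."""
    canceled = {col.pixel_xy for col in columns if col.cancel}
    for col in columns:
        if col.pixel_xy not in canceled:
            framebuffer.setdefault(col.pixel_xy, {}).update(col.payload)
        scoreboard.discard(col.pixel_xy)  # canceled pixels are retired too
```

In this model a pixel with one canceled column costs no ALU work at all, while its scoreboard entry is still cleared when the packet reaches the write step.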
Data write stage 355 converts the processed pixel packets into pixel data and writes the result to a memory interface (e.g., via DMA engine 230). In one embodiment, write values for pixels are accumulated in write buffer 352, and the accumulated pixel writes are written to memory in batches. Examples of functions that data write stage 355 may perform include color and depth write-back and format conversion. In some embodiments, data write stage 355 may also identify pixels to be canceled and set the cancel bit.
A recirculation path 360 is included to recirculate pixel packets back to gateway manager 320. Recirculation path 360 permits, for example, processing that requires more than one pass through ALU stage 340 to perform a sequence of arithmetic operations. Data write stage 355 indicates retired writes to gateway manager stage 320 for scoreboarding.
FIG. 6 is a block diagram of an exemplary individual ALU 350. ALU 350 has an input bus 605 with data buses for receiving a pixel packet column 510 in corresponding registers R0, R1, R2, and R3. An instruction RAM 610 is included for ALU instructions. An exemplary instruction set is illustrated in block 620. In one embodiment, ALU 350 can be programmed to read any of the four 20-bit register values from column 510 and to select a set of operands from column 510. In addition, ALU 350 can be programmed to select temporary values from registers (T) 630 as operands, such as two 20-bit temporary values per ALU 350, which may be temporarily stored from previous results, as indicated by path 640. ALU 350 can also select constant values (not shown) as operands; these may also be programmed by software. In one embodiment, a first multiplexer (MUX) stage 645 selects operands from the pixel packet column, any temporary values 630, and any constant values (not shown).
A format conversion module 650 may be included to convert the operands into the desired data format suited to the computational precision of ALU 350 in arithmetic calculation unit 670. ALU 350 includes elements that allow each operand, or its complement, to be selected in a second MUX stage 660. The resulting four operands are input to scalar arithmetic calculation unit 670, which can perform two multiplications and one addition. The resulting value can optionally be clamped to a desired range (e.g., 0 to 1.0) using clamp 680. Pixel packet column 510 exits on bus 690.
In one embodiment, selected pixel packet attributes may be in a signed 1.8 (S1.8) format. The S1.8 format is a base 2 number with an 8-bit fraction that is in the range of [-2 to +2]. The S1.8 format permits a higher dynamic range for calculations. For example, in calculations that process lighting, the S1.8 format permits an increased dynamic range, resulting in improved realism. If the result of a scalar arithmetic operation performed in S1.8 must be in the range [0, 1], the result can be clamped to force it into the range [0, 1]. As an illustrative example, a calculation on color data may be performed in S1.8 format and the result then clamped.
Note that in embodiments of the present invention, different types of pixel packets may have data attributes represented in different formats. For example, color data may be represented in a first type of pixel packet in S1.8 format, while (s, t) texture data may be represented in a second type of pixel packet in a high-precision format. In some embodiments, the pixel packet bit size is set by the bit size requirement of the highest-precision pixel attribute. For example, since texture attributes generally require greater precision than color, the pixel packet size can be set to represent texture data with high precision, such as 16-bit texture data.
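A fixed-point format with one sign bit, one integer bit, and eight fraction bits, together with the clamp to [0, 1] mentioned above, can be sketched as follows. The two's-complement 10-bit encoding, the saturating conversion, and the helper names `to_s1_8`, `from_s1_8`, and `clamp_unit` are assumptions made for illustration; the text itself only states that the format has an 8-bit fraction and a range of about -2 to +2.

```python
FRACTION_BITS = 8
TOTAL_BITS = 10                        # assumed: sign + 1 integer + 8 fraction
SCALE = 1 << FRACTION_BITS             # 256 raw steps per 1.0
MIN_RAW = -(1 << (TOTAL_BITS - 1))     # -512 encodes -2.0
MAX_RAW = (1 << (TOTAL_BITS - 1)) - 1  # 511 encodes 1.99609375


def to_s1_8(x: float) -> int:
    """Convert a real value to a raw S1.8 integer, saturating at the limits."""
    raw = int(round(x * SCALE))
    return max(MIN_RAW, min(MAX_RAW, raw))


def from_s1_8(raw: int) -> float:
    """Convert a raw S1.8 integer back to a real value."""
    return raw / SCALE


def clamp_unit(raw: int) -> int:
    """Clamp a raw S1.8 value into the range [0, 1], i.e. raw 0..256."""
    return max(0, min(SCALE, raw))
```

With this encoding an intermediate lighting value such as 1.5 is representable (raw 384) and only the final result is clamped to 1.0 (raw 256), which is the headroom benefit the passage attributes to the format.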
The improved dynamic range of the S1.8 format permits, for example, data for more than one color component to be packed efficiently into a 20-bit pixel packet size, a size selected for higher-precision texture data that requires, for example, 16 bits of texture data and 4 bits of level of detail (LOD). For example, since each S1.8 color component requires 10 bits, two color components can be packed into a 20-bit pixel packet.
FIG. 7 illustrates an exemplary ALU stage 340 that includes more than one ALU 350 arranged as a pipeline, in which two or more ALUs 350 are chained together. As previously described, an individual ALU 350 can be programmed to read one or more operands from a pixel packet, generate the result of an arithmetic operation, and use the result to update the pixel packet or a temporary register. Each ALU can be assigned to read operands, generate an arithmetic result, and update one or more pixel packet or temporary values before passing the pixel packet column on to the next ALU.
Depending on the processing operations to be performed, ALU latency, and efficiency considerations, the data flow between the ALUs 350 in ALU stage 340 can be configured in various ways. As previously described, the present invention permits each ALU to be programmed to read selected operands within a pixel packet column and to use the result to update a selected pixel packet register. In one embodiment, ALU stage 340 includes at least one ALU 350 for each color channel (e.g., red, green, blue, and transparency). This permits, for example, load balancing in which the ALUs are configured to operate in parallel on pixel packet columns 510 (albeit at different points in time because of pipelining) to perform similar or different processing tasks.
As an example of how ALUs 350 can be programmed, a first ALU 350-0 can be programmed to perform the calculation for a first color component, a second ALU 350-1 can be programmed to perform the operation for a second color component, a third ALU 350-2 can be programmed to perform the operation for a third color component, and a fourth ALU 350-3 can be programmed to perform a fog operation. Thus, in some embodiments, each ALU 350 can be assigned a different processing task for a pixel packet column 510. In addition, as described in more detail below, in some embodiments software can configure ALUs 350 to select the data flow of ALUs 350 within ALU stage 340, including the execution order of ALUs 350. Because the data flow is configurable, it will be appreciated that in some embodiments the data flow along a chain of ALUs can be arranged so that the result of one ALU 350-0 updates one or more values that are then read as operands by the subsequent ALU 350-1.
FIG. 8 is a block diagram of an embodiment of a portion of programmable graphics processor 205 having a reconfigurable pipeline, in which the processing flow of pixel packets through the stages can be reconfigured in response to software commands. A synchronization technique is therefore preferably used to coordinate the flow of pixel packets that are in flight during a change from one configuration to another; that is, synchronization is performed so that in-flight pixel packets intended to be processed in a first configuration complete their processing before the configuration is changed to a second configuration.
In one embodiment, data extraction stage 830, data write stage 855, and the individual ALUs 850 each have an input connected to a first distributor 890 and an output connected to a second distributor 895. Each of distributors 890 and 895 may, for example, comprise a switch, crossbar, router, or MUX circuit to select the distribution flow of incoming pixel packets to data extraction stage 830, ALUs 850, and data write stage 855.
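The idea of a software-selected stage order, combined with the rule that packets already in flight finish under the old configuration before a new one takes effect, can be modeled in a few lines. This is a simplified software analogy rather than the distributor hardware: the class `ReconfigurablePipeline`, its methods, and the queue that stands in for in-flight packets are all assumptions made for the example.

```python
from collections import deque


class ReconfigurablePipeline:
    """Toy model: stages are callables, and their order is set by software."""

    def __init__(self, stages, order):
        self.stages = dict(stages)   # stage name -> callable(packet) -> packet
        self.order = list(order)     # current processing order
        self.pending = deque()       # packets "in flight"
        self.done = []               # packets that finished processing

    def submit(self, packet):
        """Queue a packet for processing under the current configuration."""
        self.pending.append(packet)

    def drain(self):
        """Process every pending packet using the current stage order."""
        while self.pending:
            packet = self.pending.popleft()
            for name in self.order:
                packet = self.stages[name](packet)
            self.done.append(packet)

    def reconfigure(self, new_order):
        """Switch to a new stage order after in-flight packets complete."""
        unknown = [name for name in new_order if name not in self.stages]
        if unknown:
            raise ValueError(f"unknown stages: {unknown}")
        self.drain()                 # synchronization point
        self.order = list(new_order)
```

A stage can be bypassed in this model simply by leaving its name out of `new_order`, and the same stage set can be reordered without changing any stage implementation.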
Distributors 890 and 895 determine the data path of incoming pixel packets 810 through data extraction stage 830, data write stage 855, and the individual ALUs 850. Signal inputs 892 and 894 allow distributors 890 and 895 to receive software commands (e.g., from a software application executing on a CPU) to reconfigure the distribution of pixel packets among data extraction stage 830, data write stage 855, and ALUs 850. One example of reconfiguration is assigning the execution order of ALUs 850. Another example of reconfiguration is bypassing data extraction stage 830 if it is determined that a data extraction stage is not needed for a particular processing task. As yet another example of reconfiguration, it may be desirable to change the order in which data extraction stage 830 is coupled to the ALUs. As another example, it may be desirable to reorder data write stage 855. As an illustrative example, there may be cases in which it is more efficient to operate on texture coordinates before data extraction, in which case the data flow is arranged so that data extraction stage 830 receives pixel packets after ALUs 850 have performed the texture operation. One benefit of a reconfigurable pipeline is therefore that a software application can reconfigure programmable graphics processor 205 to increase efficiency.
Referring again to FIG. 5, as previously discussed, raster stage 310 generates pixel packet columns 510 for processing. The columns 510 may further be arranged as a group 520 of columns, such as a sequence of four columns 510, which are passed on for processing in successive clock cycles. However, some operations that can be performed on a pixel packet column 510 may require the result of an arithmetic operation on another pixel packet column. Thus, in one embodiment, raster stage 310 arranges pixel packets in a group 520 of columns so as to account for data dependencies. As an illustrative example, if a texture operation on one pixel packet requires the result of another pixel packet column, group 520 is arranged so that the pixel packet with the dependent texture operation is placed in a later column.
Referring to FIG. 9, in one embodiment, pixels are alternately designated as odd or even by raster stage 310. The corresponding registers (R0, R1, R2, and R3) of each pixel column are correspondingly designated as even or odd. One or more rules are then used to interleave the columns 905 of even pixel packets with the columns 910 of odd pixels to avoid data dependencies.
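One simple interleaving rule, alternating the columns of an even pixel with those of an odd pixel, can be sketched as below. The function name `interleave` and the `(parity, index, column)` tuples are hypothetical choices for illustration; the text does not specify the exact rule or data layout the raster stage uses.

```python
def interleave(even_columns, odd_columns):
    """Alternate columns of an even pixel with columns of an odd pixel.

    Each output entry is (parity, column_index, column), so consecutive
    columns of the same pixel end up separated by a column of the other pixel.
    """
    out = []
    for i in range(max(len(even_columns), len(odd_columns))):
        if i < len(even_columns):
            out.append(("even", i, even_columns[i]))
        if i < len(odd_columns):
            out.append(("odd", i, odd_columns[i]))
    return out
```

With two columns per pixel the issue order becomes even column 0, odd column 0, even column 1, odd column 1, so a column that depends on the previous column of the same pixel is issued one slot later than it would be without interleaving.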
Interleaving every other column provides additional clock cycles to account for ALU latency. Thus, if column 0 of an even pixel needs two clock cycles to produce the required result, the interleaved column 0 of the odd pixel supplies the additional clock cycle of time that the ALU latency requires. As an illustrative example, consider a multi-texture operation in which column 0 of an even pixel is a blend operation and column 1 of the same pixel corresponds to a blend with a second texture that requires the result of the first blend operation. If the ALU latency of the first operation is two clock cycles, the interleaving allows the result of the first blend operation to be available to the operation that uses it.
In an interleaved embodiment, sideband information is preferably included to coordinate the interleaved data. For example, in one embodiment, the sideband information in each pixel packet includes an even/odd field to distinguish even columns from odd columns. Each ALU 350 may also include two sets of temporary registers, corresponding to even pixels and odd pixels, to provide the appropriate temporary values for even and odd pixel packets. The even/odd field is used to select the appropriate set of temporary registers, for example an even temporary register set for an even pixel and an odd temporary register set for an odd pixel. In one embodiment, even and odd pixels share the constant registers, which reduces the total storage required for constant values. In one embodiment, a software host can set a temporary register to a constant value for an extended period of time in order to emulate a constant register. Although interleaving two pixels is one embodiment, it will be understood that the interleaving can be extended to interleave more than two pixels if, for example, the ALU latency corresponds to more than two clock cycles.
One benefit of raster stage 310 interleaving pixel packets is that the hardware takes ALU latency into account, which reduces the burden on software to deal with the ALU latency that would otherwise arise if, for example, raster stage 310 did not interleave pixels.
As previously discussed, in a configurable pipeline the data flow of ALUs 350 can be configured. For example, in hardware, each ALU 350 may be substantially identical. However, a particular ALU can be configured to have more than one possible position in the data flow, e.g., a different order of execution.
An identifier therefore needs to be provided in each ALU 350 to indicate its position within the data flow. Each ALU 350 could be given an identifier, for example, by a direct register write to each ALU 350. However, this approach has the disadvantage of requiring significant software overhead. Thus, in one embodiment, a packet technique is used to trigger the elements that need configuration information to discover their relative position within the processing flow and to write a corresponding identifier into a local register.
Referring to FIG. 10, in one embodiment, the register address space of ALUs 350 is software configurable using a packet initialization technique, in which a data packet is used to communicate an identification (ID) to each ALU 350. Each ALU 350 may, for example, include a conventional network module for receiving and forwarding data packets. In one embodiment, ID packet 1010 is initiated by a software application. ID packet 1010 contains an initial ID code, such as a number. ID packet 1010 is injected into the graphics pipeline at a point before the elements that need an ID code, and is then passed to the subsequent elements of the processing flow defined by the current pipeline configuration. In one embodiment, configuration register 1020 in the first ALU 350 receives the ID packet, the current value of the ID code is written into that configuration register, and the ID code of the ID packet is then incremented before the ID packet is passed on to the next ALU. The process continues, with each subsequent ALU 350 writing the current value of the ID code into its configuration register and then passing the ID packet with an incremented ID code to the next ALU. It will be understood that other stages along the data flow path may have configuration registers that are set in a similar way.
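The ID packet scheme described above amounts to a counter that is written and incremented at each element it visits. The sketch below models that behavior; the class `Element`, its `config_register` attribute, and the function `assign_ids` are hypothetical names, and the ID packet is reduced to a bare integer for simplicity.

```python
class Element:
    """A pipeline element (e.g., an ALU) with one configuration register."""

    def __init__(self, name):
        self.name = name
        self.config_register = None  # unset until an ID packet arrives

    def receive_id_packet(self, id_code):
        """Latch the current ID code and return the incremented code to forward."""
        self.config_register = id_code
        return id_code + 1


def assign_ids(flow, initial_id=0):
    """Pass one ID packet through the elements in their current flow order."""
    id_code = initial_id
    for element in flow:
        id_code = element.receive_id_packet(id_code)
    return id_code  # value the packet carries after the last element
```

Because each element learns its position from the packet alone, changing the flow order and sending a fresh ID packet renumbers the same, identical elements without any per-element register writes from software.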
For example, the elements in the configuration flow may also include a data extraction stage or a data write stage, which likewise has a configuration register that is set by reading the ID packet, and which increments the ID code before passing the ID packet with the incremented ID to the next element in the configuration flow. One benefit of this form of register configuration is that it requires no hardware differences between ALU 350 units, which permits software to reconfigure the data flow through the pipeline. Thus, for example, in one embodiment, graphics processor management application 280 need only generate the initial ID packet 1010, such as by issuing a command via host interface 220 to generate ID packet 1010, the command being received by ID packet generator 1030.
In an alternative embodiment, a broadcast packet technique is used to write ID codes into the configuration registers, triggering the elements whose configuration registers need to be written to discover their IDs. In this embodiment, the elements (e.g., ALUs 350) may use a network protocol to discover their IDs. A broadcast packet technique can be used, for example, in embodiments in which the pipeline is branched so that branches of the pipeline process pixels in parallel.
FIG. 11 illustrates an embodiment that includes diagnostic monitoring capability. In one embodiment, there is a series of taps along the elements of graphics processor 205, such as taps associated with each ALU 350 and with data extraction stage 330. Taps may also be included at other stages. A configurable test point selector 1105 is adapted to permit selected taps, such as the two taps 1120 and 1130, to be monitored in response to software commands, such as commands from graphics processor management application 280. Configurable test point selector 1105 may be implemented, for example, using multiplexers.
In one embodiment, at least one counter is included for statistics collection at each selected test point. In one embodiment, an instrumentation packet generated by software provides information about the taps to be monitored and enables counting for the selected test points. In addition, an instrumentation register can be configured to enable statistics collection based on the operating mode of the pipeline (e.g., an instrumentation register may be provided to allow software to enable counting for particular types of graphics operations, such as enabling statistics counting when a transparency blending operation occurs). One benefit of configurable test point selector 1105 is that it allows software, such as a graphics processor management application, to have statistics collected only for the test points of interest, which reduces hardware complexity and cost while still allowing software to analyze any part of the behavior of programmable processor 205. Test points of interest can be selected, for example, to collect statistics associated with the ALUs 350 that process a particular kind of data, such as the ALUs 350 that process texture data. In addition, statistics collection can be enabled for particular graphics operations, such as transparency blending.
In one embodiment, configurable test point selector 1105 uses a three-wire protocol. Each element that has payload data, such as ALU 350-0, generates a valid signal, which may, for example, flow downstream to the next element (e.g., ALU 350-1). An element that is ready to receive a payload generates a ready signal, which may, for example, flow upstream to the previous element. If an element is not ready to receive a payload, however, the element generates a not-ready signal, which may, for example, correspond to the ready signal not being asserted.
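As a rough illustration of how valid and ready signals sampled at a pair of taps might be turned into statistics, the sketch below counts clock cycles in which a valid payload is accepted (valid and ready both asserted) and cycles in which a valid payload is stalled (valid asserted, ready not asserted). The function name `count_states`, the `enabled` flag, and the list-of-tuples trace format are assumptions made for the example, not part of the described hardware.

```python
def count_states(samples, enabled=True):
    """Count transfer and wait cycles from per-clock (valid, ready) samples.

    samples: iterable of (valid, ready) pairs, one pair per clock cycle.
    enabled: models a monitoring enable; nothing is counted when False.
    Returns (transfer_cycles, wait_cycles).
    """
    transfer = 0
    wait = 0
    if not enabled:
        return transfer, wait
    for valid, ready in samples:
        if valid and ready:
            transfer += 1   # payload moves to the downstream element
        elif valid:
            wait += 1       # payload is stalled by a not-ready downstream
        # cycles without a valid payload are idle and are not counted here
    return transfer, wait
```

A high wait count relative to the transfer count at a given pair of taps would suggest that the downstream element is a bottleneck for the workload being measured.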
The enable signal corresponds to an element being enabled for monitoring. This monitoring is enabled under software control by means of a monitoring enable control bit stored in a pipeline register adjacent to the monitored point. The enable signal directly taps the element that generates the signal or the element that receives the signal. The valid, ready, and not ready signals at the selected tap points can be used to determine the different states of a transfer. A transfer state corresponds to a clock tick having a valid payload (i.e., a set valid bit) for the downstream block and a ready signal from the downstream block to receive the data (e.g., at tap point 1120, a valid signal from ALU-0, and at tap point 1130, a ready signal from ALU-1). A wait state corresponds to a clock tick having a valid payload that is blocked because the following block is not ready to receive data (e.g., at tap point 1120, a valid signal from ALU-0, and at tap point 1130, a not ready signal from ALU-1). Thus, in one embodiment, statistics on selected tap points can be collected, such as by counting the number of clock cycles in which the transfer and wait states are detected.

Embodiments of the present invention provide various benefits that are useful in the embedded graphics processor core 250. In an embedded system 290, such as a compact, low-power handheld system, power, space, and CPU capability can be fairly limited. In one embodiment, the ALUs 350 are clock gated when processing is not required (e.g., by detecting a kill bit), thereby reducing processing power requirements. In addition, the raster stage 310 only needs to generate pixel packets for the subset of pixel data that is processed, thereby also reducing power requirements. The programmable ALU stage 340 requires a smaller chip area than a conventional pipeline having dedicated stages for performing dedicated graphics functions, thereby reducing cost.
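The transfer and wait states defined above reduce to a simple classification of each clock tick from the sampled valid and ready signals. The sketch below illustrates that classification; the names `TapCounters` and `count_states` and the sample trace are hypothetical, and a `ready` value of `False` stands in for the not ready signal (an unasserted ready).

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class TapCounters:
    transfer: int = 0  # ticks with a valid payload and a ready downstream block
    wait: int = 0      # ticks with a valid payload blocked by a not-ready block


def count_states(samples: Iterable[Tuple[bool, bool]], enable: bool = True) -> TapCounters:
    """Classify each clock tick from (valid, ready) samples at a pair of tap points.

    `valid` is sampled at the upstream tap (e.g. tap point 1120) and `ready`
    at the downstream tap (e.g. tap point 1130).
    """
    counters = TapCounters()
    if not enable:
        # Monitoring enable bit cleared: nothing is counted.
        return counters
    for valid, ready in samples:
        if valid and ready:
            counters.transfer += 1
        elif valid:
            counters.wait += 1
        # Ticks without a valid payload are neither transfer nor wait.
    return counters


# Five clock ticks: transfer, wait, wait, idle, transfer.
trace = [(True, True), (True, False), (True, False), (False, True), (True, True)]
result = count_states(trace)
disabled = count_states(trace, enable=False)
```

A high wait count relative to the transfer count at a given pair of taps would indicate that the downstream element is the one stalling the pipeline at that point.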
The programmable processor 205 can be implemented as a software-configurable block, providing improved efficiency. Test monitoring can be configured to test only a subset of test points, thereby reducing the bandwidth and analysis requirements of the software. These and other previously described features make the graphics processor 205 attractive for use in the embedded graphics processor core 250.

The above description, for the purpose of explanation, is intended to provide a thorough understanding of the invention. However, it will be apparent to those skilled in the art that the specific details are not required in order to practice the invention. Accordingly, the above description of the specific embodiments of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the above teachings. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, so that others skilled in the art can best utilize the invention and its various embodiments with modifications suited to the particular use contemplated. The claims below and their equivalents are intended to define the scope of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram of a prior art pipeline for three-dimensional graphics;
FIG. 2 is a block diagram of an integrated circuit including a programmable graphics processor in accordance with an embodiment of the present invention;
FIG. 3 is a block diagram of a programmable graphics processor in accordance with an embodiment of the present invention;
FIG. 4 illustrates an exemplary pixel packet in accordance with an embodiment of the present invention;
FIG. 5 illustrates an exemplary arrangement of pixel packets into rows of pixel packets forming a group, in accordance with an embodiment of the present invention;
FIG. 6 is a block diagram of a single arithmetic logic unit in accordance with an embodiment of the present invention;
FIG. 7 is a block diagram of a sequence of two arithmetic logic units in accordance with an embodiment of the present invention;
FIG. 8 is a block diagram of a configurable programmable graphics processor in accordance with an embodiment of the present invention;
FIG. 9 illustrates interleaving of rows of pixel packets in accordance with an embodiment of the present invention;
FIG. 10 is a block diagram illustrating arithmetic logic units having configuration registers in accordance with an embodiment of the present invention; and
FIG. 11 is a block diagram illustrating a configurable test point selector in accordance with an embodiment of the present invention.

Like reference numerals refer to corresponding parts throughout the several views of the drawings.

[Description of Main Reference Numerals]

Reference Numeral | Element |
---|---|
105 | Transform stage |
110 | Setup/raster stage |
115 | Texture address stage |
120 | Texture fetch stage |
130 | Fog stage |
135 | Alpha test stage |
140 | Depth test stage |
145 | Alpha blend stage |
150 | Memory write stage |
200 | Integrated circuit |
205 | Graphics processor |
210 | Register interface |
220 | Host interface |
230 | Direct memory access (DMA) engine |
250 | Graphics processing core |
260 | Central processing unit |
270 | Software application |
275 | Graphics application |
280 | Graphics processor management software application |
290 | System |
295 | Display |
305 | Setup stage |
308 | Vertex buffer |
310 | Raster stage |
320 | Gatekeeper stage |
325 | Scoreboard |
330 | Data fetch stage |
331 | Texture/fog cache |
334 | Color/depth cache |
340 | Arithmetic logic unit (ALU) stage |
350 | Arithmetic logic unit (ALU) |
350-0, 350-1, 350-2, 350-3 | ALU |
355 | Data write stage |
360 | Recirculation path |
410 | Sideband information |
412 | Valid field |
414 | Kill field |
416 | Instruction field |
420 | Payload information |
422, 424 | First set of (s,t) texture coordinates |
426 | Fog field |
430, 460 | Pixel packet |
462 | Color field |
464, 466 | Second set of (s,t) texture coordinates |
510 | Row of pixel packets |
520 | Group |
605 | Input bus |
610 | Instruction RAM |
620 | Block |
630 | Register (T) |
640 | Path |
645 | First multiplexer (MUX) stage |
650 | Format conversion module |
660 | Second MUX stage |
670 | Arithmetic computation unit |
680 | Clamper |
690 | Bus |
810 | Pixel packet |
830 | Data fetch stage |
850 | ALU |
855 | Data write stage |
890, 895 | Distributor |
892 | Signal input |
905 | Even row of pixel packets |
910 | Odd row |
1010 | ID packet |
1020 | Configuration register |
1030 | ID packet generator |
1105 | Configurable test point selector |
1110 | Counter |
1120, 1130 | Tap/tap point |
R0, R1, R2, R3 | Pixel packet |
Claims (1)
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/846,226 US7268786B2 (en) | 2004-05-14 | 2004-05-14 | Reconfigurable pipeline for low power programmable processor |
US10/845,714 US7250953B2 (en) | 2004-05-14 | 2004-05-14 | Statistics instrumentation for low power programmable processor |
US10/846,097 US7091982B2 (en) | 2004-05-14 | 2004-05-14 | Low power programmable processor |
US10/846,106 US7389006B2 (en) | 2004-05-14 | 2004-05-14 | Auto software configurable register address space for low power programmable processor |
US10/846,110 US7142214B2 (en) | 2004-05-14 | 2004-05-14 | Data format for low power programmable processor |
US10/846,334 US7199799B2 (en) | 2004-05-14 | 2004-05-14 | Interleaving of pixels for low power programmable processor |
Publications (2)
Publication Number | Publication Date |
---|---|
TW200609842A TW200609842A (en) | 2006-03-16 |
TWI297468B true TWI297468B (en) | 2008-06-01 |
Family
ID=35429081
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW094115854A TWI297468B (en) | 2004-05-14 | 2005-05-16 | Graphics processor, graphics system, embedded processor, method of performing a graphics processing operation, method of operating a graphics pipeline, method of performing a register write, and method of monitoring a graphics processor |
Country Status (6)
Country | Link |
---|---|
EP (1) | EP1759380B1 (en) |
JP (1) | JP4914829B2 (en) |
KR (1) | KR100865811B1 (en) |
AT (1) | ATE534114T1 (en) |
TW (1) | TWI297468B (en) |
WO (1) | WO2005114646A2 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8570332B2 (en) | 2009-05-25 | 2013-10-29 | Institute For Information Industry | Graphics processing system with power-gating control function, power-gating control method, and computer program products thereof |
TWI457843B (en) * | 2010-04-30 | 2014-10-21 | Applied Materials Inc | Methods for monitoring processing equipment, and computer readable medium for recording related instructions thereon |
US10430919B2 (en) | 2017-05-12 | 2019-10-01 | Google Llc | Determination of per line buffer unit memory allocation |
Families Citing this family (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8687010B1 (en) | 2004-05-14 | 2014-04-01 | Nvidia Corporation | Arbitrary size texture palettes for use in graphics systems |
US8743142B1 (en) | 2004-05-14 | 2014-06-03 | Nvidia Corporation | Unified data fetch graphics processing system and method |
US8736620B2 (en) | 2004-05-14 | 2014-05-27 | Nvidia Corporation | Kill bit graphics processing system and method |
US7091982B2 (en) | 2004-05-14 | 2006-08-15 | Nvidia Corporation | Low power programmable processor |
US8736628B1 (en) | 2004-05-14 | 2014-05-27 | Nvidia Corporation | Single thread graphics processing system and method |
US7805589B2 (en) * | 2006-08-31 | 2010-09-28 | Qualcomm Incorporated | Relative address generation |
US8537168B1 (en) | 2006-11-02 | 2013-09-17 | Nvidia Corporation | Method and system for deferred coverage mask generation in a raster stage |
US8314803B2 (en) | 2007-08-15 | 2012-11-20 | Nvidia Corporation | Buffering deserialized pixel data in a graphics processor unit pipeline |
US8736624B1 (en) | 2007-08-15 | 2014-05-27 | Nvidia Corporation | Conditional execution flag in graphics applications |
US8775777B2 (en) | 2007-08-15 | 2014-07-08 | Nvidia Corporation | Techniques for sourcing immediate values from a VLIW |
US9183607B1 (en) | 2007-08-15 | 2015-11-10 | Nvidia Corporation | Scoreboard cache coherence in a graphics pipeline |
US20090046105A1 (en) * | 2007-08-15 | 2009-02-19 | Bergland Tyson J | Conditional execute bit in a graphics processor unit pipeline |
US8521800B1 (en) | 2007-08-15 | 2013-08-27 | Nvidia Corporation | Interconnected arithmetic logic units |
US8599208B2 (en) | 2007-08-15 | 2013-12-03 | Nvidia Corporation | Shared readable and writeable global values in a graphics processor unit pipeline |
US8698823B2 (en) * | 2009-04-08 | 2014-04-15 | Nvidia Corporation | System and method for deadlock-free pipelining |
US8471858B2 (en) * | 2009-06-02 | 2013-06-25 | Qualcomm Incorporated | Displaying a visual representation of performance metrics for rendered graphics elements |
US9411595B2 (en) | 2012-05-31 | 2016-08-09 | Nvidia Corporation | Multi-threaded transactional memory coherence |
US9824009B2 (en) | 2012-12-21 | 2017-11-21 | Nvidia Corporation | Information coherency maintenance systems and methods |
US10102142B2 (en) | 2012-12-26 | 2018-10-16 | Nvidia Corporation | Virtual address based memory reordering |
US9317251B2 (en) | 2012-12-31 | 2016-04-19 | Nvidia Corporation | Efficient correction of normalizer shift amount errors in fused multiply add operations |
US9805478B2 (en) | 2013-08-14 | 2017-10-31 | Arm Limited | Compositing plural layer of image data for display |
GB2517185B (en) * | 2013-08-14 | 2020-03-04 | Advanced Risc Mach Ltd | Graphics tile compositing control |
US9569385B2 (en) | 2013-09-09 | 2017-02-14 | Nvidia Corporation | Memory transaction ordering |
US10483981B2 (en) * | 2016-12-30 | 2019-11-19 | Microsoft Technology Licensing, Llc | Highspeed/low power symbol compare |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5230039A (en) * | 1991-02-19 | 1993-07-20 | Silicon Graphics, Inc. | Texture range controls for improved texture mapping |
US5611038A (en) * | 1991-04-17 | 1997-03-11 | Shaw; Venson M. | Audio/video transceiver provided with a device for reconfiguration of incompatibly received or transmitted video and audio information |
US7068272B1 (en) * | 2000-05-31 | 2006-06-27 | Nvidia Corporation | System, method and article of manufacture for Z-value and stencil culling prior to rendering in a computer graphics processing pipeline |
US6771264B1 (en) * | 1998-08-20 | 2004-08-03 | Apple Computer, Inc. | Method and apparatus for performing tangent space lighting and bump mapping in a deferred shading graphics processor |
WO2001042903A1 (en) * | 1999-12-07 | 2001-06-14 | Hitachi, Ltd. | Data processing apparatus and data processing system |
JP2003030641A (en) * | 2001-07-19 | 2003-01-31 | Nec System Technologies Ltd | Plotting device, parallel plotting method therefor and parallel plotting program |
US6909432B2 (en) * | 2002-02-27 | 2005-06-21 | Hewlett-Packard Development Company, L.P. | Centralized scalable resource architecture and system |
-
2005
- 2005-05-13 AT AT05749863T patent/ATE534114T1/en active
- 2005-05-13 JP JP2007513444A patent/JP4914829B2/en active Active
- 2005-05-13 EP EP05749863A patent/EP1759380B1/en active Active
- 2005-05-13 WO PCT/US2005/016967 patent/WO2005114646A2/en active Application Filing
- 2005-05-13 KR KR1020067023690A patent/KR100865811B1/en active IP Right Grant
- 2005-05-16 TW TW094115854A patent/TWI297468B/en active
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8570332B2 (en) | 2009-05-25 | 2013-10-29 | Institute For Information Industry | Graphics processing system with power-gating control function, power-gating control method, and computer program products thereof |
TWI457843B (en) * | 2010-04-30 | 2014-10-21 | Applied Materials Inc | Methods for monitoring processing equipment, and computer readable medium for recording related instructions thereon |
US10430919B2 (en) | 2017-05-12 | 2019-10-01 | Google Llc | Determination of per line buffer unit memory allocation |
TWI684132B (en) * | 2017-05-12 | 2020-02-01 | 美商谷歌有限責任公司 | Determination of per line buffer unit memory allocation |
US10685423B2 (en) | 2017-05-12 | 2020-06-16 | Google Llc | Determination of per line buffer unit memory allocation |
TWI750557B (en) * | 2017-05-12 | 2021-12-21 | 美商谷歌有限責任公司 | Determination of per line buffer unit memory allocation |
Also Published As
Publication number | Publication date |
---|---|
EP1759380B1 (en) | 2011-11-16 |
WO2005114646A2 (en) | 2005-12-01 |
EP1759380A2 (en) | 2007-03-07 |
KR20070028368A (en) | 2007-03-12 |
KR100865811B1 (en) | 2008-10-28 |
WO2005114646A3 (en) | 2007-05-24 |
EP1759380A4 (en) | 2009-01-21 |
ATE534114T1 (en) | 2011-12-15 |
JP4914829B2 (en) | 2012-04-11 |
TW200609842A (en) | 2006-03-16 |
JP2007538319A (en) | 2007-12-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
TWI297468B (en) | Graphics processor, graphics system, embedded processor, method of performing a graphics processing operation, method of operating a graphics pipeline, method of performing a register write, and method of monitoring a graphics processor | |
US7969446B2 (en) | Method for operating low power programmable processor | |
JP4639232B2 (en) | Improved scalability in fragment shading pipeline | |
EP3274966B1 (en) | Facilitating true three-dimensional virtual representation of real objects using dynamic three-dimensional shapes | |
US20080204461A1 (en) | Auto Software Configurable Register Address Space For Low Power Programmable Processor | |
CN101620725B (en) | Hybrid multisample/supersample antialiasing | |
CN108206937B (en) | Method and device for improving intelligent analysis performance | |
US9916634B2 (en) | Facilitating efficient graphics command generation and execution for improved graphics performance at computing devices | |
CN109658492A (en) | For the geometry based on the rendering system pieced together to the moderator that tiles | |
US9990691B2 (en) | Ray compression for efficient processing of graphics data at computing devices | |
US20190286563A1 (en) | Apparatus and method for improved cache utilization and efficiency on a many core processor | |
TWI596569B (en) | Facilitating dynamic and efficient pre-launch clipping for partially-obscured graphics images on computing devices | |
US10796483B2 (en) | Identifying primitives in input index stream | |
CN106575221A (en) | Method and apparatus for unstructured control flow for SIMD execution engine | |
EP3374961A2 (en) | Facilitating efficeint centralized rendering of viewpoint-agnostic graphics workloads at computing devices | |
TW200912798A (en) | Systems and methods for managing texture data in computer | |
US20050253873A1 (en) | Interleaving of pixels for low power programmable processor | |
CN109478137B (en) | Apparatus and method for shared resource partitioning by credit management | |
US7250953B2 (en) | Statistics instrumentation for low power programmable processor | |
WO2017112030A1 (en) | Clustered color compression for efficient processing of graphics data at computing devices | |
TW201137786A (en) | System and method for improving throughput of a graphics processing unit | |
US11748911B2 (en) | Shader function based pixel count determination | |
Fuetterling et al. | Accelerated single ray tracing for wide vector units | |
TWI616844B (en) | Facilitating culling of composite objects in graphics processing units when such objects produce no visible change in graphics images | |
Claus et al. | High performance FPGA based optical flow calculation using the census transformation |