TWI297468B - Graphics processor, graphics system, embedded processor, method of performing a graphics processing operation, method of operating a graphics pipeline, method of performing a register write, and method of monitoring a graphics processor - Google Patents


Info

Publication number
TWI297468B
Authority
TW
Taiwan
Prior art keywords
pixel
packet
graphics
alu
phase
Prior art date
Application number
TW094115854A
Other languages
Chinese (zh)
Other versions
TW200609842A (en)
Inventor
Edward A Hutchins
Brian K Angell
Paul Kim
Original Assignee
Nvidia Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US10/846,226 (US7268786B2)
Priority claimed from US10/845,714 (US7250953B2)
Priority claimed from US10/846,097 (US7091982B2)
Priority claimed from US10/846,106 (US7389006B2)
Priority claimed from US10/846,110 (US7142214B2)
Priority claimed from US10/846,334 (US7199799B2)
Application filed by Nvidia Corp
Publication of TW200609842A
Application granted
Publication of TWI297468B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/005 General purpose rendering architectures
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09G ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
    • G09G5/00 Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators
    • G09G5/36 Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators characterised by the display of a graphic pattern, e.g. using an all-points-addressable [APA] memory
    • G09G5/37 Details of the operation on graphic patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Graphics (AREA)
  • Image Generation (AREA)
  • Image Processing (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)
  • Advance Control (AREA)

Abstract

A graphics processor has a programmable Arithmetic Logic Unit (ALU) capable of scalar arithmetic operations for processing pixel packets. Pixel packets may be formatted in an S1.8 format to improve dynamic range, or in a different data format (see FIG. 3). The graphics processor may be implemented as a configurable graphics pipeline, in which distributors couple elements of the graphics pipeline so that the process flow of pixel packets through the pipeline can be reconfigured in response to a command from a host, and in which a data packet triggers an element of the graphics pipeline to discover an identifier. A configurable test point selector may be used to monitor a selected subset of tap points of the graphics pipeline and to count statistics for at least one condition associated with each tap point of the subset. Pixels may be assigned as even pixels or odd pixels, and the pixel packets of odd and even pixels are then interleaved to account for ALU latency.

Description

IX. Description of the Invention:

[Technical Field of the Invention]

The present invention relates generally to programmable processors. More particularly, the present invention is directed to low-power programmable processors for graphics applications.

[Prior Art]

The generation of three-dimensional graphical images is of interest in a variety of electronic games and other applications. Conventionally, some of the steps used to create a three-dimensional image of a scene include generating a three-dimensional model of the objects to be displayed. Geometrical primitives (e.g., triangles) are formed and are mapped to a two-dimensional projection along with depth information. Rendering (drawing) the primitives includes interpolating parameters, such as depth and color, over each two-dimensional projection of a primitive.

Graphics processing units (GPUs) are commonly used in graphics systems to generate three-dimensional images in response to instructions from a central processing unit. Modern GPUs typically use a graphics pipeline to process data. FIG. 1 is a prior art drawing of a traditional pipeline architecture, which is a "deep" pipeline having stages dedicated to performing specific functions. A transform stage 105 performs geometrical calculations of primitives and may also perform a clipping operation. A setup/raster stage 110 rasterizes the primitives. A texture address stage 115 and a texture fetch stage 120 are used for texture mapping. A fog stage 130 implements a fog algorithm. An alpha test stage 135 performs an alpha test. A depth test stage 140 performs a depth test for culling occluded pixels. An alpha blend stage 145 performs an alpha blend color combination algorithm.
A memory write stage 150 writes the output of the pipeline.

The traditional GPU pipeline architecture illustrated in FIG. 1 is typically optimized for fast texturing using the OpenGL® graphics language. A benefit of a deep pipeline architecture is that it permits fast, high-quality rendering of even complex scenes.

There is increasing interest in using three-dimensional graphics in wireless phones, personal digital assistants (PDAs), and other devices where cost and power consumption are important design requirements. However, the traditional deep pipeline architecture requires a significant chip area, resulting in greater cost than desired. Additionally, a deep pipeline consumes significant power even if the stages are performing comparatively little processing. This is because many of the stages consume about the same amount of power regardless of whether they are processing pixels.

As a result of cost and power considerations, the conventional deep pipeline architecture illustrated in FIG. 1 is unsuitable for many graphics applications, such as implementing three-dimensional games on wireless phones and PDAs.

Therefore, what is desired is a processor architecture suitable for graphics processing applications but with reduced power and size requirements.

[Summary of the Invention]

A graphics processor includes a programmable Arithmetic Logic Unit (ALU) stage for processing pixel packets. Scalar arithmetic operations are performed on pixel packets in the ALU stage to implement a graphics function.
One embodiment of a method of performing a graphics processing operation on a pixel includes: identifying a sequence of scalar arithmetic operations to be performed on pixel packets to carry out a graphics function; generating a plurality of pixel packets for the pixel, each pixel packet including a subset of the pixel's attributes to be processed as operands in the sequence of scalar arithmetic operations; reading operands from pixel packets in at least one ALU; and performing scalar arithmetic operations according to an instruction sequence to execute the sequence of scalar arithmetic operations.

One embodiment of a graphics processor includes a programmable ALU stage having ALUs for processing pixel packets, each ALU being programmed to perform at least one scalar arithmetic operation on an incoming pixel packet that has a current instruction, wherein a sequence of arithmetic operations is performed on pixel packets to carry out a graphics processing function.

A graphics processor includes an arithmetic logic unit (ALU) for performing scalar arithmetic operations on pixel packets. For selected scalar arithmetic operations, the operands in the pixel packets may be formatted to improve dynamic range. For at least one other scalar arithmetic operation, pixel packets are formatted in another data format.

In one embodiment of a method, scalar arithmetic operations are identified that can be performed on pixel packets to implement a graphics function. At least one row of pixel packets is generated for each pixel to be processed, each pixel packet including at least one field for a subset of pixel attributes to be processed as operands, and the at least one row having an associated instruction sequence. Assigned operands are read in each of a plurality of ALUs, at least one of the operands corresponding to an operand read from a pixel packet within a row of pixel packets.
In each ALU, a scalar arithmetic calculation is performed on the assigned operands according to the instruction sequence. For selected scalar arithmetic operations requiring a result in the range [0, 1], the corresponding operands of the pixel packets are formatted in an S1.8 format, which corresponds to a base-2 representation of operands in the range [-2, +2] with an 8-bit fractional component, and the result of the selected scalar arithmetic operation is clamped to the range [0, 1]. For at least one other scalar arithmetic operation, pixel packets are formatted in a different data format.

A graphics processor has elements of a graphics pipeline coupled by distributors. The distributors permit the process flow of pixel packets through the pipeline to be reconfigured in response to a command from a host.

In one embodiment of an apparatus, a graphics pipeline includes a plurality of stages. A first distributor is coupled to the respective inputs of the plurality of stages. A second distributor is coupled to the respective outputs of the plurality of stages. The first distributor and the second distributor are adapted to reconfigure the process flow of pixel packets through the plurality of stages in response to a command from a host.

In one embodiment of a method, a command is received from a software host to reconfigure the process flow of pixel packets through the elements of a graphics pipeline. In response, at least one distributor is adjusted to reconfigure the pipeline from a first process flow to a second process flow.

A graphics processor includes an arithmetic logic unit (ALU) for processing pixel packets. A pixel may require more than one row of pixel packets to be processed. The pixel packets of different pixels, such as odd pixels and even pixels, are interleaved to account for ALU latency.
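The S1.8 fixed-point format described in the summary above (operands in the range [-2, +2] with an 8-bit fractional component, results clamped to [0, 1]) can be illustrated with a short sketch. This is only a minimal illustration under stated assumptions: it assumes a two's-complement encoding with one sign bit, one integer bit, and eight fractional bits, and the names `to_s1_8`, `from_s1_8`, `mul_s1_8`, and `clamp01` are hypothetical helpers, not names taken from the patent.

```python
# Sketch of an S1.8-style fixed-point format (assumed two's complement:
# 1 sign bit, 1 integer bit, 8 fractional bits). All names are hypothetical.

FRAC_BITS = 8                  # eight fractional bits
SCALE = 1 << FRAC_BITS         # 256 raw steps per 1.0
RAW_MIN = -2 * SCALE           # raw encoding of -2.0
RAW_MAX = 2 * SCALE - 1        # largest raw value, just below +2.0


def saturate(raw: int) -> int:
    """Keep a raw value inside the representable S1.8 range."""
    return max(RAW_MIN, min(RAW_MAX, raw))


def to_s1_8(x: float) -> int:
    """Quantize a real number to a raw S1.8 integer."""
    return saturate(round(x * SCALE))


def from_s1_8(raw: int) -> float:
    """Convert a raw S1.8 integer back to a real number."""
    return raw / SCALE


def mul_s1_8(a: int, b: int) -> int:
    """Multiply two raw S1.8 values, keeping eight fractional bits."""
    return saturate((a * b) >> FRAC_BITS)


def clamp01(raw: int) -> int:
    """Clamp a raw S1.8 value to the [0, 1] range."""
    return max(0, min(SCALE, raw))


# An intermediate value of 1.5 is outside [0, 1] but still representable,
# which is the extra dynamic range the format provides.
intermediate = to_s1_8(1.5)
halved = mul_s1_8(intermediate, to_s1_8(0.5))
final = clamp01(intermediate)
```

In this sketch the intermediate 1.5 survives until the final clamp, whereas a format limited to [0, 1] would have lost it before the multiplication.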

In one embodiment of a method, a sequence of scalar arithmetic operations is identified that can be performed on pixel packets to implement a graphics function on a plurality of pixels. Pixels are assigned as even pixels or odd pixels. At least two rows of pixel packets are generated for each pixel, each pixel packet including at least one field for a subset of pixel attributes to be processed as operands in the sequence of scalar arithmetic operations, and the at least two rows having an associated instruction sequence and an identifier indicating whether a pixel packet is for an odd pixel or an even pixel. Rows of pixel packets for even pixels and odd pixels are interleaved in a group of pixel packet rows, with each row in the group assigned for processing in successive clock cycles. In the ALU stage, the row for the current clock cycle is received. A scalar arithmetic calculation is performed, according to the instruction sequence, on at least one operand read from the row of pixel packets, so that the processing of pixel packets is interleaved in the ALU stage.

A graphics pipeline has a configurable process flow of pixel packets through the elements of the graphics pipeline. A data packet triggers the elements of the graphics pipeline to discover an identifier.

In one embodiment of a method, a received data packet triggers the graphics pipeline to discover, for each element, an identifier indicating the position of the element within the process flow. Each element writes the identifier into a configuration register indicating its relative position within the process flow. In one embodiment, an element reads the current value of the identifier in the data packet, writes the current value to its configuration register, increments the identifier, and forwards the data packet with the incremented identifier to the next element of the process flow.

A graphics processor includes a graphics pipeline. Tap points are associated with elements of the graphics pipeline. A configurable test point selector monitors a selected subset of tap points and counts statistics for at least one condition associated with each tap point of the subset.
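The identifier discovery described above (read the identifier from the data packet, write it to a configuration register, increment it, forward the packet) can be sketched as follows. This is a software model under stated assumptions, not the patented hardware: the class names `DiscoveryPacket` and `PipelineElement`, the element names, and the choice to start numbering at 0 are all hypothetical.

```python
# Sketch of position discovery in a reconfigurable pipeline.
# All class, field, and element names are hypothetical illustrations.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DiscoveryPacket:
    identifier: int = 0  # assumed to start at 0 for the first element


@dataclass
class PipelineElement:
    name: str
    config_register: Optional[int] = None  # relative position, once discovered

    def handle(self, packet: DiscoveryPacket) -> DiscoveryPacket:
        # Record the current identifier as this element's position...
        self.config_register = packet.identifier
        # ...then increment it and pass the packet on to the next element.
        packet.identifier += 1
        return packet


def discover(flow: List[PipelineElement]) -> None:
    """Send one discovery packet through the elements in process-flow order."""
    packet = DiscoveryPacket()
    for element in flow:
        packet = element.handle(packet)


fetch = PipelineElement("fetch")
alu = PipelineElement("alu")
write = PipelineElement("write")

discover([fetch, alu, write])
first_order = (fetch.config_register, alu.config_register, write.config_register)

# After a (hypothetical) reconfiguration, rediscovery updates every register.
discover([alu, fetch, write])
second_order = (fetch.config_register, alu.config_register, write.config_register)
```

The point of the design is that no element needs to be told its position by the host; each one learns it from the packet that passes through.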
In one embodiment, the configurable test point selector operates in response to commands from a software host and collects statistics for the software host.

[Embodiments]

FIG. 2 is a block diagram of one embodiment of the present invention. A programmable graphics processor 205 is coupled to a register interface 210, a host interface 220, and a memory interface, such as a direct memory access (DMA) engine 230, for memory read/write operations with a graphics memory (not shown), such as a frame buffer. Host interface 220 permits programmable graphics processor 205 to receive commands from a host for generating graphical images. For example, the host may send vertex data, commands, and program instructions to programmable graphics processor 205. A memory interface, such as DMA engine 230, permits read/write operations to be performed with the graphics memory (not shown). Register interface 210 provides an interface to the registers of programmable graphics processor 205.

Programmable graphics processor 205 may be implemented as part of a system 290 that includes at least one other central processing unit 260 executing a software application 270, which acts as the host for programmable graphics processor 205. An exemplary system 290 may, for example, comprise a handheld unit, such as a mobile phone or personal digital assistant (PDA). For example, software application 270 may include a graphics application 275 for generating graphical images on a display 295. Additionally, as described below in more detail, in some embodiments software application 270 may include a graphics processor management software application 280 for performing management functions associated with programmable graphics processor 205, such as pipeline reconfiguration, register configuration, and testing.
In one embodiment, programmable graphics processor 205, register interface 210, host interface 220, and DMA engine 230 are part of an embedded graphics processing core 250 formed on a single integrated circuit 200 that includes a host, such as an integrated circuit 200 formed on a chip that includes a central processing unit 260 having software 270 resident in a memory. Alternatively, graphics processing core 250 may be disposed on a first integrated circuit and CPU 260 on a second integrated circuit.

FIG. 3 illustrates in detail a programmable graphics [...] stage 310 programmed instructions. Raster stage 310 processes each pixel of a given triangle and determines the parameters that need to be calculated for a pixel as part of rendering, such as color, texture, alpha test, alpha blend, Z depth test, and fog parameters. In one embodiment, raster stage 310 calculates barycentric coefficients for pixel packets. In a barycentric coordinate system, distances in a triangle are measured with respect to its vertices. Using barycentric coefficients reduces the required dynamic range, which permits fixed-point calculations that require less power than floating-point calculations.

Raster stage 310 generates at least one pixel packet for each pixel of a triangle that is to be processed. Each pixel packet includes fields for a payload of the pixel attributes required for processing (e.g., color, texture, depth, fog, and (x, y) location). Additionally, each pixel packet has associated sideband information, including an instruction sequence of operations to be performed on the pixel packet. An instruction area (not shown) in raster stage 310 assigns instructions to pixel packets.

FIG. 4 illustrates exemplary pixel packets 430 and 460 for one pixel.
In one embodiment, raster stage 310 partitions pixel attributes into two or more different types of pixel packets 430 and 460, with each type of pixel packet including fields only for the pixel attributes required by the types of instructions that act on it. Partitioning pixel data into smaller units of work reduces bandwidth requirements, and it also reduces processing requirements if, for example, only a subset of the attributes of a pixel needs to be operated on for a particular processing operation.

Each pixel packet has associated sideband information 410 and payload information 420. Exemplary sideband information includes a valid field 412, a kill field 414, a tag field, and an instruction field 416 that includes a current instruction. Exemplary pixel packet 430 includes fields for a first set of (s, t) texture coordinates 422 and 424 and a fog field 426. Exemplary pixel packet 460 includes a color field 462 and a second set of (s, t) texture coordinates 464 and [...]. In one embodiment, each pixel packet represents payload information 420 in a fixed-point representation. Examples of pixel attributes that may be included in a pixel packet (where the pixel packet size for pixel attributes is [...] bits) include: one 16-bit Z depth value; a 16-bit s/t texture coordinate with a 4-bit level of detail; a pair of color values, each with 8 bits of precision; or packed 5555 ARGB color, with five bits in each ARGB variable.

The sideband information of a pixel packet may include the (x, y) location of the pixel. However, in one embodiment, a start span command is generated by raster stage 310 at the (x, y) origin where it begins to walk across a triangle along a scan line. Using a start span command permits the (x, y) location to be omitted from pixel packets. The start span command informs other entities (e.g., the data write stage and data fetch stage 330) of the initial (x, y) location at the start of a scan line.
The (x, y) location of the other pixels along the scan line can be inferred from the number of pixels by which a given pixel is away from the origin. In one embodiment, the data write stage and data fetch stage 330 include local caches adapted to increment local counters and to update the (x, y) location based on a count of the number of pixels encountered after a start span command.

Referring to FIG. 5, in one embodiment raster stage 310 generates at least one row 510 of pixel packets for each pixel to be processed. In some embodiments, each row 510 has common sideband information that defines an instruction sequence for the row 510. If a pixel requires more than one row 510, the rows 510 are organized as a group of rows that are processed in succession with each new clock cycle. In one embodiment, [...]-bit pixel data is divided into four [...]-bit pixel attribute register values, where the four pixel register values (R0, R1, R2, and R3) define a "row" 510 of pixel packets for a pixel.

An iterator register pool (not shown) of raster stage 310 has corresponding registers to support rows 510 of pixel packets. In one embodiment, raster stage 310 includes a register pool supporting up to four rows of pixel packets. Some types of pixel packet attributes, such as texture, may require high precision. Conversely, some types of pixel packet attributes, such as color, may require less precision. The register pool can be arranged to support high-precision and low-precision values for each pixel packet in a row 510. In one embodiment, the register pool includes, per row, four high-precision and four low-precision perspective-corrected iterated values plus a Z depth value. This permits, for example, software to assign the precision of the iterators used to process particular pixel packet attributes.
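The start span mechanism described above, in which a downstream stage recovers each pixel's (x, y) location by counting pixels after a start span command, can be sketched as follows. This is a minimal model under stated assumptions: it assumes the scan line is walked one pixel at a time in the +x direction and that each stream item after a start span stands for exactly one pixel. The names `StartSpan` and `track_positions` are hypothetical.

```python
# Sketch of inferring (x, y) locations from a start span command.
# Names and the +x walking direction are assumptions for illustration.
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple, Union


@dataclass(frozen=True)
class StartSpan:
    x: int  # origin of the span on the scan line
    y: int


Pixel = str  # stand-in for a pixel's packets; carries no (x, y) of its own


def track_positions(
    stream: Iterable[Union[StartSpan, Pixel]],
) -> Iterator[Tuple[Tuple[int, int], Pixel]]:
    """Yield ((x, y), pixel) pairs using a local counter, as a cache might."""
    origin = None
    count = 0  # pixels seen since the last start span command
    for item in stream:
        if isinstance(item, StartSpan):
            origin = (item.x, item.y)
            count = 0
        else:
            if origin is None:
                raise ValueError("pixel received before any start span command")
            yield (origin[0] + count, origin[1]), item
            count += 1


stream = [StartSpan(10, 4), "p0", "p1", "p2", StartSpan(7, 5), "p3"]
located = list(track_positions(stream))
```

Because only the span origin is transmitted, the per-pixel packets stay smaller, which matches the bandwidth motivation given in the text.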
In one embodiment, the raster stage 3 ι includes a register set that is adapted to track the integer portion of the texture. Zone, thus allowing The decimal bit is transmitted as a data packet. The raster stage 310 can, for example, receive an instruction from the host that needs to perform an operation on the pixel. In response to the 'raster stage 31', a multi-pixel packet with an associated instruction sequence is generated. The columns, wherein the pixel packet columns and instructions are configured to perform the desired processing operations. As described in more detail below, in the embodiment, the ALU stage 34 is capable of performing scalar arithmetic operations: where the operands comprise pixel packets A pre-selected subset of pixel attributes, constant values, and previously calculated temporary storage results on the pixel packet in column 51. Various graphical algorithms can be expressed as one or more scalar arithmetic operations. 101822.doc 15- 1297468 Additionally, Various vector graphics operations can be expressed as a plurality of scalar arithmetic operations. Thus, it should be appreciated that the programmable graphics processor 205 of the present invention can be programmed to perform any graphics operation on a pixel, which can be represented as a pure Sequence of arithmetic operations, such as atomization operations, color (transparency) color mixing, texture combining, transparency testing, or deep Degree test, such as in 〇pen Gjl®

The operations described in Graphics System: A Specification (Version 1.2) are incorporated herein by reference. For example, in response to raster stage 310 detecting a desired graphics processing function to be performed on a pixel (e.g., a fog operation), raster stage 310 may use a programmable mapping table or mapping algorithm to determine the assignment of pixel packets and of the associated instructions for performing the scalar arithmetic operations needed to implement the graphics function on the pixel. The mapping may be programmed, for example, by graphics processor management software application 280.

Referring again to FIG. 3, as raster stage 310 walks each pixel of a triangle, it generates pixel packets for further processing, which are received by gateway manager stage 320. Gateway manager stage 320 performs data flow control functions. In one embodiment, gateway manager stage 320 has an associated scoreboard 325 for scheduling, load balancing, resource allocation, and hazard avoidance of pixel packets. Scoreboard 325 tracks the entry and retirement of pixels. Pixel packets entering gateway manager stage 320 set the scoreboard, and the scoreboard is reset as pixel packets drain out of programmable graphics processor 205 after completing processing. As an illustrative example, if compact display 295 has an area of 128x32 pixels, scoreboard 325 may maintain a table with an entry for each pixel of the display in order to monitor pixels.

Scoreboard 325 provides several benefits. For example, it prevents a hazard when a pixel of a triangle lies on top of another pixel that is being processed and is still in flight. In one embodiment, scoreboard 325 monitors idle conditions and uses the scoreboarding information to record idle unit time. For example, if there are no valid pixels, scoreboard 325 may shut down the ALUs to save power. As described below in more detail, scoreboard 325 tracks pixel packets that can be processed by ALUs 350 together with pixel packets that have the cancel bit set, so that the latter flow through ALUs 350 without being actively processed. In one embodiment, scoreboard 325 tracks the (x, y) position of recirculated pixel packets. If a pixel packet is recirculated, scoreboard 325 advances the instruction sequence in the pixel packet to the next instruction for the pixel on the subsequent pass; for example, if the instruction on pass number 1 is for a fog operation, the instruction is advanced to an alpha blending operation on pass number 2.

Data extraction stage 330 fetches data for the pixel packets passed on by gateway manager 320. This may include, for example, fetching color, depth, and texture data by performing the appropriate color, depth, or texture data reads for each row of a pixel packet. Data extraction stage 330 may, for example, fetch pixel or texel data by requesting a read from a memory interface (e.g., using DMA engine 230 to read a frame buffer (not shown)). In one embodiment, data extraction stage 330 may also manage local caches, such as texture/fog cache 332, color/depth cache 334, and a Z cache for depth data (not shown). The fetched data is placed in the corresponding pixel packet fields before the pixel packet is sent on to the next stage. In one embodiment, data extraction stage 330 includes an instruction random access memory (RAM) holding instructions for accessing the data required by the pixel packet attribute fields. In some embodiments, data extraction stage 330 also performs a Z depth test. In such an embodiment, data extraction stage 330 compares the Z depth value of the pixel packet with stored Z values using one or more depth comparison tests. If the Z depth value of the pixel indicates that the pixel is occluded, the cancel bit is set.

Rows of pixel packets enter arithmetic logic unit (ALU) stage 340 for processing. ALU stage 340 has a set of ALUs 350 including at least one ALU 350, such as ALUs 350-0, 350-1, 350-2, and 350-3. Although four ALUs 350 are illustrated, more or fewer ALUs 350 may be used in ALU stage 340 depending on the application. An individual ALU 350 reads the current instruction for at least one pixel packet row 510 and carries out any instruction for a scalar arithmetic operation that the ALU is programmed to support. Instructions are included in each ALU 350 and may, for example, be stored in a local instruction RAM (not shown in FIG. 3).

Each ALU 350 includes instructions for performing at least one arithmetic operation on a first operand product (a*b) and a second operand product (c*d), where a, b, c, and d are operands and * is multiplication. Some or all of the operands may correspond, for example, to register value attributes within pixel packet row 510. ALU 350 may also have one or more operand values that are constants or are loadable by software. In some embodiments, an ALU may support the use of temporarily stored results from previous operations on pixel packets.

In one embodiment, each ALU 350 is programmable. A crossbar (not shown) or other programmable selector may be included in ALU 350 to permit selection of the operands and of the destination of the result in response to instructions from software (e.g., software application 270).
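The scoreboard behavior described above (an entry is set when a pixel packet enters and cleared when it retires, so that a second packet for the same pixel is held back while the first is in flight) can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the hardware of the embodiment: the `Scoreboard` class, its method names, and the one-flag-per-pixel table are hypothetical choices made for the example.

```python
class Scoreboard:
    """Hypothetical per-pixel in-flight table (one flag per display pixel)."""

    def __init__(self, width, height):
        self.width = width
        self.in_flight = [False] * (width * height)

    def try_enter(self, x, y):
        """Admit a pixel packet unless the same (x, y) is already in flight."""
        idx = y * self.width + x
        if self.in_flight[idx]:
            return False              # hazard: hold the packet back
        self.in_flight[idx] = True    # entering packet sets the entry
        return True

    def retire(self, x, y):
        """Clear the entry when the packet drains out of the processor."""
        self.in_flight[y * self.width + x] = False

    def idle(self):
        """True when no valid pixels remain (ALUs could then be shut down)."""
        return not any(self.in_flight)


sb = Scoreboard(128, 32)        # display size used as the example in the text
first = sb.try_enter(5, 7)      # admitted
second = sb.try_enter(5, 7)     # same pixel still in flight: blocked
sb.retire(5, 7)
third = sb.try_enter(5, 7)      # admitted again after retirement
```

A real implementation would stall or queue the blocked packet rather than return a flag; the return value is used here only to keep the sketch observable.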
For example, in one embodiment, operation command codes may be used to select the source of each operand (a, b, c, d) from among the attributes of any register within pixel packet row 510, temporary values, and constant values. In this embodiment, the operation command also tells ALU 350 where to send the result of the arithmetic operation, such as using the result to update the pixel packet, storing the result as a temporary value, or both updating the pixel packet and storing the result as a temporary value. Thus, for example, an ALU can be programmed to read specific attributes within a pixel packet as operands and to apply the scalar arithmetic operation indicated by the current instruction. Operation command codes may also include commands to complement an operand (e.g., compute 1-x, where x is the value read), to negate an operand (e.g., compute -x, where x is the value read), or to clamp an operand or result. Other examples of operation command codes include commands to select a data format.

An example of an arithmetic operation performed by ALU 350 is a scalar arithmetic operation of the form (a*b)+(c*d) on at least one variable within a pixel packet, where a, b, c, and d are operands and the * operation is multiplication. Each ALU 350 can preferably also be programmed to perform other mathematical operations, such as complementing and negating operands. Additionally, in some embodiments, each ALU 350 can compute the minimum and maximum of (a*b, c*d) and perform logical comparisons (e.g., a logical result indicating whether a*b is equal to, not equal to, less than, or less than or equal to c*d).

In some embodiments, each ALU 350 may also include instructions for determining, based on a test, whether to generate a cancel bit in cancel field 414, such as a comparison of a*b with c*d (e.g., cancel if a*b is not equal to c*d, cancel if a*b is equal to c*d, cancel if a*b is less than c*d, or cancel if a*b is greater than or equal to c*d).
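The scalar operation (a*b)+(c*d), with its optional complement, negate, and clamp modifiers and the comparison-based cancel test, can be illustrated as follows. This is a sketch only: the function names `alu_op` and `cancel_test`, the keyword arguments, and the relation labels are assumptions introduced for the example and do not reflect the actual instruction encoding.

```python
def alu_op(a, b, c, d, complement=(), negate=(), clamp=None):
    """Compute (a*b)+(c*d) after optional per-operand modifiers.

    complement / negate: names of operands ('a'..'d') to replace with
    1 - x or -x respectively. clamp: optional (lo, hi) range for the result.
    """
    ops = {"a": a, "b": b, "c": c, "d": d}
    for name in complement:
        ops[name] = 1.0 - ops[name]
    for name in negate:
        ops[name] = -ops[name]
    result = ops["a"] * ops["b"] + ops["c"] * ops["d"]
    if clamp is not None:
        lo, hi = clamp
        result = max(lo, min(hi, result))
    return result


def cancel_test(a, b, c, d, relation):
    """Return True (set the cancel bit) when a*b <relation> c*d holds."""
    left, right = a * b, c * d
    return {
        "ne": left != right,
        "eq": left == right,
        "lt": left < right,
        "ge": left >= right,
    }[relation]


# Blend of two values: 0.5*alpha + 1.0*(1 - alpha), alpha = 0.25,
# obtained by complementing operand d.
blended = alu_op(0.5, 0.25, 1.0, 0.25, complement=("d",))
# Cancel when a*b < c*d, one of the comparisons listed in the text.
cancelled = cancel_test(0.25, 1.0, 0.5, 1.0, "lt")
```

Expressing a blend as one multiply-add with a complemented operand shows why a single (a*b)+(c*d) unit covers many common per-pixel operations.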
Examples of ALU operations that can generate a cancel bit include an alpha test, in which a color value is compared with a test color value, such as the expression IF (alpha > alpha reference) then cancel the pixel, where alpha is a color value and alpha reference is a reference color value. Another example of an ALU operation that can generate a cancel bit is a Z depth test, in which the Z value of a pixel is compared with at least one Z value of a previous pixel at the same position, and the pixel is cancelled if the depth test indicates that the pixel is occluded.

In one embodiment, if the cancel bit is set in a pixel packet, the individual ALU 350 is disabled with respect to processing that pixel packet. In one embodiment, a clock gating mechanism is used to disable ALU 350 when a cancel bit is detected in the sideband information. As a result, once a cancel bit has been generated for a pixel packet, ALUs 350 do not waste power on that pixel packet as it propagates through ALU stage 340. Note, however, that a pixel packet with the cancel bit set still propagates onward, allowing it to be accounted for by data write stage 355 and scoreboard 325. This permits all pixel packets to be accounted for by scoreboard 325, even those marked by a cancel bit as requiring no further ALU processing.

In one embodiment, if any row 510 of a pixel is marked by a cancel bit, the other rows 510 of the same pixel are cancelled as well. This may be accomplished, for example, by forwarding cancel information between stages or by having one or more stages keep track of pixels for which a row 510 has been marked by a cancel bit. In some embodiments, once a cancel bit is set, only the sideband information 410 of pixel packet row 510 (which includes the cancel bit) is propagated to the next stage.

The output of ALU stage 340 goes to data write stage 355.
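The cancel-bit handling described above can be modeled in software: a cancelled row skips ALU work but is still forwarded, and a cancel on one row of a pixel cancels that pixel's other rows. The sketch below is illustrative only; the `Row` type, its fields, and the `alu_stage` function are hypothetical names, and skipping the call stands in for clock gating.

```python
from dataclasses import dataclass


@dataclass
class Row:
    pixel_id: int          # which pixel this row belongs to (assumed field)
    value: float           # payload the ALU would operate on
    cancel: bool = False   # cancel bit carried as sideband information


def alu_stage(rows, work):
    """Apply `work` to each live row; cancelled rows pass through untouched."""
    cancelled_pixels = {r.pixel_id for r in rows if r.cancel}
    executed = 0
    for r in rows:
        if r.pixel_id in cancelled_pixels:
            r.cancel = True    # a cancel on one row cancels its sibling rows
            continue           # models the ALU being gated off for this row
        r.value = work(r.value)
        executed += 1
    return rows, executed      # every row is still forwarded downstream


rows = [Row(1, 0.5), Row(1, 0.25, cancel=True), Row(2, 0.5)]
out, executed = alu_stage(rows, lambda v: v * 2)
```

All three rows leave the stage, so a downstream scoreboard could still retire them, while only the row of the live pixel consumed any ALU work.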
Data write stage 355 converts processed pixel packets into pixel data and writes the results to the memory interface (e.g., via DMA engine 230). In one embodiment, write values for pixels are accumulated in write buffer 352, and the accumulated pixel writes are written to memory in batches. Examples of functions that data write stage 355 may perform include color and depth write-back and format conversion. In some embodiments, data write stage 355 may also identify pixels to be cancelled and set the cancel bit.

A recirculation path 360 is included to recirculate pixel packets back to gateway manager 320. Recirculation path 360 permits, for example, processing that requires more than one pass through ALU stage 340 to perform a sequence of arithmetic operations. Data write stage 355 indicates retired writes to gateway manager stage 320 for scoreboarding.

FIG. 6 is a block diagram of an exemplary individual ALU 350. ALU 350 has an input bus 605 with data buses for receiving pixel packet row 510 in corresponding registers R0, R1, R2, and R3. An instruction RAM 610 is included for ALU instructions. An exemplary instruction set is illustrated in block 620. In one embodiment, ALU 350 can be programmed to read any of the four 20-bit register values from row 510 and to select a set of operands from row 510. In addition, ALU 350 can be programmed to select as operands temporary values from registers (T) 630, such as two 20-bit temporary values per ALU 350, which may be stored temporarily from previous results, as indicated by path 640. ALU 350 may also select as operands constant values (not shown), which may likewise be programmed by software. In one embodiment, a first multiplexer (MUX) stage 645 selects operands from the pixel packet row, any temporary values 630, and any constant values (not shown).
A format conversion module 650 may be included to convert the operands into a desired data format suited to the computational precision of arithmetic computation unit 670 in ALU 350. ALU 350 includes elements that permit each operand, or its complement, to be selected in a second MUX stage 660. The resulting four operands are input to scalar arithmetic computation unit 670, which can perform two multiplications and one addition. The resulting value may optionally be clamped to a desired range (e.g., 0 to 1.0) using clamper 680. Pixel packet row 510 exits on bus 690.

In one embodiment, selected pixel packet attributes may be in a signed 1.8 (S1.8) format. The S1.8 format is a base 2 number with an 8-bit fraction, in the range of -2 to +2. The S1.8 format permits a higher dynamic range for calculations. For example, in lighting calculations, the S1.8 format allows an increased dynamic range, resulting in improved realism. If the result of a scalar arithmetic operation performed in S1.8 must lie in the range [0, 1], the result can be clamped to force it into that range. As an illustrative example, a calculation on color data can be performed in S1.8 format and the result then clamped. Note that in embodiments of the present invention, different types of pixel packets may have data attributes represented in different formats. For example, color data may be represented in a first type of pixel packet in S1.8 format, while (s, t) texture data may be represented in a second type of pixel packet in a higher-precision format. In some embodiments, the pixel packet bit size is set by the bit size requirement of the highest-precision pixel attribute. For example, since texture attributes typically require greater precision than color, the pixel packet size may be chosen to represent texture data with high precision, such as 16-bit texture data.
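To make the S1.8 discussion concrete, the sketch below quantizes values to a 10-bit fixed-point code with 8 fraction bits and clamps a result to [0, 1]. It assumes a two's-complement encoding, which the text does not specify; under that assumption the representable values run from -2 up to 2 - 1/256. The names `to_s18`, `from_s18`, and `clamp01` are hypothetical.

```python
FRAC_BITS = 8
TOTAL_BITS = 10                          # sign + 1 integer bit + 8 fraction bits
S18_MIN = -(1 << (TOTAL_BITS - 1))       # raw -512 represents -2.0
S18_MAX = (1 << (TOTAL_BITS - 1)) - 1    # raw 511 represents 2 - 1/256


def to_s18(x):
    """Quantize a real value to a 10-bit two's-complement S1.8 code (assumed encoding)."""
    raw = round(x * (1 << FRAC_BITS))
    raw = max(S18_MIN, min(S18_MAX, raw))    # saturate at the format limits
    return raw & ((1 << TOTAL_BITS) - 1)


def from_s18(code):
    """Decode a 10-bit S1.8 code back to a float."""
    if code & (1 << (TOTAL_BITS - 1)):       # sign bit set: negative value
        code -= 1 << TOTAL_BITS
    return code / (1 << FRAC_BITS)


def clamp01(x):
    """Force a result into [0, 1], as a clamper might after the arithmetic."""
    return max(0.0, min(1.0, x))


# A lighting-style product may exceed 1.0 in S1.8 before being clamped.
lit = from_s18(to_s18(0.75)) * from_s18(to_s18(1.5))
final = clamp01(lit)
```

The intermediate value above 1.0 is the extra dynamic range the text refers to; a format limited to [0, 1] would have had to saturate the 1.5 factor before the multiply.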
The improved dynamic range of the S1.8 format permits, for example, data for more than one color component to be packed efficiently into a 20-bit pixel packet size, a size selected for the higher-precision texture data, which requires, for example, 16 bits of texture data and 4 bits of level of detail (LOD). For example, since each S1.8 color component requires 10 bits, two color components can be packed into a 20-bit pixel packet.

FIG. 7 illustrates an exemplary ALU stage 340 that includes more than one ALU 350 arranged as a pipeline in which two or more ALUs 350 are chained together. As previously described, an individual ALU 350 can be programmed to read one or more operands from a pixel packet, generate the result of an arithmetic operation, and use the result to update the pixel packet or a temporary register. Each ALU can be assigned to read operands, generate an arithmetic result, and update one or more pixel packet or temporary values before passing the pixel packet row on to the next ALU.

The data flow between the ALUs 350 in ALU stage 340 can be configured in a variety of ways depending on the processing operations to be performed, ALU latency, and efficiency considerations. As previously described, the present invention permits each ALU to be programmed to read selected operands within a pixel packet row and to use the result to update selected pixel packet registers. In one embodiment, ALU stage 340 includes at least one ALU 350 for each color channel (e.g., red, green, blue, and alpha). This permits, for example, load balancing, in which the ALUs are configured to operate on pixel packet rows 510 in parallel (albeit at different points in time because of pipelining) to perform similar or different processing tasks.
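The packing arithmetic mentioned above (two 10-bit color components, or 16 bits of texture data plus 4 bits of LOD, in one 20-bit field) can be checked with a short sketch. The bit ordering and the function names `pack_two`, `unpack_two`, and `pack_texture` are assumptions for illustration; the text does not state how the fields are laid out.

```python
MASK10 = (1 << 10) - 1   # one 10-bit S1.8 component


def pack_two(c0, c1):
    """Pack two 10-bit component codes into 20 bits (c1 in the high half, assumed)."""
    return ((c1 & MASK10) << 10) | (c0 & MASK10)


def unpack_two(word):
    """Recover the two 10-bit component codes from a 20-bit field."""
    return word & MASK10, (word >> 10) & MASK10


def pack_texture(coord16, lod4):
    """Pack 16 bits of texture data and a 4-bit LOD into 20 bits (layout assumed)."""
    return ((lod4 & 0xF) << 16) | (coord16 & 0xFFFF)


colors = pack_two(192, 384)            # e.g. S1.8 codes for 0.75 and 1.5
texture = pack_texture(0xFFFF, 0xF)    # largest texture value and LOD
```

Both layouts fill exactly 20 bits, which is why one register width can serve either type of pixel packet.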
As an example of how ALUs 350 may be programmed, a first ALU 350-0 may be programmed to perform the calculation for a first color component, a second ALU 350-1 to perform the operation for a second color component, a third ALU 350-2 to perform the operation for a third color component, and a fourth ALU 350-3 to perform a fog operation. Thus, in some embodiments, each ALU 350 may be assigned a different processing task for a pixel packet row 510. Additionally, as described below in more detail, in some embodiments software can configure ALUs 350 to select the data flow among the ALUs 350 within ALU stage 340, including the execution order of the ALUs 350. Because the data flow is configurable, it will be understood that in some embodiments the data flow along a chain of ALUs may be arranged so that the result of one ALU 350-0 updates one or more values that are read as operands by the subsequent ALU 350-1.

FIG. 8 is a block diagram of an embodiment of a portion of programmable graphics processor 205 having a reconfigurable pipeline, in which the processing flow of pixel packets is configurable in response to software commands, so that the processing flow of pixel packets through the stages can be reconfigured. Synchronization techniques are therefore preferably used to coordinate the data flow of pixel packets that are in flight during a change from one configuration to another; that is, synchronization is performed so that in-flight pixel packets intended to be processed in a first configuration complete their processing before the configuration is changed to a second configuration.

In one embodiment, data extraction stage 830, data write stage 855, and the individual ALUs 850 each have a respective input connected to a first distributor 890 and a respective output connected to a second distributor 895. Each distributor 890 and 895 may comprise, for example, a switch, crossbar, router, or MUX circuit to select the distribution flow of incoming pixel packets to data extraction stage 830, ALUs 850, and data write stage 855.
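The synchronization requirement described above, that in-flight packets finish under the old configuration before a new stage order takes effect, can be sketched with a toy software model. Everything here is hypothetical: the `Pipeline` class, its methods, and the stage functions are stand-ins chosen for the example and do not model the distributors or their signal inputs.

```python
from collections import deque


class Pipeline:
    """Toy model: an ordered list of stage functions chosen by software."""

    def __init__(self, stages):
        self.stages = list(stages)
        self.in_flight = deque()

    def submit(self, packet):
        self.in_flight.append(packet)

    def drain(self):
        """Run every in-flight packet through the current configuration."""
        done = []
        while self.in_flight:
            packet = self.in_flight.popleft()
            for stage in self.stages:
                packet = stage(packet)
            done.append(packet)
        return done

    def reconfigure(self, stages):
        """Finish in-flight work under the old order, then switch orders."""
        done = self.drain()
        self.stages = list(stages)
        return done


# Each stand-in stage records its name so the order taken is visible.
fetch = lambda p: p + ["fetch"]
alu0 = lambda p: p + ["alu0"]
write = lambda p: p + ["write"]

pipe = Pipeline([fetch, alu0, write])
pipe.submit([])
old = pipe.reconfigure([alu0, fetch, write])   # a different stage order
pipe.submit([])
new = pipe.drain()
```

The packet submitted before the change still sees the first ordering, and only packets submitted afterwards see the second.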
Distributors 890 and 895 determine the data path of incoming pixel packets 810 through data extraction stage 830, data write stage 855, and the individual ALUs 850. Signal inputs 892 and 894 allow distributors 890 and 895 to receive software commands (e.g., from a software application executing on a CPU) to reconfigure the distribution of pixel packets among data extraction stage 830, data write stage 855, and ALUs 850. One example of reconfiguration is assigning the execution order of ALUs 850. Another example is bypassing data extraction stage 830 if it is determined that the data extraction stage is not needed for a particular processing task. As yet another example, it may be desirable to change the order in which data extraction stage 830 is coupled to the ALUs. As another example, it may be desirable to reorder data write stage 855. As an illustrative example, there may be cases in which it is more efficient to operate on texture coordinates before data extraction, in which case the data flow is arranged so that data extraction stage 830 receives pixel packets after ALUs 850 have performed the texture operations. One benefit of a reconfigurable pipeline is therefore that a software application can reconfigure programmable graphics processor 205 to increase efficiency.

Referring again to FIG. 5, as previously discussed, raster stage 310 generates rows 510 of pixel packets for processing. The rows 510 may be further arranged into a group 520 of rows, such as a sequence of four rows 510, that is passed on for processing in successive clock cycles. However, some operations that may be performed on a pixel packet row 510 may require the result of an arithmetic operation on another pixel packet row. Consequently, in one embodiment, raster stage 310 arranges the pixel packets in a group 520 of rows so as to account for data dependencies. As an illustrative example, if a texture operation on one pixel packet requires the result of another pixel packet in the group, group 520 is arranged so that the pixel packet with the dependent texture operation is placed in a later row.

Referring to FIG. 9, in one embodiment, pixels are alternately assigned as odd or even by raster stage 310. The corresponding registers (R0, R1, R2, and R3) of each pixel's rows are assigned as even or odd accordingly. One or more rules are then used to interleave the rows of even-pixel packets with the rows of odd-pixel packets so as to avoid data dependencies.
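The effect of interleaving even-pixel and odd-pixel rows can be seen by comparing issue schedules. The sketch below assumes one row is issued per clock cycle and uses hypothetical row labels and helper names (`interleave`, `issue_gap`); it shows only how alternating rows widens the gap between two rows of the same pixel.

```python
from itertools import chain, zip_longest


def interleave(even_rows, odd_rows):
    """Alternate rows of an even pixel with rows of an odd pixel."""
    pairs = zip_longest(even_rows, odd_rows)
    return [row for row in chain.from_iterable(pairs) if row is not None]


def issue_gap(schedule, producer, consumer):
    """Clock cycles between issuing two rows, assuming one row per cycle."""
    return schedule.index(consumer) - schedule.index(producer)


even = ["E0", "E1", "E2", "E3"]    # rows 0..3 of an even pixel
odd = ["O0", "O1", "O2", "O3"]     # rows 0..3 of an odd pixel
schedule = interleave(even, odd)

gap_plain = issue_gap(even, "E0", "E1")            # without interleaving
gap_interleaved = issue_gap(schedule, "E0", "E1")  # with interleaving
```

With the odd pixel's row slotted in between, a row that depends on the previous row of the same pixel is issued one cycle later than it otherwise would be.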
Interleaving every other row provides additional clock cycles to account for ALU latency. Thus, if row 0 of an even pixel requires two clock cycles before the required result is available, the interleaved row 0 of the odd pixel supplies the additional clock cycle of time needed because of the ALU latency. As an illustrative example, consider a multi-texture operation in which row 0 of an even pixel is a blend of a first texture and row 1 of the same pixel corresponds to a blend with a second texture that requires the result of the first blending operation. If the ALU latency of the first operation is two clock cycles, the interleaving allows the result of the first blending operation to be available for the blend with the second texture.

In an interleaved embodiment, sideband information is preferably included to coordinate the interleaved data flow. For example, in one embodiment, the sideband information in each pixel packet includes an even/odd field to distinguish even rows from odd rows. Each ALU 350 may also include two sets of temporary registers, corresponding to even pixels and odd pixels, to supply the appropriate temporary values for even and odd pixel packets. The even/odd field is used to select the appropriate set of temporary registers, e.g., the even set of temporary registers for an even pixel and the odd set for an odd pixel. In one embodiment, even and odd pixels share the constant registers, reducing the total amount of storage required for constant values. In one embodiment, a software host may set temporary registers with constant values for an extended period of time in order to emulate constant registers. Although interleaving two pixels is one embodiment, it will be understood that the interleaving may be extended to more than two pixels if, for example, the ALU latency corresponds to more than two clock cycles.

One benefit of raster stage 310 interleaving pixel packets is that the hardware takes ALU latency into account, reducing the burden on software to account for ALU latency that would otherwise arise if, for example, raster stage 310 did not interleave pixels.

As previously discussed, in a configurable pipeline the data flow among ALUs 350 can be configured. For example, in hardware, each ALU 350 may be substantially identical. However, a particular ALU may be configured to have more than one possible position in the data flow, e.g., a different order of execution.
Consequently, an identifier needs to be provided in each ALU 350 to indicate its position within the data flow. The identifier could, for example, be provided by a direct register write to each ALU 350. However, this approach has the drawback of requiring significant software overhead. Therefore, in one embodiment, a packet technique is used to trigger the elements that require configuration information to discover their relative position within the processing flow and to write a corresponding identifier into a local register.

Referring to FIG. 10, in one embodiment, the register address space of ALUs 350 is software configurable using a packet initialization technique that communicates an identification (ID) to each ALU 350 by means of data packets. Each ALU 350 may, for example, include a conventional network module for receiving and forwarding data packets. In one embodiment, an ID packet 1010 is initiated by a software application. ID packet 1010 contains an initial ID code, such as a number. ID packet 1010 is injected into the graphics pipeline at a point before the elements that require an ID code and is then passed on to the subsequent elements of the processing flow defined by the current pipeline configuration. In one embodiment, configuration register 1020 in the first ALU 350 receives the ID packet and the current value of the ID code is written into that configuration register; the ALU then increments the ID code of the ID packet before passing the ID packet on to the next ALU. This process continues, with each subsequent ALU 350 writing the current value of the ID code into its configuration register and then passing the ID packet with an incremented ID code to the next ALU. It will be understood that other stages along the data flow path may have their configuration registers set in a similar manner.
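The ID packet scheme described above amounts to a single pass along the configured data flow in which each element latches the current ID and forwards an incremented one. A minimal sketch follows; the `Element` class, `receive_id_packet`, and `assign_ids` are hypothetical names, and the ID packet is reduced to a bare integer.

```python
class Element:
    """Pipeline element with a configuration register holding its position ID."""

    def __init__(self, name):
        self.name = name
        self.config_id = None

    def receive_id_packet(self, id_code):
        self.config_id = id_code    # write the current ID code locally
        return id_code + 1          # forward the packet with an incremented ID


def assign_ids(elements, initial_id=0):
    """Pass one ID packet through the elements in their current flow order."""
    id_code = initial_id
    for element in elements:
        id_code = element.receive_id_packet(id_code)
    return id_code


alus = [Element(f"alu{i}") for i in range(4)]
assign_ids(alus)
ids_before = [e.config_id for e in alus]

# After software reorders the data flow, one new ID packet renumbers everything.
reordered = [alus[2], alus[0], alus[3], alus[1]]
assign_ids(reordered)
ids_after = {e.name: e.config_id for e in alus}
```

Because each element runs the same logic, identical hardware units end up with distinct position IDs purely as a result of the order in which the packet reaches them.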
For example, the elements in the configuration flow may also include a data extraction stage or a data write stage, each of which likewise has a configuration register that is set by reading the ID packet and each of which increments the ID code before passing the ID packet with the incremented ID on to the next element in the configuration flow. One benefit of this form of register configuration is that it requires no hardware differences between ALU 350 units, thereby allowing software reconfiguration of the data flow through the pipeline. Thus, for example, in one embodiment, graphics processor management application 280 need only generate the initial ID packet 1010, such as by issuing, via host interface 220, a command to generate ID packet 1010, which command is received by ID packet generator 1030.

In an alternate embodiment, a broadcast packet technique is used to write ID codes into the configuration registers, triggering the elements whose configuration registers need to be written to discover their IDs. In this embodiment, the elements (e.g., ALUs 350) may use a network protocol to discover their IDs. The broadcast packet technique may be used, for example, in embodiments in which the pipeline is branched to allow branches of the pipeline to process pixels in parallel.

FIG. 11 illustrates an embodiment that includes diagnostic monitoring capability. In one embodiment, there is a series of taps along the elements of graphics processor 205, such as taps associated with each ALU 350 and with data extraction stage 330. Taps may also be included at other stages. A configurable test point selector 1105 is adapted to permit monitoring of selected taps, such as two taps 1120 and 1130, in response to software commands, such as commands from graphics processor management application 280. Configurable test point selector 1105 may be implemented, for example, using multiplexers.
In one embodiment, at least one counter 1110 is included for collecting statistics for each selected test point. In one embodiment, an instrumentation packet generated by software provides information about the taps to be monitored and enables counting for the selected test points. In addition, instrumentation registers can be provided so that statistics collection is enabled based on the pipeline operating mode (for example, an instrumentation register can be provided to allow software to enable statistical counting for a specific type of graphics operation, such as when a transparency blending operation occurs). One benefit of the configurable test point selector 1105 is that it allows software, such as the graphics processor management application 280, to have statistics collected only for the test points of interest, thereby reducing hardware complexity and cost while still allowing the software to analyze the behavior of the portions of the programmable graphics processor 205 that are of interest. The test points of interest can, for example, be selected to collect statistics associated with ALUs 350 that process a particular type of data, such as ALUs 350 that process texture data. Additionally, statistics collection can be enabled for specific graphics operations, such as transparency blending.

In one embodiment, the configurable test point selector 1105 uses a three-wire protocol. Each element that has payload data, such as ALU 350-0, generates a valid signal that can, for example, flow down to the next element (e.g., ALU 350-1). An element that is ready to receive a payload generates a ready signal that can, for example, flow up to the previous element. However, if the element is not ready to receive a payload, it generates a not ready signal. The not ready signal may, for example, correspond to a deasserted ready signal.
The enable signal corresponds to an element being enabled for monitoring; this monitoring is enabled under software control via a monitoring enable control bit stored by a pipelined register write adjacent to the monitored point. A tap can directly tap the signal at either the element that generates the signal or the element that receives it. The valid, ready, and not ready signals at the selected tap points can be used to determine different transfer states. A transfer state corresponds to a clock tick having a valid payload (i.e., a valid byte for downstream data) and a ready signal from the downstream block indicating that it can receive the data (e.g., a valid signal from ALU-0 at tap point 1120 and a ready signal from ALU-1 at tap point 1130). A wait state corresponds to a clock tick having a valid payload that is blocked because the following block is not ready to receive data (e.g., a valid signal from ALU-0 at tap point 1120 and a not ready signal from ALU-1 at tap point 1130). In this embodiment, statistics on the selected tap points can be collected, such as by counting the number of clock cycles in which a transfer state or a wait state is detected.

Embodiments of the present invention provide various benefits that are useful in an embedded graphics processor core 250. In a compact, low-power handheld system 290, power, space, and CPU capability can be fairly limited. In one embodiment, the ALUs 350 are clock gated (e.g., in response to detecting cancel bits) when processing is not required, thereby reducing processing power requirements. In addition, the raster stage 310 only needs to generate pixel packets for the subset of pixel data being processed, which also reduces power requirements. The programmable ALU stage 340 requires less chip area than a conventional pipeline having dedicated stages for performing dedicated graphics functions, thereby reducing cost.
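The transfer and wait states defined above for the monitored tap points lend themselves to a short sketch. The following Python fragment is only an illustrative software model under stated assumptions: the function names, the `enabled` flag (modelling the monitoring enable control bit), and the extra "idle" label for clock ticks with no valid payload are introduced here and do not come from the description.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify_tick(valid: bool, ready: bool) -> str:
    """Classify one clock tick from the upstream valid and downstream ready taps."""
    if valid and ready:
        return "transfer"  # valid payload and the downstream block accepts it
    if valid:
        return "wait"      # valid payload blocked: downstream block is not ready
    return "idle"          # assumed label: no valid payload on this tick


def collect_statistics(ticks: Iterable[Tuple[bool, bool]],
                       enabled: bool = True) -> Counter:
    """Count tick states for one selected pair of tap points."""
    counts: Counter = Counter()
    if not enabled:        # monitoring enable bit cleared: nothing is counted
        return counts
    for valid, ready in ticks:
        counts[classify_tick(valid, ready)] += 1
    return counts


# (valid from the upstream element, ready from the downstream element) per tick
samples = [(True, True), (True, False), (False, True), (True, False)]
counts = collect_statistics(samples)
```

A high wait count relative to the transfer count at a given pair of taps would suggest that the downstream element is the one stalling the flow at that point.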
The programmable graphics processor 205 can be implemented to be software configurable in order to provide improved efficiency. Test monitoring can be configured to test a subset of the test points, thereby reducing the bandwidth and analysis requirements on the software. These and other previously described features make the graphics processor 205 of interest for use in an embedded graphics processor core.

The foregoing description, for purposes of explanation, has been given to provide a thorough understanding of the invention. However, it will be apparent to those skilled in the art that the specific details are not required in order to practice the invention. Accordingly, the above description of specific embodiments of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Obviously, many modifications and variations are possible in light of the above teachings. The embodiments were chosen and described in order to best explain the invention, with modifications suited to the particular use contemplated. The following claims and their equivalents are intended to define the scope of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a diagram of a prior art pipeline for three-dimensional graphics;
FIG. 2 is a block diagram of an integrated circuit including a programmable graphics processor according to an embodiment of the present invention;
FIG. 3 is a block diagram of a programmable graphics processor according to an embodiment of the present invention;
FIG. 4 illustrates an exemplary pixel packet according to an embodiment of the present invention;
FIG. 5 illustrates an exemplary arrangement of pixel packets into a group of pixel packet rows according to an embodiment of the present invention;
FIG. 6 is a block diagram of a single arithmetic logic unit according to an embodiment of the present invention;
FIG. 7 is a block diagram of a sequence of two arithmetic logic units according to an embodiment of the present invention;

FIG. 8 is a block diagram of a configurable programmable graphics processor according to an embodiment of the present invention;
FIG. 9 illustrates interleaving of rows of pixel packets according to an embodiment of the present invention;
FIG. 10 is a block diagram illustrating arithmetic logic units having configuration registers according to an embodiment of the present invention; and
FIG. 11 is a block diagram illustrating a configurable test point selector according to an embodiment of the present invention.
Like reference numerals refer to corresponding parts throughout the several views of the drawings.

[Description of main reference numerals]
105 Transform stage
110 Setup/raster stage
115 Texture address stage
120 Texture extraction stage
130 Fog stage
135 Transparency test stage
140 Depth test stage
145 Transparency blending stage
150 Memory write stage
200 Integrated circuit
205 Graphics processor
210 Register interface

220 Host interface
230 Direct memory access (DMA) engine
250 Graphics processing core
260 Central processing unit
270 Software application
275 Graphics application
280 Graphics processor management software application
290 System
295 Display
305 Setup stage
308 Vertex buffer
310 Raster stage
320 Gateway manager stage
325 Scoreboard
330 Data extraction stage
331 Texture/fog cache
334 Color/depth cache
340 Arithmetic logic unit (ALU) stage
350 Arithmetic logic unit (ALU)
350-0, 350-1, 350-2, 350-3 ALU
355 Data write stage
360 Recirculation path
410 Sideband information

412 Valid field
414 Cancel field
416 Instruction field
420 Payload information
422-424 First set of (s, t) texture coordinates
426 Fog field
430, 460 Pixel packet
462 Color field
464-466 Second set of texture coordinates (s, t)
510 Row of pixel packets
520 Group
605 Input bus
610 Instruction RAM
620 Block
630 Register (T)
640 Path
645 First multiplexer (MUX) stage
650 Format conversion module
660 Second MUX stage
670 Arithmetic computation unit
680 Clamper
690 Bus
810 Pixel packet
830 Data extraction stage
850 ALU
855 Data write stage
890, 895 Distributor
892 Signal input
905 Even rows of pixel packets
910 Odd rows
1010 ID packet
1020 Configuration register
1030 ID packet generator
1105 Configurable test point selector
1110 Counter
1120, 1130 Tap/tap point
R0, R1, R2, R3 Pixel packets


Claims (1)

\6S 申請專利範圍: 1. 一種圖形處理器,其包含·· 一光柵階段,其接收關於待光柵化之圖原之資料,該 t拇階段產生用於待處理之每—像素之複數個像素封 二,母—像素封包包括識別待處理之至少—像素屬性的 有效負載資訊’並具有_—待在該像㈣包上執行之 含至少-指令的序列所相關聯之旁頻帶資訊;及 -可程式之算術邏輯單元(ALU)階段,其用於處理該等 像素封包,該ALW段包括至少一 ALU,每一 alu經程式 化以:有-組至少一可能之純量算術運算,該組純量算 術運算係在一具有一對應之當前指令之傳入像素封包上 來執行; 其中在該複數個像素封包上執行一算術運算序列以執 行一圖形處理功能。 如請求項1之圖形處理器,其中該旁頻帶資訊包括一取消 欄位,且每一 ALU被時脈閘控以回應在該取消攔位中偵 測到一取消位元,以降低該複數個ALu之功率消耗。 如請求項2之圖形處理器,其中至少一 ALU回應偵測到對 於、、'屯里運异元之一邏輯比較的一真值而設定該取消位 元0 如請求項1之圖形處理器,其中該算術運算之形式為a*b + c*d,其中a、b、e及d為運算元且*為一乘法運算。 如請求項1之圖形處理器,其中該ALU階段經調適成執行 一網底功能。 101822.doc 1297468 6·如請求項1之圖形處理器,其中該ALU階段執行一霧化運 异、紋理映射、透明度混色、z測試或一透明度測試中之 至少一者。\6S Patent Application Range: 1. A graphics processor comprising: a raster stage that receives data about a picture to be rasterized, the t phase of the t generating a plurality of pixels for each pixel to be processed The second parent-pixel packet includes payload information identifying at least the pixel attribute to be processed and having the sideband information associated with the sequence containing at least the instruction to be executed on the image (four) packet; A programmable arithmetic logic unit (ALU) stage for processing the pixel packets, the ALW segment comprising at least one ALU, each alu being programmed to: have-group at least one possible scalar arithmetic operation, the group The scalar arithmetic operation is performed on an incoming pixel packet having a corresponding current instruction; wherein an arithmetic operation sequence is performed on the plurality of pixel packets to perform a graphics processing function. The graphics processor of claim 1, wherein the sideband information includes a cancel field, and each ALU is gated to respond to detecting a cancel bit in the cancel block to reduce the plurality of bits. ALu's power consumption. 
The graphics processor of claim 2, wherein the at least one ALU responds to the detection of a true value of one of the logical comparisons of the 异, and sets the cancellation bit 0 such as the graphics processor of the request item 1, The form of the arithmetic operation is a*b + c*d, where a, b, e, and d are operands and * is a multiplication operation. The graphics processor of claim 1, wherein the ALU stage is adapted to perform a network function. The image processor of claim 1, wherein the ALU stage performs at least one of an atomization, texture mapping, transparency blending, z-test, or a transparency test. 如請求項1之圖形處理器,其中該光栅階段對每一該像素 產生至少一像素封包列,每一列具有在一特定時脈週期 中被發送至該ALU階段之複數個像素封包。 如明求項1之圖形處理器,其進一步包含一閘道管理器階 段,該閘道管理器階段具有一用於追蹤自該光柵階段所 接收之像素封包的記分板。 9·如凊求項8之圖形處理器,其進一步包含: 一資料提取階段,其用以提取該等像素封包之資料; 一資料寫入階段,其用以執行自該ALU階段所接收之經 處理之像素封包的像素資料之一記憶體寫入。 10·如請求項9之圖形處理器,其進一步包含: 一自該資料寫入階段至該閘道管理器階段的再循環路 咎其用於為額外通過該ALU階段一次而再循環像素封 包。 11 ·如明求項1〇之圖形處理器,其中關於在該alu階段中被 標記為取消之像素封包的資訊被提供至該記分板以解決 取消之封包像素。 12· 一種圖形處理器,其包含: 至少—階段,其用以轉換並設定待光栅化之圖原的頂 黑占, 一 廳段’其接收關於待光柵化之圖原之資料,該 101822.doc 1297468 先栅k #又為待使用一圖形 /1、一# m ^ π及處理之每一像素產生至 ^像素封包列,該圖形運算 算序列· π j被表達為一純量算術運 一閘道管理器,其包括一 ’、 用以追蹤像素封包之該處理 之计分板; -資料提取階段’其用以提取每一像素封包列之資料; 二階段’其包含複數個可程式之算術邏輯單元 (ALU)以用於處理每一兮後各土^ a μ像素封包列,每一 ALU接收一輸 入像素封包龍輸出-輸线素封包halu自一 所㈣之像素封包列讀取至少—運算元,❹該至少一 异70來執行一純量算術運算,產生-結果,並執行將 吞亥結果寫入至一臨昧伯wm 夺值及使用該結果來更新該輸出列之 -像素屬性暫存器中之至少一者;及 負料寫入p白#又’其用以執行自該複數個从輯接收之 經處理之像素封包的像素資料之一記憶體寫入; —其中在該複數個像素封包上執行—算術運算序列以執 行該圖形處理功能。 13.如請求項12之圖形處理器,其中該複數個ALU係以-管 線來配置。 14·如睛求項12之圖形處理器,其中每一像素封包包括識別 待處理之至少一像素屬性的有效負載資訊,且每一列具 有識別待在該列之每一該像素封包上執行之至少一指令 的相關聯之旁頻帶資訊。 15·如明求項12之圖形處理器,其進一步包括一再循環路 101822.doc 1297468 錢路徑用以自該資料寫人階段至該閘道管理 » “又再循環像素封包,藉以可使用通過該則階段一 二人以上來處理像素封包。 16.ΓΓ項12之圖形處理器,其中每-則經調適成允許 人體主機選擇待選自該列之運算元以及—常數值與一 臨時值中之至少一者。 、 、月求貝12之圖形處理器,其中至少—階段經組態以痛 測取4條件並將像素封包標記為被取消,其中一 AwA graphics processor as claimed in claim 1, wherein the raster stage produces at least one pixel packet column for each of the pixels, each column having a plurality of pixel packets 
transmitted to the ALU phase in a particular clock cycle. The graphics processor of claim 1, further comprising a gateway manager stage having a scoreboard for tracking pixel packets received from the raster phase. 9. The graphics processor of claim 8, further comprising: a data extraction phase for extracting data of the pixel packets; and a data writing phase for performing the process received from the ALU phase One of the pixel data of the processed pixel packet is written to the memory. 10. The graphics processor of claim 9, further comprising: a recirculation path from the data write phase to the gateway manager phase for recycling pixel packets for additional pass through the ALU phase. 11. A graphics processor as claimed in claim 1, wherein information about the pixel packets marked as cancelled in the alu phase is provided to the scoreboard to resolve the cancelled packet pixels. 12. A graphics processor, comprising: at least a stage for converting and setting a top black share of a picture to be rasterized, and a field segment 'receiving information about a picture to be rasterized, the 101822. Doc 1297468 The first gate k# is generated for each pixel of the pattern to be used by a pattern /1, a #m^π, and to the pixel packet column, and the graph operation sequence π j is expressed as a scalar arithmetic one. 
a gateway manager, comprising: a scoreboard for tracking the processing of the pixel packet; - a data extraction phase 'to extract data for each pixel packet column; a second phase' comprising a plurality of programmable An Arithmetic Logic Unit (ALU) is used to process each ^ ^ ^ ^ ^ ^ ^ ^ ^ , , , , , , 每一 每一 每一 每一 每一 每一 每一 每一 a a a a a a a a a a a a a a a a a a a a a a a a An operand, 至少 至少 至少 来 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行 执行In the attribute register One of the lesser; and the negative material is written to the white white and the memory is used to perform one of the pixel data of the processed pixel packet received from the plurality of copies; wherein the plurality of pixel packets are on the plurality of pixel packets Execution—A sequence of arithmetic operations to perform the graphics processing function. 13. The graphics processor of claim 12, wherein the plurality of ALUs are configured with a - pipe. 14. The graphics processor of claim 12, wherein each of the pixel packets includes payload information identifying at least one pixel attribute to be processed, and each column has at least one of identifying to be performed on each of the pixel packets of the column The associated sideband information of an instruction. 15. The graphics processor of claim 12, further comprising a recirculation path 101822.doc 1297468, the money path for retrieving the pixel packet from the data writer phase to the gateway management, whereby the Then one or two people or more are used to process the pixel packet. 16. The graphics processor of item 12, wherein each is adapted to allow the human body host to select an operation element to be selected from the column and - a constant value and a temporary value At least one, . 
, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 中之、、、屯里算術處理被停用以回應㈣到—像素封包被標 記為被取消。 18·如#求項12之圖形處理器,其中該純量算術運算之形式 為^ c d,其中a、b、c&amp;d為運算元且*為一乘法運算。 19·如請求項12之圖形處理器,其中該alu階段經調適成執 行一網底功能。 20·如請求項12之圖形處理器,其中該ALU階段執行一霧化 運异、紋理映射、透明度混色、:z測試或一透明度測試中 之至少一者。 21.如清求項12之圖形處理器,其中該光柵階段及該複數個 ALU經調適成將一向量算術運算轉換為一純量算術運算 序列。 22. —種圖形系統,其包含: 一中央處理器’其具有一圖形軟體模組; 一可程式之圖形處理器,其自該圖形軟體模組接收頂 點資訊及用於程式化該可程式之圖形處理器之階段的程 101822.doc 1297468 式化指令’該可程式之圖形處理器包含: 尤樹丨自丨又,其為待處理之每 1豕I產生複數個像 素封〇以回應來自該圖形軟體模組之指令,每一像 封包包括識別待處理之至少―像素屬性的有效:載:: 訊,並具有識別待在每—該像素封包上執行之至少: 指令的相關聯之旁頻帶資訊;及 乂 -可程式之算術邏輯單元(ALU)階段,其包括經組態 用於處理該等像素封包之複數個ALU,每一 alu係由嗜 圖形軟體模組所指派以自所接收之像素封包讀取所選 運算元,執行-純量算術運算以回應—當前指令以產 生-結果,並執行使用該結果來更新一像素屬性暫存 器及將该結果储存為一臨時值中之至少一者· 其中在該複數個像素封包上執行一純量算術運算序 列以在每一該像素上執行一圖形處理功能。 23·如請求項22之圖形系統,其中該旁頻帶資訊包括一取消 欄位,且每一 ALU被時脈閘控以回應在該取消欄位中偵 測到一取消位元,以降低該複數個ALU之功率消耗。 24·如請求項22之圖形系統,其中該算術運算之形式為a*b + c*d,其中a、b、c及d為運算元且*為一乘法運算。 25·如請求項22之圖形系統,其中該ALU階段經調適成執行 一網底功能。 26·如請求項22之圖形系統,其中該ALU階段執行一霧化運 算、紋理映射、透明度混色、z測試或一透明度測試中之 至少一者。 101822.doc 1297468 27·如請求項22之圖形系統,其中該光栅階段對每一該像素 產生至少一像素封包列,每一列具有在一特定時脈週期 中被發送至該ALU階段之複數個像素封包。 28·如請求項22之圖形系統,其進一步包含一閘道管理器階 奴該閘道官理器階段具有一用於追蹤自該光栅階段所 接收之像素封包的記分板。 29·如請求項22之圖形系統,其進一步包含: 一資料提取階段,其用以提取該等像素封包之資料; | 一 ί料寫入階段,其用以執行自該ALU階段所接收之經 處理之像素封包的像素資料之一記憶體寫入;及 一自該資料寫入階段至該閘道管理器階段的再循環路 徑,其用於為額外通過該ALU階段一次而再循環像素封 包。 30.如凊求項29之圖形系統,其中關於在該ALU階段中被標 記為取消之像素封包的資訊被提供至該記分板以解決取 消之封包像素。 _ 31· -種嵌入式處理器,其包含: 一暫存器介面,其用於一主機以程式化一圖形核心之 暫存器; 一主機介面,其用於一主機以與該圖形核心進行通信; δ己憶體介面,其用於該圖形核心以讀取及寫入資料; 一可程式之圖形處理器,其安置於該圖形核心中,該 可程式之圖形處理器包含: 至少一階段,其用以設定並轉換待光栅化之圖原的 101822.doc 1297468 頂點; 一光柵階段,其接收關於待光柵化之圖原之資料 該光柵單元為待使用一圖形運算來處理之每一像素產 生至少一像素封包列,該圖形運算可被表達為一純旦 算術運算序列,每一像素封包包括識別待處理之至少 -像素屬性的有效負載資訊’且每一列具有識別待二 該列之每-該像素封包上執行之至少一指令的相關聯 之資訊; ❿ -閉道管理器,其包括—用以追蹤像素封包之該處 理之計分板; -資料提取階段,其用以提取每一像素封包列之資 料; ' - ALU階段,其包含複數個可程式之算術邏輯單元 (則)以用於處理每—該像素封包列,每—彻接收一 輸入像素封包列並輸出一輸出像素封包列,每一ALU I:所=之:素封包列讀取至少-運算元,使用該 來執行—純量算術 並執行將該結果寫入至 “果 a 匕日守值及使用該結果來更新 «亥輸出列之一屬性暫存 一資料寫入階段,盆 ilt ^ ^ ^ ^ ^ ,、 乂執仃自該複數個ALU所接 
收之經處理之像素封 其中在該複數個像素封:之一記憶體寫入; 執行該圖形處理功能:、、匕上執行一算術運算序列以 32.如請求項31之嵌入式處理 w ’其中該複數個ALU係以一 101822.doc 1297468 管線來配置。 33. =請求項31之嵌人式處理器,其進-步包含-再循環路 。再循%路徑用以自該資料寫入階段至該閘道管理 =階段再循環像素封包,藉以可使用通過該則階段-次以上來處理像素封包。 34. Γ請求項31之嵌入式處理器,其中每-則可組態以允 許選自該列之運曾+ β i 以及—常數值與一臨時值中之至少 一者。 35·ΓΓ項31之嵌入式處理11,其中至少-階段經組態以 取消條件並將像素封包標記為被取消…一 記為被取消。 V用以回應偵測到一像素封包被標 A如請求項31之喪入式處_,其進 _ 理器,其甩於執行— 3 中央處 苴&quot;心仿 目形應用程式以操作該圖形核心, /、中4圖形核心及該中央 37· -種在一像素上執行一圖形處理運::二-晶片上。 對於待在一像素上執行之至少一:^’其包含: 像素封包上執行的 '純量算術運算二 =實識=在 一圖形功能; ^只知该至少 對該像素產生複數個像素封包,每一像 :純量算術運算序列中待作為運算元處理之一:: 屬=,該複數個像素封包具有—相„之子轉素 在至少-算術邏輯單 7序列, 運算元; W自4錢素封包讀取 10l822.doc 1297468 在δ亥至少_ A Τ τγ » 瞀 、 ALU中,根據該指令序列來執行純量算術運 執^丁該純量算術運算序列以用於實施該至少一圖 形功能。 %如請:項37之方法,其進一步包含: 判疋一像素不需要進一步處理; 夺取消狀態指派給至少一像素封包;及 了用由該至少一像素封包所遇到之每一隨後之ALU中 的算術計算以節省功率。 ^月項37之方法,其中該圖形功能包括一紋理組合、 ^ &amp; ’則&quot;式、一透明度混色、一透明度測試或一霧化中 之至少一者。 40·如請求項37 * 、 法’其中該純量算術運算之形式為a*b + c ,其中a、b、c及d為運算元且*為一乘法運算。 41·種在一像素上執行一圖形處理運算之方法,其包含: 、對於待在-像素上執行之—圖形;力能,朗可在像素 、匕上執行的一純量算術運算序列,以實施該圖形功能; 對在連續時脈週期中待處理之該像素產生至少一像素 ^包列’每-像素封包包㈣於在該純量算術運算序列 待作為運算元處理之一子組像素屬性的至少一搁位, 該至少一列具有一相關聯之指令序列; 在複數個算術邏輯單元(ALU)之每_者中,讀 之運算元’該等運算元中之至少一者對應於一自 封包列内之一像素封包所讀取之運算元; ’、 在每-該ALU中,根據該指令序列在該等所指派之運曾 101822.doc 1297468 42. 43.The arithmetic processing in the middle, , and 被 is deactivated in response to (4) to - the pixel packet is marked as cancelled. 18. The graphics processor of claim 12, wherein the scalar arithmetic operation has the form ^ c d, wherein a, b, c &amp; d are operands and * is a multiplication operation. 19. The graphics processor of claim 12, wherein the alu phase is adapted to perform a network function. 20. The graphics processor of claim 12, wherein the ALU stage performs at least one of atomization, texture mapping, transparency blending, :z testing, or a transparency test. 21. The graphics processor of claim 12, wherein the raster stage and the plurality of ALUs are adapted to convert a vector arithmetic operation into a scalar arithmetic operation sequence. 22. 
A graphics system, comprising: a central processing unit having a graphics software module; a programmable graphics processor receiving vertex information from the graphics software module and for programming the programmable The process of the stage of the graphics processor 101822.doc 1297468 The instructional program 'The programmable graphics processor comprises: 尤树丨自丨, which generates a plurality of pixel seals for each 待I to be processed in response to the An instruction of the graphics software module, each image packet includes identifying at least a "pixel attribute" to be processed: a message: and having an associated sideband identifying at least: an instruction to be executed on each of the pixel packets Information; and a programmable logic unit (ALU) stage comprising a plurality of ALUs configured to process the pixel packets, each alu being assigned by the graphics software module to receive The pixel packet reads the selected operand, performs a scalar arithmetic operation in response to the current instruction to generate a result, and executes the result to update a pixel attribute register and store the result And storing at least one of the temporary values, wherein a scalar arithmetic operation sequence is performed on the plurality of pixel packets to perform a graphics processing function on each of the pixels. 23. The graphics system of claim 22, wherein the sideband information comprises a cancellation field, and each ALU is gated in response to detecting a cancellation bit in the cancellation field to reduce the plural The power consumption of the ALU. 24. The graphics system of claim 22, wherein the arithmetic operation is of the form a*b + c*d, wherein a, b, c, and d are operands and * is a multiplication operation. 25. The graphics system of claim 22, wherein the ALU phase is adapted to perform a network function. 26. 
The graphics system of claim 22, wherein the ALU stage performs at least one of a fogging operation, a texture mapping, a transparency blending, a z-test, or a transparency test. The graphics system of claim 22, wherein the raster stage generates at least one pixel packet column for each of the pixels, each column having a plurality of pixels transmitted to the ALU phase in a particular clock cycle Packet. 28. The graphics system of claim 22, further comprising a gateway manager stage slave circuit having a scoreboard for tracking pixel packets received from the raster stage. The graphics system of claim 22, further comprising: a data extraction phase for extracting data of the pixel packets; a write phase for performing the process received from the ALU phase A memory write of one of the pixel data of the processed pixel packet; and a recirculation path from the data write phase to the gateway manager phase for recycling pixel packets for additional pass through the ALU phase. 30. The graphics system of claim 29, wherein information regarding pixel packets marked as cancelled in the ALU phase is provided to the scoreboard to resolve the cancelled packet pixels. An embedded processor, comprising: a scratchpad interface for a host to program a scratchpad of a graphics core; a host interface for a host to perform with the graphics core a δ mnemonic interface for the graphics core to read and write data; a programmable graphics processor disposed in the graphics core, the programmable graphics processor comprising: at least one stage , which is used to set and convert the 101822.doc 1297468 vertex of the primitive to be rasterized; a raster phase that receives information about the original image to be rasterized. 
The raster unit is each pixel to be processed using a graphics operation Generating at least one pixel packet sequence, the graphics operation can be expressed as a purely denier arithmetic operation sequence, each pixel packet includes payload information identifying at least - pixel attributes to be processed, and each column has a signature identifying each of the columns - associated information of at least one instruction executed on the pixel packet; ❿ - a closed channel manager comprising - a scoreboard for tracking the processing of the pixel packet - a data extraction phase for extracting data for each pixel packet column; '- ALU phase, which includes a plurality of programmable arithmetic logic units (then) for processing each of the pixel packet columns, each receiving An input pixel packet column and output an output pixel packet column, each ALU I: = = prime packet column reads at least - an operation element, using the to perform - scalar arithmetic and execution of the result is written to a 守 守 守 守 及 及 及 及 及 及 及 及 及 及 及 及 及 守 守 守 守 守 守 守 守 守 守 守 守 守 守 守 守 守 守 守 守 守 守 守 守 守 守 守 守 守 守 守 守 守 守Sealing in the plurality of pixel seals: one memory write; performing the graphics processing function:, executing an arithmetic operation sequence on the block 32. The embedded process w of claim 31, wherein the plurality of ALU systems Configured in a 101822.doc 1297468 pipeline. 33. = embedded processor of claim 31, with an ingress-recirculation path. The % path is used for the data write phase to the gateway management. = stage recycling pixel packet The pixel packet can be processed by using the stage-times or more. 34. The embedded processor of claim 31, wherein each - is configurable to allow the selected from the column + β i and - often At least one of a value and a temporary value. 35. The embedded process 11 of item 31, wherein at least the stage is configured to cancel the condition and mark the pixel packet as cancelled (one is cancelled). 
In response to detecting that a pixel packet is marked as A, as in the case of request item 31, it enters the processor, and then executes the -3 centrally-sized application to operate the graphics core. /, the middle 4 graphics core and the central 37 - a type of graphics processing on a pixel:: two - on the wafer. For at least one of the pixels to be executed on a pixel: ^' contains: a 'scaling arithmetic operation performed on the pixel packet 2 = actual knowledge = in a graphics function; ^ only knowing that at least a plurality of pixel packets are generated for the pixel, Each image: one of the scalar arithmetic operation sequences to be treated as an operand:: genus =, the plurality of pixel packets have - the phase of the sinusin at least - the arithmetic logic single 7 sequence, the operand; W from 4 money The prime packet reads 10l822.doc 1297468. In the δ hai at least _ A Τ τ γ » 瞀, ALU, the scalar arithmetic operation sequence is executed according to the instruction sequence, and the scalar arithmetic operation sequence is used to implement the at least one graphics function. %, the method of item 37, further comprising: determining that a pixel does not require further processing; assigning a cancel state to at least one pixel packet; and using each subsequent one encountered by the at least one pixel packet Arithmetic calculations in the ALU to save power. ^A method of clause 37, wherein the graphics function includes a texture combination, ^ &amp;'then&quot;, a transparency blend, a transparency test, or an atomization At least one of the following. 40. If the request item 37*, the method 'where the scalar arithmetic operation is of the form a*b + c , where a, b, c, and d are operands and * is a multiplication operation. 
A method for performing a graphics processing operation on a pixel, comprising: a graphics, a power consumption, and a scalar arithmetic operation sequence performed on a pixel and a cymbal to perform on the pixel a graphics function; generating at least one pixel of the pixel to be processed in the continuous clock cycle 'per-pixel packet (four) at least one of the subset of pixel attributes to be processed as an operand in the scalar arithmetic operation sequence a shelf, the at least one column having an associated sequence of instructions; in each of the plurality of arithmetic logic units (ALUs), the read operand 'at least one of the operands corresponds to a self-styled packet The operand read by one of the pixel packets; ', in each ALU, according to the sequence of instructions assigned to the ship 101822.doc 1297468 42. 43. 44. 45. 46.44. 45. 46. 47. 48. 49. το上執行一純量算術計算,以執行該純量算術運算序列 以用於執行該圖形功能。 如請求項41之方法,其中每一該ALU執行以下步驟中之 至少一者:使用該純量算術運算之一結果來更新一像素 封包;及儲存該純量算術運算之一結果,以在一稍後之 時脈週期中所執行之一算術運算中用作一運算元。 如睛求項41之方法,其進一步包含: 識別一不需要進一步處理之像素,且作為回應,將該 像素之至少一像素封包標記為取消;及 j每一ALU中,停用已標記為被取消的像素封包之算術 s十身·。 t清求項43之方法’其中該識別係在-資料提取階段中 來執行。 項43之方法’其中該識別係在一資料寫入階段中= :41之方法,其進一步包含:指派待讀取之該等 |淋、5〜 #執彳了之-對應之純量 异術運异’以回應該指令序列内之—當前指令。 如睛求項41之方法,盆推一牛 料。 ’、^匕3 :提取像素封包之資 如請求項41之方法,复 八進步包含:對該圖形功台t宜λ 經處理之像素資料。 ㈣功月b寫入 如請求項41之方法,其進 一遞增指人爲一笛-老 匕3 ·在該4ALU中使用 7為第一處理通過而 丹循% —經處理之像素 101822.doc 1297468 封包。 )υ.如 &quot;月求項41之方法,其進一步包含 2-像素封包列上執行該第—圖形功能及該第二圖形 1此’其中一第-組ALu在-第-組像素封包上執行 第一類型之圖形功能,且-第二組ALU在—第二 封包上執仃—第二類型之®形功能。 、47. 48. 49. A scalar arithmetic calculation is performed on το to perform the scalar arithmetic operation sequence for performing the graphics function. The method of claim 41, wherein each of the ALUs performs at least one of: updating a pixel packet using a result of the scalar arithmetic operation; and storing a result of the scalar arithmetic operation to It is used as an operand in one of the arithmetic operations performed in the later clock cycle. 
The method of claim 41, further comprising: identifying a pixel that does not require further processing, and in response, marking at least one pixel packet of the pixel as canceled; and j in each ALU, deactivating has been marked as being Cancel the arithmetic of the pixel packet of the pixel s. The method of claim 43 wherein the identification is performed in the data extraction phase. The method of item 43, wherein the identification is in a data writing stage =: 41, which further comprises: assigning the one to be read, 5, #彳#彳彳之-corresponding scalar This is the same as the current instruction. As a result of the method of item 41, the pot pushes a cow. ‘, ^匕3: Extracting the Pixel Packets As in the method of claim 41, the improvement includes: pixel data that is processed by the graphics table t. (4) The power month b is written as in the method of claim 41, and the increment is referred to as a flute-old 匕3. In the 4ALU, 7 is used for the first process and the dan is followed by the %-processed pixel 101822.doc 1297468 Packet. The method of [00], wherein the method further comprises the step of performing the first graphics function on the 2-pixel packet column and the second graphics 1 wherein the one-group ALU is on the --group pixel packet The first type of graphics function is performed, and the second group of ALUs are executed on the second packet - the second type of the TM function. , 51. 
A method of performing a graphics processing operation, comprising: programming a plurality of arithmetic logic units (ALUs) to read selected operands from a row of pixel packets and to perform a selected scalar arithmetic operation in response to a selected current instruction associated with the row of pixel packets; for at least one graphics operation to be performed on a pixel, identifying at least one corresponding scalar arithmetic operation to be performed on a subset of attributes of the pixel; generating a row of pixel packets for the pixel, each pixel packet including a field for at least one attribute associated with the pixel that is to be processed as at least one operand, the pixel packet having an associated current instruction indicating a sequence of scalar arithmetic operations to be performed; and, in the ALUs, reading the selected operands in the row of pixel packets and performing the selected scalar arithmetic operation corresponding to the associated current instruction.

52. The method of claim 51, further comprising: programming a data fetch stage to fetch data for pixel packets.

53. The method of claim 51, further comprising: programming the raster stage to map the graphics operation to a pixel packet assignment and an associated instruction.

54. The method of claim 51, further comprising: programming at least one ALU to perform a scalar comparison test and, if a pixel packet fails the scalar comparison test, marking the pixel packet as canceled.
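As an illustration of claims 51 to 54, the following is a minimal sketch of one ALU step over a row of pixel packets. The class names, the opcode strings, and the choice to cancel every packet in the row when the comparison fails are assumptions made for the example; the claims do not prescribe them.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PixelPacket:
    fields: Dict[str, float]        # a subset of the pixel's attributes
    canceled: bool = False          # set once the pixel needs no further work


@dataclass
class Instruction:
    op: str                         # hypothetical opcodes: "add", "mul", "cmp_ge"
    src_a: Tuple[int, str]          # (packet index in the row, field name)
    src_b: Tuple[int, str]
    dst: Tuple[int, str] = (0, "")  # unused by "cmp_ge"


@dataclass
class Row:
    packets: List[PixelPacket]
    instruction: Instruction        # the row's associated current instruction


def alu_execute(row: Row) -> Row:
    """Read the selected operands and apply one scalar operation to the row."""
    if any(p.canceled for p in row.packets):
        return row                  # arithmetic is skipped for canceled packets
    ins = row.instruction
    a = row.packets[ins.src_a[0]].fields[ins.src_a[1]]
    b = row.packets[ins.src_b[0]].fields[ins.src_b[1]]
    if ins.op == "cmp_ge":
        # Scalar comparison test: a failing packet is marked as canceled.
        if not a >= b:
            for p in row.packets:
                p.canceled = True
        return row
    if ins.op == "add":
        result = a + b
    elif ins.op == "mul":
        result = a * b
    else:
        raise ValueError(f"unknown opcode: {ins.op}")
    row.packets[ins.dst[0]].fields[ins.dst[1]] = result
    return row
```

In this sketch the instruction travels with the row, so the same ALU can apply a different scalar operation to each row it receives.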
55. A method of performing a graphics processing operation on a pixel, comprising: for at least one graphics function to be performed on a pixel, identifying scalar arithmetic operations that can be performed on pixel packets to implement the at least one graphics function; generating at least one row of pixel packets for the pixel to be processed, each pixel packet including at least one field for a subset of pixel attributes to be processed as operands, the at least one row having an associated instruction sequence; in each of a plurality of arithmetic logic units (ALUs), reading assigned operands, at least one of the operands corresponding to an operand read from a pixel packet within a row of pixel packets; in each ALU, performing a scalar arithmetic computation on the assigned operands according to the instruction sequence; for a selected scalar arithmetic operation requiring a result in the range [0, 1], formatting the corresponding operands of the pixel packets in an S1.8 format, the S1.8 format corresponding to a base-2 representation of operands in the range [-2, +2] with an 8-bit fractional component, and clamping the result of the selected scalar arithmetic operation to the [0, 1] range; and for at least one other scalar arithmetic operation, formatting the corresponding pixel packets in a different data format.

56. The method of claim 55, wherein the at least one scalar arithmetic operation requiring a result in the [0, 1] range is a computation on at least one color component represented in the S1.8 format.

57. The method of claim 56, wherein the at least one other scalar arithmetic operation corresponds to an operation on a texture represented in a format different from S1.8.

58. The method of claim 57, wherein each pixel packet has a fixed bit size, a pixel packet for a texture includes high-precision (s, t) texture data, and a pixel packet for color components includes data for at least two color components.

59. A method of performing a graphics processing operation on a pixel, comprising: for a first graphics function to be performed on color components of a pixel, identifying a first sequence of scalar arithmetic operations to implement the first graphics function, the first graphics function requiring a scalar arithmetic operation having a result in the [0, 1] range; for a second graphics function to be performed on a texture associated with a pixel, identifying a second sequence of scalar arithmetic operations to implement the second graphics function; generating at least one row of pixel packets for the pixel, each pixel packet having a fixed bit size of at least 20 bits in length and including at least one field for a subset of pixel attributes to be processed as operands, the at least one row having an associated instruction sequence; for each pixel packet associated with the first graphics function, packing at least two color components in an S1.8 format, the S1.8 format corresponding to a base-2 representation of operands in the range [-2, +2] with an 8-bit fractional component; for each pixel packet associated with the second graphics function, packing a single high-precision texture requiring more than 8 bits; and in each of a plurality of arithmetic logic units (ALUs), reading assigned operands and performing a scalar arithmetic computation on the assigned operands according to the instruction sequence; wherein, for the first graphics function, color components are selected as operands in the S1.8 format and a result is clamped to the [0, 1] range, and, for the second graphics function, the texture is selected as an operand in a format having a precision greater than 8 bits.

60. A graphics processor, comprising: at least one stage to set up and transform vertices of primitives to be rasterized; a raster stage that receives data on the primitives to be rasterized, the raster stage generating at least one row of pixel packets for each pixel to be processed for each graphics operation, each graphics operation being expressible as a sequence of scalar arithmetic operations; a gateway manager including a scoreboard to track the processing of pixel packets; a data fetch stage to fetch data for each row of pixel packets; a plurality of arithmetic logic units (ALUs) to process each row of pixel packets, each ALU receiving an input row of pixel packets and outputting an output row of pixel packets, each ALU reading at least one operand from a received row of pixel packets, performing a scalar arithmetic operation using the at least one operand, generating a result, and performing at least one of writing the result to a temporary value or using the result to update a pixel packet of the output row; and a data write stage to perform a memory write of pixel data of processed pixel packets received from the plurality of ALUs; wherein a sequence of arithmetic operations is performed on the plurality of pixel packets to perform the graphics processing function; the raster stage formats pixel packets for a first type of scalar arithmetic operation in an S1.8 format, the S1.8 format corresponding to a base-2 representation of operands in a range of [-2, +2] with an 8-bit fractional component, and each ALU processing the first type of scalar operation clamps a result to the [0, 1] range; and the raster stage formats pixel packets for a second type of scalar arithmetic operation in a format requiring a precision greater than 8 bits.
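The S1.8 format recited in claims 55 to 60 can be sketched as a 10-bit fixed-point value: one sign bit, one integer bit, and eight fraction bits. The sketch below assumes a two's-complement encoding, round-to-nearest quantization, and a truncating shift after multiplication; none of these details is stated in the claims, and all function names are hypothetical.

```python
FRAC_BITS = 8                            # 8-bit fractional component
TOTAL_BITS = 10                          # sign + 1 integer bit + 8 fraction bits
SCALE = 1 << FRAC_BITS                   # 256 raw steps per unit
RAW_MIN = -(1 << (TOTAL_BITS - 1))       # -512 represents -2.0
RAW_MAX = (1 << (TOTAL_BITS - 1)) - 1    # 511 represents +1.99609375
MASK = (1 << TOTAL_BITS) - 1


def to_s1_8(x: float) -> int:
    """Quantize a real value to a raw S1.8 integer, saturating at the limits."""
    return max(RAW_MIN, min(RAW_MAX, round(x * SCALE)))


def from_s1_8(raw: int) -> float:
    """Convert a raw S1.8 integer back to a real value."""
    return raw / SCALE


def mul_clamped(a: int, b: int) -> int:
    """Multiply two S1.8 operands and clamp the result to [0, 1]."""
    product = (a * b) >> FRAC_BITS       # rescale; the shift rounds toward -infinity
    return max(0, min(SCALE, product))   # raw 0 is 0.0, raw 256 is 1.0


def pack_pair(a: int, b: int) -> int:
    """Pack two 10-bit S1.8 operands into one 20-bit payload."""
    return ((a & MASK) << TOTAL_BITS) | (b & MASK)


def _sign_extend(v: int) -> int:
    return v - (1 << TOTAL_BITS) if v & (1 << (TOTAL_BITS - 1)) else v


def unpack_pair(word: int):
    """Recover the two signed S1.8 operands from a 20-bit payload."""
    return (_sign_extend((word >> TOTAL_BITS) & MASK), _sign_extend(word & MASK))
```

The headroom above 1.0 and below 0.0 lets intermediate operands overshoot the color range, while the clamp keeps the final result in [0, 1]. Two such operands fit in a 20-bit packet payload.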
61. The graphics processor of claim 60, wherein the first type of scalar operation corresponds to an operation on a color component, at least two color components being included in a pixel packet generated for a scalar operation on a color component, and the second type of scalar operation corresponds to an operation on data having a size greater than half the size of the pixel packet.

62. The graphics processor of claim 60, wherein the second type of scalar operation corresponds to an operation on a texture.

63. The graphics processor of claim 62, wherein each pixel packet has a size of at least 20 bits, an attribute of a texture requires at least 16 bits and has an associated 4-bit level of detail, and each S1.8 operand is ten bits, whereby each pixel packet can contain two S1.8 operands for the first type of operation and can contain a set of texture attributes for an operation on the texture for the second type of operation.

64. A graphics processor, comprising: a raster stage that receives data on primitives to be rasterized, the raster stage generating a plurality of pixel packets for each pixel to be processed, each pixel packet including payload information identifying at least one pixel attribute to be processed and having associated sideband information identifying at least one instruction to be executed on each pixel packet; a programmable arithmetic logic unit (ALU) stage for processing the pixel packets, the ALU stage including a plurality of ALUs, each ALU having a set of at least one possible arithmetic operation, the set of arithmetic operations being performed on an incoming pixel packet having a corresponding current instruction command; a data fetch stage to fetch data for the pixel packets; a data write stage to perform a memory write of pixel data of processed pixel packets received from the ALU stage; a first distributor coupled to respective inputs of the ALU stage, the data fetch stage, and the data write stage; and a second distributor coupled to respective outputs of the ALU stage, the data fetch stage, and the data write stage; the first distributor and the second distributor being adapted to reconfigure a process flow of pixel packets through the data fetch stage, the ALU stage, and the ALU write stage in response to a command from a host.

65. The graphics processor of claim 64, wherein the first distributor and the second distributor are adapted to permit at least a portion of the graphics processor to be bypassed in response to a software command.

66. The graphics processor of claim 65, wherein the first distributor and the second distributor are adapted to permit the data fetch stage to be bypassed in response to a software command.

67. The graphics processor of claim 64, wherein the first distributor and the second distributor are coupled to each ALU in the ALU stage to permit an ALU execution order to be assigned in response to a software command.

68. The graphics processor of claim 64, wherein the graphics processor is adapted to permit the process flow to be reconfigured to have a data fetch after a scalar arithmetic operation.

69. The graphics processor of claim 64, wherein the graphics processor is adapted to permit the host to reconfigure the process flow to have a data fetch before a scalar arithmetic operation.

70. A method of operating a graphics pipeline, the graphics pipeline having: a rasterizer for generating pixel packets; a data fetch stage for fetching data for pixel packets; an ALU stage having at least one ALU for performing scalar arithmetic operations on pixel packets; a data write stage for writing pixel data; and distributors coupled to the data fetch stage, the data write stage, and the ALU stage; the method comprising: in response to a first command, programming the distributors to define a first process flow of pixel packets through the data fetch stage, the ALU stage, and the data write stage; and in response to a second command, programming the distributors to define a second process flow of pixel packets through the data fetch stage, the ALU stage, and the data write stage; wherein a software host can configure the pipeline with any one of a plurality of process flows.

71. The method of claim 70, wherein the first process flow includes the data fetch stage, the ALU stage, and the data write stage, and the second process flow bypasses the data fetch stage.

72. The method of claim 70, wherein the first process flow includes a first execution order of the ALUs, and the second process flow includes a second execution order of the ALUs.

73. The method of claim 70, wherein the first process flow includes a data fetch before a scalar arithmetic operation, and the second process flow includes a data fetch after a scalar arithmetic operation.

74. A method of operating a graphics pipeline, the graphics pipeline having: a rasterizer for generating pixel packets; a data fetch stage for fetching data for pixel packets; an ALU stage having at least one ALU for performing scalar arithmetic operations on pixel packets; a data write stage for writing pixel data; and distributors coupled to the data fetch stage, the data write stage, and the ALU stage; the method comprising: receiving a command from a software host to reconfigure the pipeline from a first process flow of pixel packets through the data fetch stage, the ALU stage, and the data write stage to a second process flow of pixel packets through the data fetch stage, the ALU stage, and the data write stage; and adjusting the distributors to reconfigure the pipeline from the first process flow to the second process flow.

75. The method of claim 74, wherein the first process flow includes the data fetch stage, the ALU stage, and the data write stage, and the second process flow bypasses the data fetch stage.

76. The method of claim 74, wherein the first process flow includes a first execution order of the ALUs, and the second process flow includes a second execution order of the ALUs.

77. The method of claim 74, wherein the first process flow includes a data fetch before a scalar arithmetic operation, and the second process flow includes a data fetch after a scalar arithmetic operation.

78. A graphics processor, comprising: a plurality of elements for processing pixel packets; a first distributor coupled to respective inputs of the plurality of elements; and a second distributor coupled to respective outputs of the plurality of elements; the first distributor and the second distributor being adapted to reconfigure a process flow of pixel packets through the plurality of elements in response to a command from a host.
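The reconfigurable process flow of claims 64 to 78 can be modeled in software as a routing table that a host rewrites. This is only a behavioral sketch: the class, the stage names, and the trace list are hypothetical, and the hardware distributors are reduced to an ordered list of stage names.

```python
from typing import Callable, Dict, List

Packet = Dict[str, list]


def make_stage(name: str) -> Callable[[Packet], Packet]:
    """Build a stand-in stage that only records that the packet visited it."""
    def stage(packet: Packet) -> Packet:
        return {**packet, "trace": packet["trace"] + [name]}
    return stage


class ReconfigurablePipeline:
    def __init__(self, stage_names: List[str]):
        self.stages = {name: make_stage(name) for name in stage_names}
        self.flow = list(stage_names)          # default process flow

    def configure(self, flow: List[str]) -> None:
        """Host command: set the order in which packets visit the stages."""
        unknown = [name for name in flow if name not in self.stages]
        if unknown:
            raise ValueError(f"unknown stages: {unknown}")
        self.flow = list(flow)

    def run(self, packet: Packet) -> Packet:
        """Route one packet through the currently configured flow."""
        for name in self.flow:
            packet = self.stages[name](packet)
        return packet
```

Calling `configure(["alu0", "alu1", "write"])` models bypassing the data fetch stage, and `configure(["alu0", "fetch", "alu1", "write"])` models a data fetch placed after a scalar arithmetic operation.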
79. The graphics processor of claim 78, wherein the elements are stages.

80. A method of operating a graphics pipeline, the graphics pipeline having a plurality of elements for processing pixel packets, the method comprising: in response to a first command, programming distributors to define a first process flow of pixel packets through the plurality of elements; and in response to a second command, programming the distributors to define a second process flow of pixel packets through the plurality of elements; wherein a software host can configure the pipeline with any one of a plurality of process flows.

81. The method of claim 80, wherein the elements are stages.

82. A method of performing a graphics processing operation on pixels, comprising: for graphics functions to be performed on a plurality of pixels, identifying a sequence of scalar arithmetic operations that can be performed on pixel packets to implement the graphics functions; assigning pixels as even pixels or odd pixels; generating at least two rows of pixel packets for each pixel, each pixel packet including at least one field for a subset of pixel attributes to be processed as operands in the sequence of scalar arithmetic operations, the at least two rows having an associated instruction sequence and an identifier indicating whether the pixel packet is for an odd pixel or an even pixel; interleaving rows of pixel packets of an even pixel and an odd pixel in a group of rows of pixel packets, each row in the group being arranged for processing in successive clock cycles; and in each of a plurality of arithmetic logic units (ALUs) of an ALU stage, receiving a row of pixel packets for a current clock cycle and performing a scalar arithmetic computation on at least one operand read from the row of pixel packets according to the instruction sequence; wherein processing of pixel packets is interleaved in the ALUs.
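The even/odd interleaving of claim 82 can be illustrated with a toy schedule in which rows of two pixels alternate, so that a row never immediately follows the row of the same pixel whose result it needs. The accumulate operation, the single temporary per parity, and all names below are assumptions chosen for brevity.

```python
from typing import List, Tuple

Row = Tuple[int, float]    # (identifier: 0 for an even pixel, 1 for an odd pixel; operand)


def interleave(even_rows: List[Row], odd_rows: List[Row]) -> List[Row]:
    """Alternate rows of an even pixel and an odd pixel, one row per clock cycle.

    Assumes both pixels contribute the same number of rows; zip() stops at
    the shorter list.
    """
    schedule: List[Row] = []
    for even_row, odd_row in zip(even_rows, odd_rows):
        schedule.append(even_row)
        schedule.append(odd_row)
    return schedule


class InterleavedALU:
    def __init__(self, scale: float):
        self.constants = {"scale": scale}   # one set shared by both parities
        self.temps = {0: 0.0, 1: 0.0}       # one temporary per parity

    def clock(self, row: Row) -> None:
        """Process one row; the identifier selects which temporary is updated."""
        parity, value = row
        self.temps[parity] += value * self.constants["scale"]
```

Because each parity has its own temporary, the two pixels' dependent computations proceed independently while sharing one ALU and one set of constants.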
83. The method of claim 82, wherein a row of pixel packets requires a result from a previous row of pixel packets, and the interleaving is selected to account for ALU latency.

84. The method of claim 82, further comprising: storing in each ALU a shared set of constant values for both even pixels and odd pixels.

85. The method of claim 82, further comprising: storing in each ALU a first set of temporary values for odd pixels and a second set of temporary values for even pixels.

86. The method of claim 85, further comprising: using the identifier to select the first set of temporary values for pixel packets of even pixels and the second set of temporary values for pixel packets of odd pixels.

87. The method of claim 85, further comprising: storing the temporary values for a length of time sufficient to emulate a constant register.

88. A method of performing a register write of an identifier in elements of a configurable graphics pipeline, the configurable graphics pipeline having more than one possible process flow of pixel packets through the elements of the graphics pipeline, the method comprising: receiving a data packet that triggers the elements of the graphics pipeline to discover, for each element, an identifier indicating the position of the element within the process flow; and each element writing, in a configuration register, an identifier indicating a relative position within the process flow.

89. The method of claim 88, wherein the data packet that triggers the elements is generated in response to a software command.
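The identifier discovery of claims 88 and 89 can be sketched as a packet carrying a counter that each element latches and increments before forwarding. The `Element` class, the dictionary packet, and the starting value of 0 are assumptions for illustration only.

```python
from typing import Dict, List, Optional


class Element:
    """A pipeline element with one configuration register."""

    def __init__(self, name: str):
        self.name = name
        self.config_register: Optional[int] = None

    def on_id_packet(self, packet: Dict[str, int]) -> Dict[str, int]:
        # Latch the current identifier, then forward an incremented copy.
        self.config_register = packet["id"]
        return {"id": packet["id"] + 1}


def broadcast_ids(flow: List[Element], start: int = 0) -> int:
    """Send one identification packet along the configured process flow.

    Returns the identifier value that leaves the last element.
    """
    packet = {"id": start}
    for element in flow:
        packet = element.on_id_packet(packet)
    return packet["id"]
```

After the process flow is reconfigured, sending the packet again gives every element its new relative position without software having to address each register individually.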
90. The method of claim 88, wherein the data packet is injected into the pipeline at a point upstream of the elements that require configuration information, and each successive element requiring configuration information reads a current value of an identifier in the data packet, writes the current value to its configuration register, increments the identifier, and forwards the data packet with the incremented identifier to the next element of the process flow.

91. The method of claim 88, wherein the elements comprise arithmetic logic units for processing pixel packets.

92. A graphics processor, comprising: a raster stage that receives data on primitives to be rasterized, the raster stage generating a plurality of pixel packets for each pixel to be processed, each pixel packet including payload information identifying at least one pixel attribute to be processed and having associated sideband information identifying at least one instruction to be executed on each pixel packet; a programmable arithmetic logic unit (ALU) stage for processing the pixel packets, the ALU stage including a plurality of ALUs, each ALU having a set of at least one possible arithmetic operation, the set of arithmetic operations being performed on an incoming pixel packet having a corresponding current instruction command; a data fetch stage to fetch data for the pixel packets; a data write stage to perform a memory write of pixel data of processed pixel packets received from the ALU stage; a first distributor coupled to respective inputs of the ALU stage, the data fetch stage, and the data write stage; and a second distributor coupled to respective outputs of the ALU stage, the data fetch stage, and the data write stage; the first distributor and the second distributor being adapted to reconfigure a process flow of pixel packets through the data fetch stage, the ALU stage, and the ALU write stage in response to a command from a host; wherein each ALU of the ALU stage is adapted to receive an identification packet initiated by a software identification code, each ALU writing a current value of an identifier of the identification packet into a configuration register, incrementing the identifier, and forwarding the identification packet to the next ALU.

93. A graphics processor, comprising: a graphics pipeline having a set of tap points associated with elements of the graphics pipeline; and a configurable test point selector that receives commands from a software host, the configurable test point selector being adapted to monitor a subset of tap points selected by a software command and to count statistics for at least one condition associated with each tap point of the subset of tap points; wherein statistics for a subset of tap points are collected for the software host.

94. The graphics processor of claim 93, wherein the graphics pipeline includes a chain of arithmetic logic units (ALUs) for processing pixel packets.

95. The graphics processor of claim 94, wherein the subset of tap points consists of two tap points associated with successive elements in the graphics pipeline.

96. The graphics processor of claim 95, wherein the two tap points are monitored for a valid signal indicating that payload data is flowing from a first element to a second element and a ready signal indicating whether the second element is able to receive the payload data.

97. The graphics processor of claim 96, wherein a transfer state is counted for each clock cycle in which a valid condition and a ready condition are detected, and a wait state is counted for each clock cycle in which a valid condition and a not-ready condition are detected.

98. The graphics processor of claim 94, wherein a transfer state and a wait state are monitored at each of the two selected tap points associated with the ALUs.
99. The graphics processor of claim 98, wherein the transfer state corresponds to a downstream ALU in a ready state and an upstream ALU in a valid state.

100. The graphics processor of claim 93, further comprising a trace register set by the host to enable statistics collection.

101. The graphics processor of claim 93, wherein the configurable test point selector includes at least one counter.

102. A method of monitoring a graphics processor, comprising: receiving a command selecting two test points associated with a first element capable of sending a payload to a second element; monitoring the two test points; and collecting statistics on at least two conditions associated with the first element and the second element.

103. The method of claim 102, wherein the first condition is a valid signal for the first element indicating a valid payload, and the second condition is a ready signal for the second element indicating that the second element is ready to receive a payload.

104. The method of claim 103, wherein a transfer state is counted for each clock cycle in which a valid signal and a ready signal are present, and a wait state is counted for each clock cycle in which a valid signal is present but no ready signal is present.

105. The method of claim 102, wherein the collecting of statistics is performed for a preselected mode of operation of the graphics processor.

106. The method of claim 102, further comprising: receiving a command to enable statistics collection for the two test points.

107. The method of claim 106, wherein statistics collection is enabled for a selected graphics operation of the graphics processor, whereby statistics associated with the graphics operation are collected.

108. The method of claim 102, wherein the collecting of statistics is enabled in response to a software host setting a value of a trace register.
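The statistics collection of claims 93 to 108 comes down to counting, per clock cycle, whether a valid payload was accepted (transfer) or held back (wait). The sketch below is a software model under stated assumptions: the `TapSelector` class, the tap names, and the boolean `trace_enabled` flag standing in for the trace register are all hypothetical.

```python
from typing import Dict, Optional, Tuple


class TapSelector:
    """Counts transfer and wait states between two successive pipeline elements."""

    def __init__(self) -> None:
        self.trace_enabled = False                  # stands in for the trace register
        self.taps: Optional[Tuple[str, str]] = None
        self.counts = {"transfer": 0, "wait": 0}

    def select(self, valid_tap: str, ready_tap: str) -> None:
        """Host command: choose the two tap points to monitor and reset the counts."""
        self.taps = (valid_tap, ready_tap)
        self.counts = {"transfer": 0, "wait": 0}

    def clock(self, signals: Dict[str, bool]) -> None:
        """Sample the selected taps once per clock cycle."""
        if not self.trace_enabled or self.taps is None:
            return
        valid = signals[self.taps[0]]
        ready = signals[self.taps[1]]
        if valid and ready:
            self.counts["transfer"] += 1            # payload moves downstream
        elif valid:
            self.counts["wait"] += 1                # payload is stalled
```

A high wait count relative to the transfer count at a given pair of taps suggests that the downstream element is the bottleneck for the monitored operation.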
TW094115854A 2004-05-14 2005-05-16 Graphics processor, graphics system, embedded processor, method of performing a graphics processing operation, method of operating a graphics pipeline, method of performing a register write, and method of monitoring a graphics processor TWI297468B (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US10/846,226 US7268786B2 (en) 2004-05-14 2004-05-14 Reconfigurable pipeline for low power programmable processor
US10/845,714 US7250953B2 (en) 2004-05-14 2004-05-14 Statistics instrumentation for low power programmable processor
US10/846,097 US7091982B2 (en) 2004-05-14 2004-05-14 Low power programmable processor
US10/846,106 US7389006B2 (en) 2004-05-14 2004-05-14 Auto software configurable register address space for low power programmable processor
US10/846,110 US7142214B2 (en) 2004-05-14 2004-05-14 Data format for low power programmable processor
US10/846,334 US7199799B2 (en) 2004-05-14 2004-05-14 Interleaving of pixels for low power programmable processor

Publications (2)

Publication Number Publication Date
TW200609842A TW200609842A (en) 2006-03-16
TWI297468B true TWI297468B (en) 2008-06-01

Family

ID=35429081

Family Applications (1)

Application Number Title Priority Date Filing Date
TW094115854A TWI297468B (en) 2004-05-14 2005-05-16 Graphics processor, graphics system, embedded processor, method of performing a graphics processing operation, method of operating a graphics pipeline, method of performing a register write, and method of monitoring a graphics processor

Country Status (6)

Country Link
EP (1) EP1759380B1 (en)
JP (1) JP4914829B2 (en)
KR (1) KR100865811B1 (en)
AT (1) ATE534114T1 (en)
TW (1) TWI297468B (en)
WO (1) WO2005114646A2 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8570332B2 (en) 2009-05-25 2013-10-29 Institute For Information Industry Graphics processing system with power-gating control function, power-gating control method, and computer program products thereof
TWI457843B (en) * 2010-04-30 2014-10-21 Applied Materials Inc Methods for monitoring processing equipment, and computer readable medium for recording related instructions thereon
US10430919B2 (en) 2017-05-12 2019-10-01 Google Llc Determination of per line buffer unit memory allocation

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8687010B1 (en) 2004-05-14 2014-04-01 Nvidia Corporation Arbitrary size texture palettes for use in graphics systems
US8743142B1 (en) 2004-05-14 2014-06-03 Nvidia Corporation Unified data fetch graphics processing system and method
US8736620B2 (en) 2004-05-14 2014-05-27 Nvidia Corporation Kill bit graphics processing system and method
US7091982B2 (en) 2004-05-14 2006-08-15 Nvidia Corporation Low power programmable processor
US8736628B1 (en) 2004-05-14 2014-05-27 Nvidia Corporation Single thread graphics processing system and method
US7805589B2 (en) * 2006-08-31 2010-09-28 Qualcomm Incorporated Relative address generation
US8537168B1 (en) 2006-11-02 2013-09-17 Nvidia Corporation Method and system for deferred coverage mask generation in a raster stage
US8314803B2 (en) 2007-08-15 2012-11-20 Nvidia Corporation Buffering deserialized pixel data in a graphics processor unit pipeline
US8736624B1 (en) 2007-08-15 2014-05-27 Nvidia Corporation Conditional execution flag in graphics applications
US8775777B2 (en) 2007-08-15 2014-07-08 Nvidia Corporation Techniques for sourcing immediate values from a VLIW
US9183607B1 (en) 2007-08-15 2015-11-10 Nvidia Corporation Scoreboard cache coherence in a graphics pipeline
US20090046105A1 (en) * 2007-08-15 2009-02-19 Bergland Tyson J Conditional execute bit in a graphics processor unit pipeline
US8521800B1 (en) 2007-08-15 2013-08-27 Nvidia Corporation Interconnected arithmetic logic units
US8599208B2 (en) 2007-08-15 2013-12-03 Nvidia Corporation Shared readable and writeable global values in a graphics processor unit pipeline
US8698823B2 (en) * 2009-04-08 2014-04-15 Nvidia Corporation System and method for deadlock-free pipelining
US8471858B2 (en) * 2009-06-02 2013-06-25 Qualcomm Incorporated Displaying a visual representation of performance metrics for rendered graphics elements
US9411595B2 (en) 2012-05-31 2016-08-09 Nvidia Corporation Multi-threaded transactional memory coherence
US9824009B2 (en) 2012-12-21 2017-11-21 Nvidia Corporation Information coherency maintenance systems and methods
US10102142B2 (en) 2012-12-26 2018-10-16 Nvidia Corporation Virtual address based memory reordering
US9317251B2 (en) 2012-12-31 2016-04-19 Nvidia Corporation Efficient correction of normalizer shift amount errors in fused multiply add operations
US9805478B2 (en) 2013-08-14 2017-10-31 Arm Limited Compositing plural layer of image data for display
GB2517185B (en) * 2013-08-14 2020-03-04 Advanced Risc Mach Ltd Graphics tile compositing control
US9569385B2 (en) 2013-09-09 2017-02-14 Nvidia Corporation Memory transaction ordering
US10483981B2 (en) * 2016-12-30 2019-11-19 Microsoft Technology Licensing, Llc Highspeed/low power symbol compare

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5230039A (en) * 1991-02-19 1993-07-20 Silicon Graphics, Inc. Texture range controls for improved texture mapping
US5611038A (en) * 1991-04-17 1997-03-11 Shaw; Venson M. Audio/video transceiver provided with a device for reconfiguration of incompatibly received or transmitted video and audio information
US7068272B1 (en) * 2000-05-31 2006-06-27 Nvidia Corporation System, method and article of manufacture for Z-value and stencil culling prior to rendering in a computer graphics processing pipeline
US6771264B1 (en) * 1998-08-20 2004-08-03 Apple Computer, Inc. Method and apparatus for performing tangent space lighting and bump mapping in a deferred shading graphics processor
WO2001042903A1 (en) * 1999-12-07 2001-06-14 Hitachi, Ltd. Data processing apparatus and data processing system
JP2003030641A (en) * 2001-07-19 2003-01-31 Nec System Technologies Ltd Plotting device, parallel plotting method therefor and parallel plotting program
US6909432B2 (en) * 2002-02-27 2005-06-21 Hewlett-Packard Development Company, L.P. Centralized scalable resource architecture and system

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8570332B2 (en) 2009-05-25 2013-10-29 Institute For Information Industry Graphics processing system with power-gating control function, power-gating control method, and computer program products thereof
TWI457843B (en) * 2010-04-30 2014-10-21 Applied Materials Inc Methods for monitoring processing equipment, and computer readable medium for recording related instructions thereon
US10430919B2 (en) 2017-05-12 2019-10-01 Google Llc Determination of per line buffer unit memory allocation
TWI684132B (en) * 2017-05-12 2020-02-01 美商谷歌有限責任公司 Determination of per line buffer unit memory allocation
US10685423B2 (en) 2017-05-12 2020-06-16 Google Llc Determination of per line buffer unit memory allocation
TWI750557B (en) * 2017-05-12 2021-12-21 美商谷歌有限責任公司 Determination of per line buffer unit memory allocation

Also Published As

Publication number Publication date
EP1759380B1 (en) 2011-11-16
WO2005114646A2 (en) 2005-12-01
EP1759380A2 (en) 2007-03-07
KR20070028368A (en) 2007-03-12
KR100865811B1 (en) 2008-10-28
WO2005114646A3 (en) 2007-05-24
EP1759380A4 (en) 2009-01-21
ATE534114T1 (en) 2011-12-15
JP4914829B2 (en) 2012-04-11
TW200609842A (en) 2006-03-16
JP2007538319A (en) 2007-12-27

Similar Documents

Publication Publication Date Title
TWI297468B (en) Graphics processor, graphics system, embedded processor, method of performing a graphics processing operation, method of operating a graphics pipeline, method of performing a register write, and method of monitoring a graphics processor
US7969446B2 (en) Method for operating low power programmable processor
JP4639232B2 (en) Improved scalability in fragment shading pipeline
EP3274966B1 (en) Facilitating true three-dimensional virtual representation of real objects using dynamic three-dimensional shapes
US20080204461A1 (en) Auto Software Configurable Register Address Space For Low Power Programmable Processor
CN101620725B (en) Hybrid multisample/supersample antialiasing
CN108206937B (en) Method and device for improving intelligent analysis performance
US9916634B2 (en) Facilitating efficient graphics command generation and execution for improved graphics performance at computing devices
CN109658492A (en) For the geometry based on the rendering system pieced together to the moderator that tiles
US9990691B2 (en) Ray compression for efficient processing of graphics data at computing devices
US20190286563A1 (en) Apparatus and method for improved cache utilization and efficiency on a many core processor
TWI596569B (en) Facilitating dynamic and efficient pre-launch clipping for partially-obscured graphics images on computing devices
US10796483B2 (en) Identifying primitives in input index stream
CN106575221A (en) Method and apparatus for unstructured control flow for SIMD execution engine
EP3374961A2 (en) Facilitating efficeint centralized rendering of viewpoint-agnostic graphics workloads at computing devices
TW200912798A (en) Systems and methods for managing texture data in computer
US20050253873A1 (en) Interleaving of pixels for low power programmable processor
CN109478137B (en) Apparatus and method for shared resource partitioning by credit management
US7250953B2 (en) Statistics instrumentation for low power programmable processor
WO2017112030A1 (en) Clustered color compression for efficient processing of graphics data at computing devices
TW201137786A (en) System and method for improving throughput of a graphics processing unit
US11748911B2 (en) Shader function based pixel count determination
Fuetterling et al. Accelerated single ray tracing for wide vector units
TWI616844B (en) Facilitating culling of composite objects in graphics processing units when such objects produce no visible change in graphics images
Claus et al. High performance FPGA based optical flow calculation using the census transformation