US20080204461A1 - Auto Software Configurable Register Address Space For Low Power Programmable Processor - Google Patents
- Publication number
- US20080204461A1 (application US12/115,789)
- Authority
- US
- United States
- Prior art keywords
- alu
- stage
- pixel
- alus
- identification packet
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/005—General purpose rendering architectures
Definitions
- the present invention is generally related to programmable processors. More particularly, the present invention is directed towards low power programmable processors for graphics applications.
- Geometrical primitives e.g., triangles
- Rendering (drawing) primitives includes interpolating parameters, such as depth and color, over each two-dimensional projection of a primitive.
- FIG. 1 is a prior art drawing of a traditional pipeline architecture which is a “deep” pipeline having stages dedicated to performing specific functions.
- a transform stage 105 performs geometrical calculations of primitives and may also perform a clipping operation.
- a setup/raster stage 110 rasterizes the primitives.
- a texture address stage 115 and a texture fetch stage 120 are utilized for texture mapping.
- a fog stage 130 implements a fog algorithm.
- An alpha test stage 135 performs an alpha test.
- a depth test stage 140 performs a depth test for culling occluded pixels.
- An alpha blend stage 145 performs an alpha blend color combination algorithm.
- a memory write stage 150 writes the output of the pipeline.
- the traditional GPU pipeline architecture illustrated in FIG. 1 is typically optimized for fast texturing using the OpenGL™ graphics language.
- a benefit of a deep pipeline architecture is that it permits fast, high quality rendering of even complex scenes.
- the conventional deep pipeline architecture illustrated in FIG. 1 is unsuitable for many graphics applications, such as implementing three-dimensional games on wireless phones and PDAs.
- a configurable graphics pipeline has more than one possible process flow of pixel packets through elements of the graphics pipeline.
- a data packet triggers an element of the graphics pipeline to discover an identifier.
- a received data packet triggers elements of the graphics pipeline to discover an identifier for each element indicative of the location of the element within the process flow.
- Each element writes an identifier in a configuration register indicative of its relative location within the process flow.
- an element reads a current value of an identifier in a data packet, writes the current value to a configuration register, increments the identifier, and forwards the data packet with an incremented identifier to the next element of the process flow.
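The read, write, increment, and forward sequence described above can be sketched as follows. This is a minimal illustration only: the names `IdPacket`, `Element`, `config_register`, and `discover` are hypothetical and do not come from the patent, and the starting identifier of 0 is an assumption.

```python
from dataclasses import dataclass


@dataclass
class IdPacket:
    """Hypothetical identification packet carrying a running identifier."""
    identifier: int = 0


class Element:
    """Hypothetical pipeline element with a single configuration register."""

    def __init__(self, name):
        self.name = name
        self.config_register = None

    def on_id_packet(self, packet):
        # Read the current value and store it in the configuration register.
        self.config_register = packet.identifier
        # Forward a packet whose identifier has been incremented.
        return IdPacket(packet.identifier + 1)


def discover(process_flow, start=0):
    """Send one identification packet through the elements in flow order."""
    packet = IdPacket(start)
    for element in process_flow:
        packet = element.on_id_packet(packet)
    return packet


# The process flow, not the physical naming, decides each identifier.
flow = [Element("alu_2"), Element("alu_0"), Element("alu_1")]
discover(flow)
print([(e.name, e.config_register) for e in flow])
```

After the packet has passed through, each element holds its relative position in the current process flow, so the identifiers change automatically if the flow is reordered and the packet is sent again.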
- FIG. 1 is a diagram of a prior art pipeline for three-dimensional graphics
- FIG. 2 is a block diagram of an integrated circuit including a programmable graphics processor in accordance with one embodiment of the present invention
- FIG. 3 is a block diagram of a programmable graphics processor in accordance with one embodiment of the present invention.
- FIG. 4 illustrates exemplary pixel packets in accordance with one embodiment of the present invention
- FIG. 5 illustrates an exemplary arrangement of pixel packets into rows of a group of pixel packets in accordance with one embodiment of the present invention
- FIG. 6 is a block diagram of a single Arithmetic Logic Unit in accordance with one embodiment of the present invention.
- FIG. 7 is a block diagram of a sequence of two Arithmetic Logic Units in accordance with one embodiment of the present invention.
- FIG. 8 is a block diagram of a configurable programmable graphics processor in accordance with one embodiment of the present invention.
- FIG. 9 illustrates interleaving of rows of pixel packets in accordance with one embodiment of the present invention.
- FIG. 10 is a block diagram illustrating Arithmetic Logic Units having configuration registers in accordance with one embodiment of the present invention.
- FIG. 11 is a block diagram illustrating a configurable test point selector in accordance with one embodiment of the present invention.
- FIG. 2 is a block diagram of one embodiment of the present invention.
- a programmable graphics processor 205 is coupled to a register interface 210, a host interface 220, and a memory interface, such as a direct memory access (DMA) engine 230 for memory read/write operations with a graphics memory (not shown), such as a frame buffer.
- Host interface 220 permits programmable graphics processor 205 to receive commands for generating graphical images from a host. For example, the host may send vertex data, commands and program instructions to programmable graphics processor 205 .
- a memory interface, such as a DMA engine 230 permits read/write operations to be performed with a graphics memory (not shown).
- Register interface 210 provides access to the registers of programmable graphics processor 205.
- Programmable graphics processor 205 may be implemented as part of a system 290 that includes at least one other central processing unit 260 executing a software application 270 that acts as the host for programmable graphics processor 205 .
- An exemplary system 290 may, for example, comprise a handheld unit, such as a cell phone or personal digital assistant (PDA).
- software application 270 may include a graphics application 275 for generating graphical images on a display 295 .
- software application 270 may include graphics processor management software application 280 for performing management functions associated with programmable graphics processor 205 , such as for example, pipeline re-configuration, register configuration, and testing.
- programmable graphics processor 205, register interface 210, host interface 220, and DMA engine 230 are part of an embedded graphics processing core 250 formed on a single integrated circuit 200 which includes a host, such as an integrated circuit 200 formed on a chip including a central processing unit 260 having software 270 resident on a memory.
- graphics processing core 250 may be disposed on a first integrated circuit and CPU 260 disposed on a second integrated circuit.
- FIG. 3 is a block diagram illustrating in more detail a programmable graphics processor 205 in accordance with one embodiment of the present invention. It includes a setup stage 305 , a raster stage 310 , a gatekeeper stage 320 , a data fetch stage 330 , Arithmetic Logic Unit (ALU) stage 340 , a data write stage 355 , and a recirculation path 360 .
- programmable graphics processor 205 includes ALUs 350 configured to execute a shader program to implement three-dimensional graphics operations such as a texture combine, fog, alpha blend (e.g., color blending), alpha test (e.g., color test), Z depth test, or other shading algorithms.
- programmable graphics processor 205 may also be configured to perform other types of processing operations.
- a setup stage 305 receives instructions from a host, such as a software application running on integrated circuit 200 .
- setup stage 305 performs the functions of geometrical transformation of coordinates (X-form), clipping, and setup.
- the setup unit takes vertex information (e.g., x, y, z, color and/or texture attributes) and applies a user defined view transform to calculate screen space coordinates for each geometrical primitive (hereinafter described as triangles because primitives are typically implemented as triangles), which is then sent to the raster stage 310 to draw the given triangle.
- a vertex buffer 308 may be included to provide a buffer for vertex data used by setup stage 305 .
- setup stage 305 sets up barycentric coefficients.
- setup stage 305 is a floating point Very Long Instruction Word (VLIW) machine that supports 32-bit IEEE floating point, S15.16 fixed point and packed 0.8 formats.
- Raster stage 310 receives data from setup stage 305 regarding triangles that are to be rendered (e.g., converted into pixels).
- an instruction RAM (not shown) may, for example, be included in raster stage 310 for programming instructions for raster stage 310 .
- Raster stage 310 processes each pixel of a given triangle and determines parameters that need to be calculated for a pixel as part of rendering, such as calculating color, texture, alpha-test, alpha-blend, z-depth test, and fog parameters.
- raster stage 310 calculates barycentric coefficients for pixel packets. In a barycentric coordinate system, distances in a triangle are measured with respect to its vertices. The use of barycentric coefficients reduces the required dynamic range, which permits using fixed-point calculations that require less power than floating point calculations.
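As a rough illustration of why barycentric coefficients suit fixed-point arithmetic, the sketch below computes them with integers only. Inside the triangle each coefficient lies in [0, 1], so a fixed number of fractional bits covers the whole range. The 16 fractional bits, the integer screen coordinates, and the function names are assumptions made for this example, not details from the patent.

```python
FRAC_BITS = 16  # assumed fixed-point precision for the coefficients


def edge(a, b, p):
    """Twice the signed area of triangle (a, b, p)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def barycentric_fixed(v0, v1, v2, p):
    """Barycentric coefficients of p as integers scaled by 2**FRAC_BITS.

    Vertices and p are assumed to be integer (x, y) tuples.
    """
    area = edge(v0, v1, v2)
    if area == 0:
        raise ValueError("degenerate triangle")
    weights = (edge(v1, v2, p), edge(v2, v0, p), edge(v0, v1, p))
    return tuple((w << FRAC_BITS) // area for w in weights)


def interpolate(coeffs, attrs):
    """Interpolate one integer attribute per vertex using fixed-point weights."""
    return sum(c * a for c, a in zip(coeffs, attrs)) >> FRAC_BITS


coeffs = barycentric_fixed((0, 0), (4, 0), (0, 4), (1, 1))
print(coeffs, interpolate(coeffs, (0, 255, 255)))
```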
- Raster stage 310 generates at least one pixel packet for each pixel of a triangle that is to be processed.
- Each pixel packet includes fields for a payload of pixel attributes required for processing (e.g. color, texture, depth, fog, (x,y) location). Additionally, each pixel packet has associated sideband information including an instruction sequence of operations to be performed on the pixel packet.
- An instruction area in raster stage 310 (not shown) assigns instructions to pixel packets.
- FIG. 4 illustrates exemplary pixel packets 430 and 460 for one pixel.
- raster stage 310 partitions pixel attributes into two or more different types of pixel packets 430 and 460, with each type of pixel packet requiring fields only for pixel attribute data that a particular type of instruction acts on. Partitioning pixel data into smaller units of work reduces bandwidth requirements and also reduces the processing requirements if, for example, only a subset of attributes of a pixel need to be operated on for a particular processing operation.
- Each pixel packet has associated sideband information 410 and payload information 420 .
- Exemplary sideband information includes a valid field 412 , kill field 414 , tag field, and an instruction field 416 that includes a current instruction.
- Exemplary pixel packet 430 includes a first set of (s,t) texture coordinates 422 and 424 fields along with a fog field 426 .
- Exemplary pixel packet 460 includes a color field 462 and a second set of (s,t) texture coordinate fields 464 and 466.
- each pixel packet represents payload information 420 in fixed-point representation.
- Examples of pixel attributes that may be included in a pixel packet with a pixel packet size of 20 bits for pixel attributes include: one Z.16 sixteen bit Z depth value; one 16 bit S/T texture coordinate and a 4 bit level of detail; a pair of color values, each with 8 bit precision; or packed 5555 ARGB color with five bits for each of the A, R, G, and B components.
- Sideband information for a pixel packet may include the (x,y) location of a pixel.
- a start span command is generated by raster stage 310 at an (x,y) origin where it starts to walk across a triangle along a scan line.
- the use of a start span command permits an (x,y) location to be omitted from pixel packets.
- the start span command informs other entities (e.g., data write stage 355 and data fetch stage 330 ) of an initial (x,y) location at the start of a scan line.
- the (x,y) position of other pixels along the scan line can be inferred by the number of pixels a given pixel is away from the origin.
- data write stage 355 and data fetch stage 330 include local caches adapted to increment local counters and update an (x,y) location based on a calculation of the number of pixels that they encounter after the start span command.
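The way a stage can recover (x,y) from a start span command plus a local counter can be sketched as below. The class name `SpanTracker` is hypothetical, and the sketch assumes the scan line is walked in the +x direction with exactly one counted step per pixel.

```python
class SpanTracker:
    """Hypothetical per-stage counter that infers (x, y) after a start span command."""

    def __init__(self):
        self.origin = None
        self.count = 0

    def start_span(self, x, y):
        # The start span command supplies the only explicit (x, y) location.
        self.origin = (x, y)
        self.count = 0

    def next_pixel(self):
        # Assumption: pixels arrive in order along +x, one step per pixel.
        if self.origin is None:
            raise RuntimeError("no start span command received")
        x0, y0 = self.origin
        location = (x0 + self.count, y0)
        self.count += 1
        return location


tracker = SpanTracker()
tracker.start_span(10, 7)
print([tracker.next_pixel() for _ in range(3)])
```

Because every stage that needs a location keeps its own counter, the pixel packets themselves do not have to carry (x,y) fields.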
- raster stage 310 generates at least one row 510 of pixel packets for each pixel that is to be processed.
- each row 510 has common sideband information 410 defining an instruction sequence for the row 510 . If more than one row 510 is required for a pixel, the rows 510 are organized as a group 520 of rows that are processed in succession with each new clock cycle.
- 80 bit pixel data is partitioned into four 20 bit pixel attribute register values, with the four pixel register values defining a “row” 510 of a pixel packet (K, R0, R1, R2, and R3) for a pixel.
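A minimal sketch of partitioning 80 bits into four 20 bit register values follows. Placing R0 in the least significant bits is an assumption for illustration; the text does not state a bit order, and the function names are hypothetical.

```python
MASK20 = (1 << 20) - 1


def split_row(pixel_data_80):
    """Split an 80-bit integer into four 20-bit values [R0, R1, R2, R3].

    Assumption: R0 occupies the least significant 20 bits.
    """
    if not 0 <= pixel_data_80 < (1 << 80):
        raise ValueError("expected an 80-bit unsigned value")
    return [(pixel_data_80 >> (20 * i)) & MASK20 for i in range(4)]


def join_row(registers):
    """Inverse of split_row: combine four 20-bit values into 80 bits."""
    value = 0
    for i, reg in enumerate(registers):
        value |= (reg & MASK20) << (20 * i)
    return value


regs = [0x12345, 0xABCDE, 0x00001, 0xFFFFF]
print([hex(r) for r in split_row(join_row(regs))])
```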
- An iterator register pool (not shown) of raster stage 310 has corresponding registers to support the rows 510 of pixel packets.
- raster stage 310 includes a register pool supporting up to 4 rows of pixel packets. Some types of pixel packet attributes, such as texture, may require a high precision. Conversely, some types of pixel packet attributes may require less precision, such as colors.
- the register pool can be arranged to support high precision and low precision values for each pixel packet in a row 510 . In one embodiment the register pool includes 4 high precision and 4 low precision perspective correct iterated values per row, plus Z depth values. This permits, for example, software to assign the precision of the iterator for processing a particular pixel packet attribute.
- raster stage 310 includes a register pool adapted to keep track of an integer portion of texture, permitting fractional bits of texture to be sent as data packets.
- Raster stage 310 may, for example, receive instructions from the host that require an operation to be performed on a pixel. In response, raster stage 310 generates one or more rows 510 of pixel packets having associated instruction sequences, with the pixel packet rows and instructions arranged to perform the desired processing operation. As described below in more detail, in one embodiment ALU stage 340 permits scalar arithmetic operations to be performed in which the operands include a pre-selected subset of pixel attributes within a row 510 of pixel packets, constant values, and temporarily stored results of previous calculations on pixel packets.
- a variety of graphics operations can be formulated as one or more scalar arithmetic operations. Additionally, a variety of vector graphics operations can be formulated as a plurality of scalar arithmetic operations.
- the programmable graphics processor 205 of the present invention may be programmed to perform any graphics operation on a pixel that can be expressed as a sequence of scalar arithmetic operations, such as a fog operation, color (alpha) blending, texture combine, alpha test, or depth test, such as those described in the Open GL® Graphics System: A Specification ( Version 1.2), the contents of which are hereby incorporated by reference.
- raster stage 310 may use a programmable mapping table or mapping algorithm to determine an assignment of pixel packets and associated instructions for performing scalar arithmetic operations required to implement the graphics function on a pixel.
- the mapping may, for example, be programmed by graphics processor management application 280 .
- gatekeeper stage 320 performs a data flow control function.
- gatekeeper stage 320 has an associated scoreboard 325 for scheduling, load balancing, resource allocation, and hazard avoidance of pixel packets.
- Scoreboard 325 tracks the entry and retirement of pixels. Pixel packets entering gatekeeper stage 320 set the scoreboard and the scoreboard is reset as the pixel packets drain out of programmable processor 205 after completion of processing.
- scoreboard 325 may maintain a table for each pixel of the display to monitor pixels.
- Scoreboard 325 provides several benefits. For example, scoreboard 325 prevents a hazard where one pixel in a triangle is on top of another pixel being processed and in flight. In one embodiment, scoreboard 325 monitors idle conditions and clocks off idle units using scoreboarding information. For example, if there are no valid pixels, scoreboard 325 may turn off the ALUs to save power. As described below in more detail, the scoreboard 325 tracks pixel packets that are capable of being processed by ALUs 350 along with those having a kill bit set such that the pixel packet flows through ALUs 350 without active processing. In one embodiment, scoreboard 325 tracks (x,y) positions of recirculated pixel packets.
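The set-on-entry, reset-on-retirement behavior and the same-location hazard check can be sketched as below. This is a simplified model: keying the scoreboard by (x,y) in a Python set and the method names are assumptions, not the patent's hardware design.

```python
class Scoreboard:
    """Hypothetical scoreboard keyed by (x, y) screen location."""

    def __init__(self):
        self.in_flight = set()

    def try_enter(self, xy):
        """Admit a pixel unless a pixel at the same location is still in flight."""
        if xy in self.in_flight:
            return False  # hazard: caller must stall this pixel
        self.in_flight.add(xy)
        return True

    def retire(self, xy):
        """Reset the entry once the pixel has drained out of the pipeline."""
        self.in_flight.discard(xy)

    def alus_may_clock_off(self):
        """With no pixels in flight, idle units could be clocked off."""
        return not self.in_flight


sb = Scoreboard()
print(sb.try_enter((3, 4)), sb.try_enter((3, 4)))
sb.retire((3, 4))
print(sb.try_enter((3, 4)))
```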
- scoreboard 325 increments the instruction sequence in the pixel packet in a subsequent pass to the next instruction for the pixel, e.g., if the instruction is for a fog operation on pass number 1, the instruction is iterated to an alpha blending operation on pass number 2.
- a data fetch stage 330 fetches data for pixel packets passed on by gatekeeper 320 . This may include, for example, fetching color, depth, and texture data by performing appropriate color, depth, or texture data reads for each row of pixel packets.
- the data fetch stage 330 may, for example, fetch pixel or texel data by requesting a read from a memory interface (e.g., reading a framebuffer (not shown) using DMA engine 230 ).
- data fetch stage 330 may also manage a local cache, such as a texture/fog cache 332 , a color/depth cache 334 , and a Z cache for depth data (not shown).
- data fetch stage 330 includes an instruction random access memory (RAM) with instructions for accessing data required by the pixel packet attribute fields.
- data fetch stage 330 also performs a Z depth test. In this embodiment, data fetch stage 330 compares the Z depth value of a pixel packet to stored Z values using one or more depth comparison tests. If the Z depth value of the pixel indicates that the pixel is occluded, the kill bit is set.
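A depth comparison that sets the kill bit might look like the sketch below. The dictionary-based packet, the table of comparison names, and the convention that a smaller Z is nearer to the viewer are all assumptions made for this example.

```python
import operator

# Hypothetical comparison table; a packet survives when the test passes.
DEPTH_TESTS = {
    "less": operator.lt,
    "lequal": operator.le,
    "greater": operator.gt,
    "always": lambda new_z, stored_z: True,
}


def depth_test(packet, z_buffer, test="less"):
    """Set the kill bit when the packet's Z fails the selected comparison."""
    stored_z = z_buffer[packet["xy"]]
    if not DEPTH_TESTS[test](packet["z"], stored_z):
        packet["kill"] = True
    return packet


z_buffer = {(0, 0): 100}
occluded = depth_test({"xy": (0, 0), "z": 150, "kill": False}, z_buffer)
visible = depth_test({"xy": (0, 0), "z": 50, "kill": False}, z_buffer)
print(occluded["kill"], visible["kill"])
```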
- ALU stage 340 has a set of ALUs 350 including at least one ALU 350, such as ALUs 350-0, 350-1, 350-2, and 350-3. While four ALUs 350 are illustrated, more or fewer ALUs 350 may be used in ALU stage 340 depending upon the application.
- An individual ALU 350 reads the current instruction for at least one row of a pixel packet 510 and implements any instruction to perform a scalar arithmetic operation that it is programmed to support. Instructions are included in each ALU 350 and may, for example, be stored on a local instruction RAM (not shown in FIG. 3 ).
- Each ALU 350 includes instructions for performing at least one arithmetic operation on a first product of operands (a*b) and a second product of operands (c*d) where a, b, c, and d are operands and * is a multiplication. Some or all of the operands may correspond, for example, to register value attributes within a row 510 of a pixel packet.
- An ALU 350 may also have one or more operand values that are constant or software loadable. In some embodiments, an ALU may support using temporarily stored results from previous operations on pixel packets.
- each ALU 350 is programmable.
- a crossbar (not shown) or other programmable selector may be included within an ALU 350 to permit the operands and the destination of a result to be selected in response to an instruction from software (e.g. software application 270 ).
- an operation command code may be used to select the source of each operand (a, b, c, d) from attributes of any register value within a row 510 of pixel packets, temporary values, and constant values.
- the operation command also instructs an ALU 350 where to send the result of the arithmetic operation, such as updating a pixel packet with the result, saving the result as a temporary value, or both updating a pixel packet with the result and saving the result as a temporary value.
- an ALU can be programmed to read specific attributes within a pixel packet as operands and apply the scalar arithmetic operation indicated by the current instruction.
- the operation command code can also include commands to complement operands (e.g., calculate 1 − x, where x is the read value), negate operands (e.g., calculate −x, where x is the read value), or clamp an operand or a result.
- Other examples of operation command codes may include, for example, a command to select a data format.
- An example of an arithmetic operation performed by an ALU 350 is a scalar arithmetic operation of the form (a*b)+(c*d) on at least one variable within a pixel packet where a, b, c, and d are operands and the * operation is a multiplication.
- Each ALU 350 preferably may also be programmed to perform other mathematical operations such as complementing operands and negating operands. Additionally, in some embodiments, each ALU 350 may calculate minimum and maximum values from (a*b, c*d), and perform logical comparisons (e.g., a logical result if a*b is equal to, not equal to, less than, or less than or equal to c*d).
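The scalar operation (a*b)+(c*d), together with the complement, negate, clamp, minimum, and maximum options, can be modelled with a short function. Floating point is used here only for readability, whereas the hardware described works on fixed-point values. The function signature, the operation names, and the order in which complement and negate are applied are assumptions.

```python
def clamp(x, lo=0.0, hi=1.0):
    """Force x into the range [lo, hi]."""
    return max(lo, min(hi, x))


def alu(a, b, c, d, op="mad", complement=(), negate=(), clamp_result=False):
    """Hypothetical model of one ALU operation on four scalar operands.

    complement / negate name the operands ("a".."d") to replace by 1 - x or -x.
    """
    operands = {"a": a, "b": b, "c": c, "d": d}
    for name in complement:
        operands[name] = 1.0 - operands[name]
    for name in negate:
        operands[name] = -operands[name]
    first = operands["a"] * operands["b"]
    second = operands["c"] * operands["d"]
    if op == "mad":
        result = first + second
    elif op == "min":
        result = min(first, second)
    elif op == "max":
        result = max(first, second)
    else:
        raise ValueError(f"unsupported op: {op}")
    return clamp(result) if clamp_result else result


# Alpha blend as a single pass: src*alpha + dst*(1 - alpha).
src, dst, alpha = 1.0, 0.0, 0.25
print(alu(src, alpha, dst, alpha, complement=("d",)))
```

The blend example shows why the complement option matters: it lets a common graphics formula fit the two-product form without a separate subtraction step.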
- each ALU 350 may also include instructions for determining whether to generate a kill bit in kill field 414 based on a test, such as a comparison of a*b and c*d (e.g., kill if a*b not equal to c*d, kill if a*b is equal to c*d, kill if a*b less than c*d, or kill if a*b is greater than or equal to c*d).
- Examples of ALU operations that may generate a kill bit include an alpha test in which a color value is compared to a test color value, such as the expression IF (alpha>alpha reference), then kill the pixel, where alpha is a color value, and alpha reference is a reference color value.
- Another example of an ALU operation that may generate a kill bit is a Z depth test where the Z value of a pixel is compared to at least one Z value of a previous pixel having the same location and the pixel is killed if the depth test indicates that the pixel is occluded.
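The kill conditions listed above compare the two products a*b and c*d. The sketch below encodes those four comparisons and expresses the alpha test "kill if alpha > alpha reference" in that form. The table keys and function names are hypothetical, and mapping the alpha test onto the "less than" comparison by swapping the operands is one possible encoding, not necessarily the one used in the patent.

```python
import operator

# The four kill comparisons named in the text, applied to (a*b, c*d).
KILL_TESTS = {
    "ne": operator.ne,
    "eq": operator.eq,
    "lt": operator.lt,
    "ge": operator.ge,
}


def kill_test(a, b, c, d, test):
    """Return True when the pixel packet should have its kill bit set."""
    return KILL_TESTS[test](a * b, c * d)


def alpha_test(alpha, alpha_ref):
    """Kill if alpha > alpha_ref, written as (alpha_ref * 1) < (alpha * 1)."""
    return kill_test(alpha_ref, 1.0, alpha, 1.0, "lt")


print(alpha_test(0.8, 0.5), alpha_test(0.2, 0.5))
```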
- an individual ALU 350 is disabled in regards to processing a pixel packet if the kill bit is set in a pixel packet.
- a clock gating mechanism is used to disable ALU 350 when a kill bit is detected in the sideband information.
- the ALUs 350 do not waste power on the pixel packet as it propagates through ALU stage 340 .
- a pixel packet with a kill bit set still propagates onwards, permitting it to be accounted for by data write stage 355 and scoreboard 325 . This permits all pixel packets to be accounted for by scoreboard 325 , even those pixel packets marked by a kill bit as requiring no further ALU processing.
- if any row 510 of a pixel is marked by a kill bit, other rows 510 of the same pixel are also killed. This may be accomplished, for example, by forwarding kill information between stages or by one or more stages keeping track of pixels in which a row 510 is marked by a kill bit. In some embodiments, once a kill bit is set, only the sideband information 410 (which includes the kill bit) for a row 510 of pixel packets propagates on to the next stage.
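The treatment of killed packets can be sketched in software as follows: a killed row still flows through the ALU but no arithmetic is performed on it, and a kill on one row is spread to the other rows of the same pixel. The early return only stands in for clock gating, and the dictionary layout of a row is an assumption.

```python
def alu_pass(row, operation):
    """Apply operation to the payload unless the kill bit is set."""
    if row["kill"]:
        return row  # stands in for clock gating: the row passes through untouched
    row["payload"] = operation(row["payload"])
    return row


def propagate_kill(group):
    """If any row of a pixel is killed, mark every row of that pixel as killed."""
    if any(row["kill"] for row in group):
        for row in group:
            row["kill"] = True
    return group


group = [{"kill": False, "payload": 2}, {"kill": True, "payload": 5}]
group = propagate_kill(group)
print([alu_pass(row, lambda v: v * 10) for row in group])
```

The killed rows still reach the end of the pipeline, which is what allows the data write stage and the scoreboard to account for them.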
- the output of ALU stage 340 goes to data write stage 355 .
- the data write stage 355 converts processed pixel packets into pixel data and writes the result to a memory interface (e.g., via DMA engine 230 ).
- write values for a pixel are accumulated in write buffer 352 and the accumulated writes for a pixel are written to memory in a batch. Examples of functions that data write stage 355 may perform include color and depth writeback, and format conversion. In some embodiments, data write stage 355 may also identify pixels to be killed and set the kill bit.
- a recirculation path 360 is included to recirculate pixel packets back to gatekeeper 320 .
- Recirculation path 360 permits, for example, processes requiring a sequence of arithmetic operations to be performed using more than one pass through ALU stage 340 .
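Multi-pass processing over the recirculation path can be pictured as a loop in which each pass executes the packet's current instruction and then advances to the next one. In this simplified sketch one instruction runs per pass, and the field names `pc`, `instructions`, and `value` are assumptions for illustration.

```python
def process_with_recirculation(packet, alu_stage):
    """Run a packet through the ALU stage once per instruction in its sequence."""
    passes = 0
    while packet["pc"] < len(packet["instructions"]):
        name = packet["instructions"][packet["pc"]]
        packet["value"] = alu_stage[name](packet["value"])
        # On recirculation the instruction sequence advances to the next entry.
        packet["pc"] += 1
        passes += 1
    return packet, passes


# Placeholder arithmetic standing in for real fog and blend programs.
alu_stage = {
    "fog": lambda v: v * 0.5,
    "alpha_blend": lambda v: v + 0.25,
}
packet = {"instructions": ["fog", "alpha_blend"], "pc": 0, "value": 1.0}
print(process_with_recirculation(packet, alu_stage))
```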
- Data write stage 355 indicates retired writes to gatekeeper stage 320 for scoreboarding.
- FIG. 6 is a block diagram of an exemplary individual ALU 350 .
- ALU 350 has an input bus 605 with data buses for receiving a row 510 of a pixel packet in corresponding registers R0, R1, R2, and R3.
- An instruction RAM 610 is included for ALU instructions.
- An exemplary set of instructions is illustrated in block 620 .
- ALU 350 may be programmed to read any one of the four 20 bit register values from a row 510 and select a set of operands from row 510 .
- ALU 350 may be programmed to select as operands temporary values from registers (T) 630 , such as two 20 bit temporary values per ALU 350 , which are temporarily saved from a previous result, as indicated by path 640 .
- ALU 350 may also select as operands constant values (not shown), which may also be programmed by software.
- a first stage of multiplexers (MUXs) 645 selects operands from the row of pixel packets, any temporary values 630 , and any constant values (not shown).
- Format conversion modules 650 may be included to convert the operands into a desired data format suitable for the ALU's 350 computational precision in the arithmetic computation unit 670 .
- ALU 350 includes elements to permit each operand or its complement to be selected in a second stage of MUXs 660 .
- the resulting four operands are input to a scalar arithmetic computation unit 670 that can perform two multiplications and an addition.
- the resultant value may be optionally clamped to a desired range (e.g., 0 to 1.0) using a clamper 680.
- the row 510 of pixel packets exits on buses 690 .
- selected pixel packet attributes may be in a one sign 1.8 (S1.8) format.
- the S1.8 format is a base 2 number with an 8 bit fraction that is in the range of [−2, +2).
- the S1.8 format permits a higher dynamic range for calculations. For example, in calculations dealing with lighting, the S1.8 format permits increased dynamic range, resulting in improved realism. If a result of a scalar arithmetic operation performed in S1.8 must be in the range of [0,1], the result may be clamped to force the result into the range [0,1]. As an illustrative example, a shading calculation for color data may be performed in the S1.8 format and the result then clamped.
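A small model of the S1.8 format and the clamp to [0,1] is given below. It assumes a 10 bit two's complement encoding (sign bit, one integer bit, eight fraction bits), which matches the stated half-open range of −2 to +2 but is not spelled out in the text. Saturating on conversion is also an assumption.

```python
FRAC_BITS = 8
MIN_RAW = -(1 << 9)      # represents -2.0
MAX_RAW = (1 << 9) - 1   # represents +2.0 - 1/256
ONE = 1 << FRAC_BITS     # represents 1.0


def to_s1_8(x):
    """Convert a real number to a raw S1.8 integer, saturating at the range limits."""
    raw = round(x * ONE)
    return max(MIN_RAW, min(MAX_RAW, raw))


def from_s1_8(raw):
    """Convert a raw S1.8 integer back to a float."""
    return raw / ONE


def clamp01(raw):
    """Clamp a raw S1.8 value into the range [0, 1]."""
    return max(0, min(ONE, raw))


# An intermediate lighting result above 1.0 is representable, then clamped.
bright = to_s1_8(1.5)
print(from_s1_8(bright), from_s1_8(clamp01(bright)))
```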
- different types of pixel packets may have data attributes represented in different formats.
- color data may be represented in a first type of pixel packet in S1.8 format whereas (s,t) texture data may be represented in a second type of pixel packet by a high precision 16 bit format.
- the pixel packet bit size is set by the bit size requirement of the highest precision pixel attributes. For example, since texture attributes typically require greater precision than color, the pixel packet size may be set to represent texture data with a high level of precision, such as 16 bit texture data.
- the improved dynamic range of the S1.8 format permits, for example, efficient packing of data for more than one color component into a 20 bit pixel packet size selected for higher precision texture data requiring, for example, 16 bits for texture data and a 4 bit level of detail (LOD). For example, since each S1.8 color component requires ten bits, two color components may be packed into a 20 bit pixel packet.
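Packing two ten bit S1.8 components into one 20 bit value can be sketched as below. Putting the first component in the low ten bits and using two's complement for negative values are assumptions; the function names are hypothetical.

```python
MASK10 = 0x3FF


def pack_colors(raw0, raw1):
    """Pack two signed 10-bit raw values into a 20-bit word (raw0 in the low bits)."""
    return ((raw1 & MASK10) << 10) | (raw0 & MASK10)


def sign_extend10(value):
    """Interpret a 10-bit field as a two's complement signed integer."""
    return value - 0x400 if value & 0x200 else value


def unpack_colors(word20):
    """Recover the two signed 10-bit raw values from a 20-bit word."""
    return sign_extend10(word20 & MASK10), sign_extend10((word20 >> 10) & MASK10)


word = pack_colors(256, -64)  # 1.0 and -0.25 in S1.8
print(hex(word), unpack_colors(word))
```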
- FIG. 7 illustrates an exemplary ALU stage 340 that includes more than one ALU 350 arranged as a pipeline in which two or more ALUs 350 are chained together.
- an individual ALU 350 may be programmed to read one or more operands from a pixel packet, generate a result of an arithmetic operation, and update either a pixel packet or a temporary register with the result.
- Each ALU may be assigned to read operands, generate arithmetic results, and update one or more pixel packets or temporary values before passing on a row of pixel packets to the next ALU.
- ALU stage 340 includes at least one ALU 350 for each color channel (e.g., red, green, blue, and alpha). This permits, for example, load balancing in which the ALUs are configured to operate in parallel upon a row of pixel packets 510 (though at different points in time due to pipelining) to perform similar or different processing tasks.
- a first ALU 350-0 may be programmed to perform calculations for a first color component
- a second ALU 350-1 may be programmed to perform operations for a second color component
- a third ALU 350-2 may be programmed to perform operations for a third color component
- a fourth ALU 350-3 may be programmed to perform a fog operation.
- each ALU 350 may be assigned different processing tasks for a row of pixel packets 510 .
- software may configure the ALUs 350 to select a data flow of ALUs 350 within ALU stage 340 , including an execution order of the ALUs 350 .
- the data flow along a chain of ALUs may be arranged so that the results of one ALU 350-0 update one or more pixel packet registers which are read as operands by a subsequent ALU 350-1.
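The chained data flow, in which one ALU writes a register that the next ALU reads, can be illustrated with the sketch below. Each hypothetical ALU computes (a*b)+(c*d) from four named registers of the row and writes the result to a chosen destination register. The helper names and the use of a dictionary for the row are assumptions.

```python
def make_alu(dest, sources):
    """Build an ALU that writes (a*b)+(c*d) of the named registers into dest."""
    def alu(row):
        a, b, c, d = (row[name] for name in sources)
        updated = dict(row)  # the row moves on with one register updated
        updated[dest] = a * b + c * d
        return updated
    return alu


def run_chain(row, chain):
    """Pass a row through the ALUs in their configured execution order."""
    for alu in chain:
        row = alu(row)
    return row


row = {"R0": 0.5, "R1": 0.5, "R2": 1.0, "R3": 0.0}
chain = [
    make_alu("R3", ("R0", "R1", "R2", "R3")),  # first ALU writes R3
    make_alu("R0", ("R3", "R2", "R1", "R2")),  # second ALU reads the new R3
]
print(run_chain(row, chain))
```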
- FIG. 8 is a block diagram of an embodiment of a portion of a programmable graphics processor 205 having a reconfigurable pipeline in which the process flow of pixel packets through the stages is configurable in response to software commands, such as software commands from graphics processor management application 280 .
- Distributors 890 and 895 coupled to respective inputs and outputs of elements of the stages permit the process flow of pixel packets to be reconfigured.
- the stages may include, for example, a data fetch stage 830, data write stage 855, and individual ALUs 850, although it will be understood that other types of stages may also be reconfigured using distributors 890 and 895.
- software may dynamically reconfigure the process flow of pixel packets through the stages.
- a synchronization technique is thus preferably utilized to coordinate the data flow of pixel packets that are in flight during the change over from one configuration to another, i.e., performing a synchronization such that pixel packets in flight that are intended to be processed in a first configuration complete their processing before the configuration is changed to a second configuration.
- a data fetch stage 830 , data write stage 855 , and individual ALU's 850 have respective inputs each connected to first distributor 890 and respective outputs each connected to second distributor 895 .
- Each distributor 890 and 895 may, for example, comprise switches, crossbars, routers, or a MUX circuit to select a distribution flow of incoming pixel packets to data fetch stage 830 , ALUs 850 , and data write stage 855 .
- the distributors 890 and 895 determine the data path of incoming pixel packets 810 through data fetch stage 830 , data write stage 855 , and individual ALUs 850 .
- Signal inputs 892 and 894 permit distributors 890 and 895 to receive software commands (e.g., from a software application running on a CPU) to reconfigure the distribution of pixel packets between the data fetch stage 830 , data write stage 855 , and ALUs 850 .
- One example of a reconfiguration is assigning an execution order of the ALUs 850 .
- Another example of a reconfiguration is bypassing data fetch stage 830 if it is determined that the data fetch stage is not required for a certain type of processing task.
- As an illustrative example, there may be instances where it is more efficient to operate on a texture coordinate prior to a data fetch, in which case the data flow is arranged to have data fetch stage 830 receive pixel packets after the ALU 850 performing the texture operation.
- one benefit of a reconfigurable pipeline is that a software application can reconfigure the programmable graphics processor 205 to increase efficiency.
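- The reconfiguration examples above (assigning an execution order, bypassing the data fetch stage, placing an ALU ahead of the data fetch) can be illustrated with a small software model. It is a sketch under stated assumptions: the distributors are reduced to an ordered list of stage names, and the stage functions, the `STAGES` table, and the texture contents are all hypothetical.

```python
def data_fetch(packet):
    # Hypothetical fetch: look up a texel using the packet's texture coordinate.
    texture = {0: 10, 1: 20, 2: 30, 3: 40}
    return {**packet, "texel": texture[packet["s"]]}


def alu_texcoord(packet):
    # Hypothetical texture-coordinate operation: offset the coordinate, then wrap.
    return {**packet, "s": (packet["s"] + 1) % 4}


def data_write(packet):
    return {**packet, "written": True}


STAGES = {"fetch": data_fetch, "alu0": alu_texcoord, "write": data_write}


def run_pipeline(packet, route):
    """Send a packet through the stages in the order given by the route."""
    for name in route:
        packet = STAGES[name](packet)
    return packet


pkt = {"s": 1}
fetch_first = run_pipeline(pkt, ["fetch", "alu0", "write"])
alu_first = run_pipeline(pkt, ["alu0", "fetch", "write"])  # ALU runs before the fetch
bypass = run_pipeline(pkt, ["alu0", "write"])              # data fetch bypassed
print(fetch_first, alu_first, bypass)
```

- With the ALU placed first, the fetch uses the modified coordinate; with the fetch bypassed, no texel field is added at all.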
- raster stage 310 generates rows 510 of pixel packets for processing.
- the rows 510 may be further arranged into a group 520 of rows, such as a sequence of four rows 510 , that are passed on for processing in successive clock cycles.
- some operations that can be performed on a row 510 of pixel packets may require the result of an arithmetic operation of another row of pixel packets. Consequently, in one embodiment raster stage 310 arranges pixel packets in a group 520 of rows to account for data dependencies.
- the group 520 is arranged so that the pixel packet having the dependent texture operation is placed in a later row.
- pixels are alternately assigned by raster stage 310 as either odd or even.
- Corresponding registers (R 0 , R 1 , R 2 , and R 3 ) for each row of a pixel are correspondingly assigned as even or odd.
- Even rows 905 of pixel packets for even pixels and odd rows 910 for odd pixels are then interleaved utilizing one or more rules to avoid data dependencies. Interleaving every other row provides an additional clock cycle to account for ALU latency.
- Row 0 for the even pixel requires two clock cycles to generate a resultant required by Row 1 of the even pixel
- the interleaving of Row 0 for the odd pixel provides the additional clock cycle of time required by the ALU latency.
- Row 0 for the even pixel is a blending operation and Row 1 for the same pixel corresponds to a blend with second texture requiring the result of the first blending operation. If the ALU latency for the first operation is two clock cycles, then interleaving permits the results of the blending operation to be available for the texture with blend operation.
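- The effect of interleaving on a two-cycle ALU latency can be checked with a small scheduling model. This is a sketch, assuming the worst case in which every row depends on the previous row of the same pixel and one row issues per clock; `interleave` and `stall_cycles` are hypothetical helper names.

```python
from itertools import chain, zip_longest


def interleave(even_rows, odd_rows):
    """Alternate rows of the even pixel with rows of the odd pixel."""
    pairs = zip_longest(even_rows, odd_rows)
    return [row for row in chain.from_iterable(pairs) if row is not None]


def stall_cycles(schedule, latency):
    """Count idle cycles when each row needs the previous row of the same pixel."""
    ready_at = {}  # pixel -> cycle at which its latest result becomes available
    cycle = 0
    stalls = 0
    for pixel, _row in schedule:
        start = max(cycle, ready_at.get(pixel, 0))
        stalls += start - cycle
        ready_at[pixel] = start + latency
        cycle = start + 1
    return stalls


even = [("even", r) for r in range(4)]
odd = [("odd", r) for r in range(4)]

back_to_back = stall_cycles(even + odd, latency=2)
interleaved = stall_cycles(interleave(even, odd), latency=2)
print(back_to_back, interleaved)
```

- In this model the back-to-back schedule stalls once per dependent row, while the interleaved schedule never stalls, since the other pixel's row fills the extra cycle.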
- sideband information is preferably included to coordinate the interleaved data flow.
- sideband information in each pixel packet includes an even/odd field to distinguish even and odd rows.
- Each ALU 350 may also include two sets of temporary registers corresponding to temporary registers for even pixels and odd pixels to provide an appropriate temporary value for even/odd pixel packets.
- the even/odd field is used to select the appropriate set of temporary registers, e.g., even temporary registers are selected for even pixels whereas an odd set of temporary registers is selected for odd pixels.
- constant registers are shared by both even and odd pixels to reduce the total amount of storage needs for constant values used for both even and odd pixels.
- the software host may set the temporary registers at a constant value for an extended period of time to emulate constant registers. While an interleaving of two pixels is one implementation, it will be understood that the interleaving may be further extended to interleave more than two pixels if, for example, ALU latency corresponds to more than two clock cycles.
- ALU latency is taken into account by hardware, reducing the burden on software to account for ALU latency that would otherwise occur if for example, raster stage 310 did not interleave pixels.
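- The register arrangement for interleaved pixels can be sketched as follows. The sketch assumes one bank of temporary registers per value of the even/odd sideband field and a single constant bank shared by both; the class name `ALURegisters` and the register names are hypothetical.

```python
class ALURegisters:
    """Two temporary register banks selected by parity, plus shared constants."""

    def __init__(self, constants):
        self.constants = dict(constants)       # shared by even and odd pixels
        self.temps = {"even": {}, "odd": {}}   # one bank per even/odd field value

    def write_temp(self, parity, name, value):
        self.temps[parity][name] = value

    def read(self, parity, name):
        bank = self.temps[parity]
        return bank[name] if name in bank else self.constants[name]


regs = ALURegisters({"C0": 7})
regs.write_temp("even", "T0", 1)
regs.write_temp("odd", "T0", 2)
print(regs.read("even", "T0"), regs.read("odd", "T0"), regs.read("odd", "C0"))
```

- Keeping the temporaries separate means an interleaved odd row cannot overwrite a value the even pixel still needs, while the constants are stored only once.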
- each ALU 350 may be substantially identical.
- a particular ALU may be configured to have more than one place in the data flow, e.g., a different execution order. Consequently, an identifier needs to be provided in each ALU 350 to indicate its place within the data flow.
- the identifier may, for example, be provided to each ALU 350 by a direct register write technique of each ALU 350 .
- this approach has the disadvantage of requiring significant software overhead. Consequently, in one embodiment a packet technique is utilized to trigger elements requiring configuration information to discover their relative location within the process flow and write a corresponding identifier in a local register.
- the register address space of the ALUs 350 is software configurable using a packet initialization technique to communicate an identification (ID) to each ALU 350 using data packets.
- Each ALU 350 may, for example, include conventional network modules for receiving and forwarding data packets.
- an ID packet 1010 is initiated by a software application.
- the ID packet 1010 contains an initial ID code, such as a number.
- the ID packet 1010 is injected in the graphics pipeline at a point before elements requiring an ID code and then is passed on to subsequent elements of the process flow defined by the current pipeline configuration.
- a configuration register 1020 in a first ALU 350 receives the ID packet, writes the current value of the ID code into the configuration register and then increments the ID code of the ID packet before passing the ID packet onto the next ALU. This process is continued, with each subsequent ALU 350 writing the current value of the ID code into its configuration register, and then passing on the ID packet with an incremented ID code to the next ALU.
- stages along the data flow path may also have configuration registers set in a similar manner.
- the elements in a configuration flow may also include a data fetch stage or a data write stage that also have configuration registers set by reading an ID packet and which increment the ID code before passing the ID packet with the incremented ID code to the next element in the configuration flow.
- graphics processor management application 280 needs only generate an initial ID packet 1010 , such as by issuing a command to generate an ID packet 1010 via host interface 220 that is received by an ID packet generator 1030 .
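- The ID packet technique amounts to a short loop over the elements of the current process flow. The following is a minimal software sketch of that behavior, assuming the ID code is an integer starting at 0 and the packet is a dictionary; `Element`, `receive_id_packet`, and `assign_ids` are hypothetical names.

```python
class Element:
    """A pipeline element (ALU, data fetch, data write) with a configuration register."""

    def __init__(self, name):
        self.name = name
        self.config_register = None

    def receive_id_packet(self, packet):
        # Write the current ID code locally, then forward an incremented copy.
        self.config_register = packet["id"]
        return {"id": packet["id"] + 1}


def assign_ids(flow, initial_id=0):
    """Inject one ID packet ahead of the flow and pass it from element to element."""
    packet = {"id": initial_id}
    for element in flow:
        packet = element.receive_id_packet(packet)
    return packet


fetch, alu0, alu1, write = (Element(n) for n in ("fetch", "alu0", "alu1", "write"))
final = assign_ids([fetch, alu0, alu1, write])
first_ids = [e.config_register for e in (fetch, alu0, alu1, write)]

# After a reconfiguration, a new ID packet renumbers the elements in the new order.
assign_ids([alu1, alu0, fetch, write])
second_ids = [e.config_register for e in (fetch, alu0, alu1, write)]
print(first_ids, second_ids)
```

- Software issues only the single initial packet; each element learns its position from the value it receives, so no per-element register write is needed.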
- ID codes are written into the configuration registers using a broadcast packet technique to trigger elements requiring configuration registers to be written to discover their ID.
- the elements e.g., ALUs 350
- a broadcast packet technique is useful, for example, in embodiments in which a pipeline is branched to permit branches of the pipeline to process pixels in parallel.
- FIG. 11 illustrates an embodiment that includes a diagnostic monitoring capability.
- there is a sequence of taps along elements of graphics processor 205 such as taps associated with each ALU 350 and data fetch stage 330 . Taps may also be included at other stages.
- a configurable test point selector 1105 is adapted to permit selected taps, such as two taps 1120 and 1130 , to be monitored in response to a software command, such as a software command from graphics processor management application 280 .
- Configurable test point selector 1105 may, for example, be implemented using multiplexers.
- at least one counter 1110 is included for statistics collection of each selected test point.
- an instrumentation packet generated by software provides information on the taps to be monitored and enables counting for the selected test points.
- an instrument register may be included to gate statistics collection on and off based on the operation mode of the pipeline (e.g., an instrument register may be provided to permit software to enable counting for specific types of graphics operations, such as enabling statistical counting when alphablending operations occur).
- configurable test point selector 1105 permits software, such as graphics processor management application 280 , to have statistical data collected for only test points of interest, reducing the hardware complexity and cost while still allowing software to analyze any portion of the behavior of programmable processor 205 .
- the test points of interest may, for example, be selected to collect statistics associated with those ALUs 350 processing specific kinds of data, such as ALUs 350 processing texture data. Additionally, the statistics collection may be enabled for specific graphics operations, such as alphablending.
- configurable test point selector 1105 utilizes a three-wire protocol.
- Each element, such as an ALU 350 - 0 , that has valid payload data generates a valid signal, which may, for example, flow down to the next element (e.g., ALU 350 - 1 ).
- An element that is ready to receive a payload generates a ready signal, which may, for example, flow up to the previous element.
- the element generates a not ready signal, which may, for example, correspond to not asserting the ready signal.
- An enable signal corresponds to an element being enabled for monitoring, such as by software control via a pipelined register write to a monitoring enable control bit stored adjacent to the point being monitored. The signal may be tapped off directly from an element generating the signal or from elements receiving these signals.
- the valid, ready, and not-ready signals at selected tap points can be used to determine an operating state.
- a transfer state corresponds to a clock tick having a valid payload (i.e., the valid bit set) for data flowing downstream and a ready signal from the downstream block indicating that it is ready to receive the data (e.g., at tap point 1120 , a valid signal from ALU- 0 and a ready signal from ALU- 1 at tap point 1130 ).
- a wait state corresponds to a clock tick with a valid payload that is blocked because the block below is not ready to receive data (e.g., at tap point 1120 , a valid signal from ALU- 0 and a not ready signal from ALU- 1 at tap point 1130 ).
- statistics on selected tap points may be collected, such as counting the number of clock cycles that a transfer state and a wait state are detected.
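- The state classification and counting described above can be sketched in software. This assumes one (valid, ready) sample per clock tick taken at the selected tap points, with "not ready" modeled as the ready signal being deasserted; `count_states` is a hypothetical helper name, and the sample trace is invented for illustration.

```python
def count_states(samples):
    """Count transfer and wait states from per-clock (valid, ready) samples."""
    counts = {"transfer": 0, "wait": 0}
    for valid, ready in samples:
        if valid and ready:
            counts["transfer"] += 1   # payload moves downstream this tick
        elif valid and not ready:
            counts["wait"] += 1       # payload blocked by the downstream element
    return counts


# Illustrative trace: valid from the upstream tap, ready from the downstream tap.
trace = [
    (True, True),
    (True, False),
    (True, False),
    (False, True),
    (True, True),
    (False, False),
]
stats = count_states(trace)
print(stats)
```

- Ticks without a valid payload are counted as neither state, so a high wait count relative to transfers points at a downstream bottleneck rather than an idle pipeline.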
- Embodiments of the present invention provide a variety of benefits that are useful in an embedded graphics processor core 250 .
- power, space, and CPU capabilities may be comparatively limited.
- ALU's 350 are clock gated when processing is not required (e.g., by detecting a kill bit), reducing processing power requirements.
- the raster stage 310 needs only generate pixel packets for the subset of pixel data that is to be processed, also reducing power requirements.
- the programmable ALU stage 340 requires a smaller chip area than a conventional pipeline with dedicated stages for performing dedicated graphics functions, reducing cost.
- the programmable processor 205 may be implemented as blocks that are configurable by software, providing improved efficiency. Test monitoring may be configured to test a subset of test points, reducing bandwidth and analysis requirements by software.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Graphics (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Image Generation (AREA)
Abstract
A configurable graphics pipeline has more than one possible process flow of pixel packets through elements of the graphics pipeline. In one embodiment, a data packet triggers an element of the graphics pipeline to discover an identifier.
Description
- This is a continuation application of U.S. application Ser. No. 10/846,106, filed May 14, 2004, which application is hereby incorporated by reference in its entirety.
- The present invention is generally related to programmable processors. More particularly, the present invention is directed towards low power programmable processors for graphics applications.
- The generation of three-dimensional graphical images is of interest in a variety of electronic games and other applications. Conventionally, some of the steps used to create a three-dimensional image of a scene include generating a three-dimensional model of objects to be displayed. Geometrical primitives (e.g., triangles) are formed which are mapped to a two-dimensional projection along with depth information. Rendering (drawing) primitives includes interpolating parameters, such as depth and color, over each two-dimensional projection of a primitive.
- Graphics Processing Units (GPUs) are commonly used in graphics systems to generate three-dimensional images in response to instructions from a central processing unit. Modern GPUs typically utilize a graphics pipeline for processing data.
- FIG. 1 is a prior art drawing of a traditional pipeline architecture which is a “deep” pipeline having stages dedicated to performing specific functions. A transform stage 105 performs geometrical calculations of primitives and may also perform a clipping operation. A setup/raster stage 110 rasterizes the primitives. A texture address 115 and texture fetch 120 stage are utilized for texture mapping. A fog stage 130 implements a fog algorithm. An alpha test stage 135 performs an alpha test. A depth test 140 performs a depth test for culling occluded pixels. An alpha blend stage 145 performs an alpha blend color combination algorithm. A memory write stage 150 writes the output of the pipeline.
- The traditional GPU pipeline architecture illustrated in FIG. 1 is typically optimized for fast texturing using the OpenGL™ graphics language. A benefit of a deep pipeline architecture is that it permits fast, high quality rendering of even complex scenes.
- There is an increasing interest in utilizing three-dimensional graphics in wireless phones, personal digital assistants (PDAs), and other devices where cost and power consumption are important design requirements. However, the traditional deep pipeline architecture requires a significant chip area, resulting in greater cost than desired. Additionally, a deep pipeline consumes significant power, even if the stages are performing comparatively little processing. This is because many of the stages consume about the same amount of power regardless of whether they are processing pixels.
- As a result of cost and power considerations, the conventional deep pipeline architecture illustrated in FIG. 1 is unsuitable for many graphics applications, such as implementing three-dimensional games on wireless phones and PDAs.
- Therefore, what is desired is a processor architecture suitable for graphics processing applications but with reduced power and size requirements.
- A configurable graphics pipeline has more than one possible process flow of pixel packets through elements of the graphics pipeline. A data packet triggers an element of the graphics pipeline to discover an identifier.
- In one embodiment of a method, a received data packet triggers elements of the graphics pipeline to discover an identifier for each element indicative of the location of the element within the process flow. Each element writes an identifier in a configuration register indicative of its relative location within the process flow. In one embodiment, an element reads a current value of an identifier in a data packet, writes the current value to a configuration register, increments the identifier, and forwards the data packet with an incremented identifier to the next element of the process flow.
- The invention is more fully appreciated in connection with the following detailed description taken in conjunction with the accompanying drawings, in which:
- FIG. 1 is a diagram of a prior art pipeline for three-dimensional graphics;
- FIG. 2 is a block diagram of an integrated circuit including a programmable graphics processor in accordance with one embodiment of the present invention;
- FIG. 3 is a block diagram of a programmable graphics processor in accordance with one embodiment of the present invention;
- FIG. 4 illustrates exemplary pixel packets in accordance with one embodiment of the present invention;
- FIG. 5 illustrates an exemplary arrangement of pixel packets into rows of a group of pixel packets in accordance with one embodiment of the present invention;
- FIG. 6 is a block diagram of a single Arithmetic Logic Unit in accordance with one embodiment of the present invention;
- FIG. 7 is a block diagram of a sequence of two Arithmetic Logic Units in accordance with one embodiment of the present invention;
- FIG. 8 is a block diagram of a configurable programmable graphics processor in accordance with one embodiment of the present invention;
- FIG. 9 illustrates interleaving of rows of pixel packets in accordance with one embodiment of the present invention;
- FIG. 10 is a block diagram illustrating Arithmetic Logic Units having configuration registers in accordance with one embodiment of the present invention; and
- FIG. 11 is a block diagram illustrating a configurable test point selector in accordance with one embodiment of the present invention.
- Like reference numerals refer to corresponding parts throughout the several views of the drawings.
- FIG. 2 is a block diagram of one embodiment of the present invention. A programmable graphics processor 205 is coupled to a register interface 210, a host interface 220, and a memory interface, such as a direct memory access (DMA) engine 230 for memory read/write operations with a graphics memory (not shown), such as a frame buffer. Host interface 220 permits programmable graphics processor 205 to receive commands for generating graphical images from a host. For example, the host may send vertex data, commands and program instructions to programmable graphics processor 205. A memory interface, such as a DMA engine 230, permits read/write operations to be performed with a graphics memory (not shown). Register interface 210 provides an interface for interfacing with registers of programmable graphics processor 205.
- Programmable graphics processor 205 may be implemented as part of a system 290 that includes at least one other central processing unit 260 executing a software application 270 that acts as the host for programmable graphics processor 205. An exemplary system 290 may, for example, comprise a handheld unit, such as a cell phone or personal digital assistant (PDA). For example, software application 270 may include a graphics application 275 for generating graphical images on a display 295. Additionally, as described below in more detail, in some embodiments software application 270 may include graphics processor management software application 280 for performing management functions associated with programmable graphics processor 205, such as, for example, pipeline re-configuration, register configuration, and testing.
- In one embodiment, programmable graphics processor 205, register interface 210, host interface 220, and DMA engine 230 are part of an embedded graphics processing core 250 formed on a single integrated circuit 200 which includes a host, such as an integrated circuit 200 formed on a chip including a central processing unit 260 having software 270 resident on a memory. Alternatively, graphics processing core 250 may be disposed on a first integrated circuit and CPU 260 disposed on a second integrated circuit.
- FIG. 3 is a block diagram illustrating in more detail a programmable graphics processor 205 in accordance with one embodiment of the present invention. It includes a setup stage 305, a raster stage 310, a gatekeeper stage 320, a data fetch stage 330, Arithmetic Logic Unit (ALU) stage 340, a data write stage 355, and a recirculation path 360. In one embodiment, programmable graphics processor 205 includes ALUs 350 configured to execute a shader program to implement three-dimensional graphics operations such as a texture combine, fog, alpha blend (e.g., color blending), alpha test (e.g., color test), Z depth test, or other shading algorithms. However, it will be understood throughout the following discussion that programmable graphics processor 205 may also be configured to perform other types of processing operations.
- A setup stage 305 receives instructions from a host, such as a software application running on integrated circuit 200. In one embodiment, setup stage 305 performs the functions of geometrical transformation of coordinates (X-form), clipping, and setup. The setup unit takes vertex information (e.g., x, y, z, color and/or texture attributes) and applies a user defined view transform to calculate screen space coordinates for each geometrical primitive (hereinafter described as triangles because primitives are typically implemented as triangles), which is then sent to the raster stage 310 to draw the given triangle. A vertex buffer 308 may be included to provide a buffer for vertex data used by setup stage 305. In one embodiment, setup stage 305 sets up barycentric coefficients. In one implementation, setup stage 305 is a floating point Very Long Instruction Word (VLIW) machine that supports 32-bit IEEE floating point, S15.16 fixed point and packed 0.8 formats.
- Raster stage 310 receives data from setup stage 305 regarding triangles that are to be rendered (e.g., converted into pixels). In some embodiments, an instruction RAM (not shown) may, for example, be included in raster stage 310 for programming instructions for raster stage 310. Raster stage 310 processes each pixel of a given triangle and determines parameters that need to be calculated for a pixel as part of rendering, such as calculating color, texture, alpha-test, alpha-blend, z-depth test, and fog parameters. In one embodiment, raster stage 310 calculates barycentric coefficients for pixel packets. In a barycentric coordinate system, distances in a triangle are measured with respect to its vertices. The use of barycentric coefficients reduces the required dynamic range, which permits using fixed-point calculations that require less power than floating point calculations.
- Raster stage 310 generates at least one pixel packet for each pixel of a triangle that is to be processed. Each pixel packet includes fields for a payload of pixel attributes required for processing (e.g., color, texture, depth, fog, (x,y) location). Additionally, each pixel packet has associated sideband information including an instruction sequence of operations to be performed on the pixel packet. An instruction area in raster stage 310 (not shown) assigns instructions to pixel packets.
- FIG. 4 illustrates exemplary pixel packets 430 and 460. In one embodiment, raster stage 310 partitions pixel attributes into two or more different types of pixel packets.
- Each pixel packet has associated sideband information 410 and payload information 420. Exemplary sideband information includes a valid field 412, kill field 414, tag field, and an instruction field 416 that includes a current instruction. Exemplary pixel packet 430 includes a first set of (s,t) texture coordinate fields 422 and 424 along with a fog field 426. Exemplary pixel packet 460 includes a color field 462, and a second set of texture coordinates (s, t) 464 and 466. In one embodiment, each pixel packet represents payload information 420 in fixed-point representation. Examples of pixel attributes that may be included in a pixel packet with a pixel packet size of 20 bits for pixel attributes include: one Z.16 sixteen-bit Z depth value; one 16-bit S/T texture coordinate and a 4-bit level of detail; a pair of color values, each with 8-bit precision; or packed 5555 ARGB color with five bits in each ARGB variable.
- Sideband information for a pixel packet may include the (x,y) location of a pixel. However, in one embodiment, a start span command is generated by raster stage 310 at an (x,y) origin where it starts to walk across a triangle along a scan line. The use of a start span command permits an (x,y) location to be omitted from pixel packets. The start span command informs other entities (e.g., data write stage 355 and data fetch stage 330) of an initial (x,y) location at the start of a scan line. The (x,y) position of other pixels along the scan line can be inferred by the number of pixels a given pixel is away from the origin. In one embodiment, data write stage 355 and data fetch stage 330 include local caches adapted to increment local counters and update an (x,y) location based on a calculation of the number of pixels that they encounter after the span start command.
- Referring to FIG. 5, in one embodiment, raster stage 310 generates at least one row 510 of pixel packets for each pixel that is to be processed. In some embodiments, each row 510 has common sideband information 410 defining an instruction sequence for the row 510. If more than one row 510 is required for a pixel, the rows 510 are organized as a group 520 of rows that are processed in succession with each new clock cycle. In one embodiment, 80 bit pixel data is partitioned into four 20 bit pixel attribute register values, with the four pixel register values defining a “row” 510 of a pixel packet (K, R0, R1, R2, and R3) for a pixel.
- An iterator register pool (not shown) of raster stage 310 has corresponding registers to support the rows 510 of pixel packets. In one implementation, raster stage 310 includes a register pool supporting up to 4 rows of pixel packets. Some types of pixel packet attributes, such as texture, may require a high precision. Conversely, some types of pixel packet attributes may require less precision, such as colors. The register pool can be arranged to support high precision and low precision values for each pixel packet in a row 510. In one embodiment the register pool includes 4 high precision and 4 low precision perspective correct iterated values per row, plus Z depth values. This permits, for example, software to assign the precision of the iterator for processing a particular pixel packet attribute. In one embodiment, raster stage 310 includes a register pool adapted to keep track of an integer portion of texture, permitting fractional bits of texture to be sent as data packets.
- Raster stage 310 may, for example, receive instructions from the host that require an operation to be performed on a pixel. In response, raster stage 310 generates one or more rows 510 of pixel packets having associated instruction sequences, with the pixel packet rows and instructions arranged to perform the desired processing operation. As described below in more detail, in one embodiment ALU stage 340 permits scalar arithmetic operations to be performed in which the operands include a pre-selected subset of pixel attributes within a row 510 of pixel packets, constant values, and temporarily stored results of previous calculations on pixel packets.
- A variety of graphics operations can be formulated as one or more scalar arithmetic operations. Additionally, a variety of vector graphics operations can be formulated as a plurality of scalar arithmetic operations. Thus, it will be understood that the programmable graphics processor 205 of the present invention may be programmed to perform any graphics operation on a pixel that can be expressed as a sequence of scalar arithmetic operations, such as a fog operation, color (alpha) blending, texture combine, alpha test, or depth test, such as those described in the OpenGL® Graphics System: A Specification (Version 1.2), the contents of which are hereby incorporated by reference. For example, in response to raster stage 310 detecting a desired graphics processing function to be performed on a pixel (e.g., a fog operation), raster stage 310 may use a programmable mapping table or mapping algorithm to determine an assignment of pixel packets and associated instructions for performing scalar arithmetic operations required to implement the graphics function on a pixel. The mapping may, for example, be programmed by graphics processor management application 280.
- Returning again to FIG. 3, as each pixel of a triangle is walked by raster stage 310, raster stage 310 generates pixel packets for further processing which are received by gatekeeper stage 320. Gatekeeper stage 320 performs a data flow control function. In one embodiment, gatekeeper stage 320 has an associated scoreboard 325 for scheduling, load balancing, resource allocation, and hazard avoidance of pixel packets. Scoreboard 325 tracks the entry and retirement of pixels. Pixel packets entering gatekeeper stage 320 set the scoreboard and the scoreboard is reset as the pixel packets drain out of programmable processor 205 after completion of processing. As an illustrative example, if a compact display 295 has an area of 128 by 32 pixels, scoreboard 325 may maintain a table for each pixel of the display to monitor pixels.
Scoreboard 325 provides several benefits. For example,scoreboard 325 prevents a hazard where one pixel in a triangle is on top of another pixel being processed and in flight. In one embodiment,scoreboard 325 monitors idle conditions and clocks off idle units using scoreboarding information. For example, if there are no valid pixels,scoreboard 325 may turn off the ALUs to save power. As described below in more detail, thescoreboard 325 tracks pixel packets that are capable of being processed byALUs 350 along with those having a kill bit set such that the pixel packet flows throughALUs 350 without active processing. In one embodiment,scoreboard 325 tracks (x,y) positions of recirculated pixel packets. If a pixel packet is recirculated,scoreboard 325 increments the instruction sequence in the pixel packet in a subsequent pass to the next instruction for the pixel, e.g., if the instruction is for a fog operation onpass number 1 the instructions is iterated to an alphablending operation onpass number 2. - A data fetch
stage 330 fetches data for pixel packets passed on bygatekeeper 320. This may include, for example, fetching color, depth, and texture data by performing appropriate color, depth, or texture data reads for each row of pixel packets. The data fetchstage 330 may, for example, fetch pixel or texel data by requesting a read from a memory interface (e.g., reading a framebuffer (not shown) using DMA engine 230). In one embodiment, data fetchstage 330 may also manage a local cache, such as a texture/fog cache 332, a color/depth cache 334, and a Z cache for depth data (not shown). Data that is fetched is placed onto a corresponding pixel packet field prior to sending the pixel packet on to the next stage. In one embodiment, data fetchstage 330 includes an instruction random access memory (RAM) with instructions for accessing data required by the pixel packet attribute fields. In some embodiments, data fetchstage 330 also performs a Z depth test. In this embodiment, data fetchstage 330 compares the Z depth value of a pixel packet to stored Z values using one or more depth comparison tests. If the Z depth value of the pixel indicates that the pixel is occluded, the kill bit is set. - The row of pixel packets enters an arithmetic logic unit (ALU)
stage 340 for processing.ALU stage 340 has a set ofALUs 350 including at least oneALU 350, such as ALUs 350-0, 350-1, 350-2, and 350-3. While fourALUs 350 are illustrated, more or less ALUs 350 may be used inALU stage 340 depending upon the application. Anindividual ALU 350 reads the current instruction for at least one row of apixel packet 510 and implements any instruction to perform a scalar arithmetic operation that it is programmed to support. Instructions are included in eachALU 350 and may, for example, be stored on a local instruction RAM (not shown inFIG. 3 ). - Each
ALU 350 includes instructions for performing at least one arithmetic operation on a first product of operands (a*b) and a second product of operands (c*d) where a, b, c, and d are operands and * is a multiplication. Some or all of the operands may correspond, for example, to register value attributes within a row 510 of a pixel packet. An ALU 350 may also have one or more operand values that are constant or software loadable. In some embodiments, an ALU may support using temporarily stored results from previous operations on pixel packets. - In one embodiment, each
ALU 350 is programmable. A crossbar (not shown) or other programmable selector may be included within an ALU 350 to permit the operands and the destination of a result to be selected in response to an instruction from software (e.g., software application 270). For example, in one embodiment, an operation command code may be used to select the source of each operand (a, b, c, d) from attributes of any register value within a row 510 of pixel packets, temporary values, and constant values. In this embodiment, the operation command also instructs an ALU 350 where to send the result of the arithmetic operation, such as updating a pixel packet with the result, saving the result as a temporary value, or both updating a pixel packet with the result and saving the result as a temporary value. Thus, for example, an ALU can be programmed to read specific attributes within a pixel packet as operands and apply the scalar arithmetic operation indicated by the current instruction. The operation command code can also include commands to complement operands (e.g., calculate 1−x, where x is the read value), negate operands (e.g., calculate −x, where x is the read value), or clamp an operand or a result. Other examples of operation command codes may include, for example, a command to select a data format. - An example of an arithmetic operation performed by an
ALU 350 is a scalar arithmetic operation of the form (a*b)+(c*d) on at least one variable within a pixel packet where a, b, c, and d are operands and the * operation is a multiplication. Each ALU 350 preferably may also be programmed to perform other mathematical operations such as complementing operands and negating operands. Additionally, in some embodiments, each ALU 350 may calculate minimum and maximum values from (a*b, c*d), and perform logical comparisons (e.g., a logical result if a*b is equal to, not equal to, less than, or less than or equal to c*d). - In some embodiments, each
ALU 350 may also include instructions for determining whether to generate a kill bit in kill field 414 based on a test, such as a comparison of a*b and c*d (e.g., kill if a*b not equal to c*d, kill if a*b is equal to c*d, kill if a*b less than c*d, or kill if a*b is greater than or equal to c*d). Examples of ALU operations that may generate a kill bit include an alpha test in which a color value is compared to a test color value, such as the expression IF (alpha>alpha reference), then kill the pixel, where alpha is a color value, and alpha reference is a reference color value. Another example of an ALU operation that may generate a kill bit is a Z depth test where the Z value of a pixel is compared to at least one Z value of a previous pixel having the same location and the pixel is killed if the depth test indicates that the pixel is occluded. - In one embodiment, an
individual ALU 350 is disabled with regard to processing a pixel packet if the kill bit is set in a pixel packet. In one embodiment, a clock gating mechanism is used to disable ALU 350 when a kill bit is detected in the sideband information. As a result, after a kill bit is generated for a pixel packet, the ALUs 350 do not waste power on the pixel packet as it propagates through ALU stage 340. However, note that a pixel packet with a kill bit set still propagates onwards, permitting it to be accounted for by data write stage 355 and scoreboard 325. This permits all pixel packets to be accounted for by scoreboard 325, even those pixel packets marked by a kill bit as requiring no further ALU processing. In one embodiment, if any row 510 of a pixel is marked by a kill bit, other rows 510 of the same pixel are also killed. This may be accomplished, for example, by forwarding kill information between stages or by one or more stages keeping track of pixels in which a row 510 is marked by a kill bit. In some embodiments, once a kill bit is set, only the sideband information 410 (which includes the kill bit) for a row 510 of pixel packets propagates on to the next stage. - The output of
ALU stage 340 goes to data write stage 355. The data write stage 355 converts processed pixel packets into pixel data and writes the result to a memory interface (e.g., via DMA engine 230). In one embodiment, write values for a pixel are accumulated in write buffer 352 and the accumulated writes for a pixel are written to memory in a batch. Examples of functions that data write stage 355 may perform include color and depth writeback, and format conversion. In some embodiments, data write stage 355 may also identify pixels to be killed and set the kill bit. - A
recirculation path 360 is included to recirculate pixel packets back to gatekeeper 320. Recirculation path 360 permits, for example, processes requiring a sequence of arithmetic operations to be performed using more than one pass through ALU stage 340. Data write stage 355 indicates retired writes to gatekeeper stage 320 for scoreboarding. -
FIG. 6 is a block diagram of an exemplary individual ALU 350. ALU 350 has an input bus 605 with data buses for receiving a row 510 of a pixel packet in corresponding registers R0, R1, R2, and R3. An instruction RAM 610 is included for ALU instructions. An exemplary set of instructions is illustrated in block 620. In one embodiment, ALU 350 may be programmed to read any one of the four 20 bit register values from a row 510 and select a set of operands from row 510. Additionally, ALU 350 may be programmed to select as operands temporary values from registers (T) 630, such as two 20 bit temporary values per ALU 350, which are temporarily saved from a previous result, as indicated by path 640. ALU 350 may also select as operands constant values (not shown), which may also be programmed by software. In one embodiment, a first stage of multiplexers (MUXs) 645 selects operands from the row of pixel packets, any temporary values 630, and any constant values (not shown). Format conversion modules 650 may be included to convert the operands into a desired data format suitable for the computational precision of ALU 350 in the arithmetic computation unit 670. ALU 350 includes elements to permit each operand or its complement to be selected in a second stage of MUXs 660. The resulting four operands are input to a scalar arithmetic computation unit 670 that can perform two multiplications and an addition. The resultant value may be optionally clamped to a desired range (e.g., 0 to 1.0) using a clamper 680. The row 510 of pixel packets exits on buses 690. - In one embodiment, selected pixel packet attributes may be in a one sign 1.8 (S1.8) format. The S1.8 format is a
base 2 number with an 8 bit fraction that is in the range of [−2 to +2). The S1.8 format permits a higher dynamic range for calculations. For example, in calculations dealing with lighting, the S1.8 format permits increased dynamic range, resulting in improved realism. If a result of a scalar arithmetic operation performed in S1.8 must be in the range of [0,1], the result may be clamped to force the result into the range [0,1]. As an illustrative example, a shading calculation for color data may be performed in the S1.8 format and the result then clamped. Note that in embodiments of the present invention, different types of pixel packets may have data attributes represented in different formats. For example, color data may be represented in a first type of pixel packet in S1.8 format whereas (s,t) texture data may be represented in a second type of pixel packet by a high precision 16 bit format. In some embodiments, the pixel packet bit size is set by the bit size requirement of the highest precision pixel attributes. For example, since texture attributes typically require greater precision than color, the pixel packet size may be set to represent texture data with a high level of precision, such as 16 bit texture data. The S1.8 format permits, for example, efficient packing of data for more than one color component into a 20 bit pixel packet size selected for higher precision texture data requiring, for example, 16 bits for texture data and a 4 bit level of detail (LOD). For example, since each S1.8 color component requires ten bits (one sign bit, one integer bit, and eight fraction bits), two color components may be packed into a 20 bit pixel packet. -
FIG. 7 illustrates an exemplary ALU stage 340 that includes more than one ALU 350 arranged as a pipeline in which two or more ALUs 350 are chained together. As previously described, an individual ALU 350 may be programmed to read one or more operands from a pixel packet, generate a result of an arithmetic operation, and update either a pixel packet or a temporary register with the result. Each ALU may be assigned to read operands, generate arithmetic results, and update one or more pixel packets or temporary values before passing on a row of pixel packets to the next ALU. - The flow of data between
ALUs 350 in ALU stage 340 may be configured in a variety of ways depending upon the processing operations to be performed, ALU latency, and efficiency considerations. As previously described, the present invention permits each ALU to be programmed to read selected operands within a row of pixel packets and update a selected pixel packet register with a result. In one embodiment, ALU stage 340 includes at least one ALU 350 for each color channel (e.g., red, green, blue, and alpha). This permits, for example, load balancing in which the ALUs are configured to operate in parallel upon a row of pixel packets 510 (though at different points in time due to pipelining) to perform similar or different processing tasks. As one example of how ALUs 350 may be programmed, a first ALU 350-0 may be programmed to perform calculations for a first color component, a second ALU 350-1 may be programmed to perform operations for a second color component, a third ALU 350-2 may be programmed to perform operations for a third color component, and a fourth ALU 350-3 may be programmed to perform a fog operation. Thus, in some embodiments each ALU 350 may be assigned different processing tasks for a row of pixel packets 510. Additionally, as described below in more detail, in some embodiments software may configure the ALUs 350 to select a data flow of ALUs 350 within ALU stage 340, including an execution order of the ALUs 350. However, since the data flow may be configured, it will be understood that in some embodiments the data flow along a chain of ALUs may be arranged so that the results of one ALU 350-0 update one or more pixel packet registers which are read as operands by a subsequent ALU 350-1. -
FIG. 8 is a block diagram of an embodiment of a portion of a programmable graphics processor 205 having a reconfigurable pipeline in which the process flow of pixel packets through the stages is configurable in response to software commands, such as software commands from graphics processor management application 280. Distributors 890 and 895 permit the process flow through the stages to be reconfigured. The stages include data fetch stage 830, data write stage 855, and individual ALUs 850, although it will be understood that other types of stages may also be reconfigured using distributors 890 and 895. - In one embodiment, a data fetch
stage 830, data write stage 855, and individual ALUs 850 have respective inputs each connected to first distributor 890 and respective outputs each connected to second distributor 895. Each distributor 890, 895 is coupled to data fetch stage 830, ALUs 850, and data write stage 855. The distributors 890 and 895 route incoming pixel packets 810 through data fetch stage 830, data write stage 855, and individual ALUs 850. Signal inputs 892 and 894 permit distributors 890 and 895 to receive commands for reconfiguring the process flow through data fetch stage 830, data write stage 855, and ALUs 850. One example of a reconfiguration is assigning an execution order of the ALUs 850. Another example of a reconfiguration is bypassing data fetch stage 830 if it is determined that the data fetch stage is not required for a certain time processing task. As still another example of reconfiguration, it may be desirable to change the order in which data fetch stage 830 is coupled to ALUs. As another example, it may be desirable to reorder the data write stage 855. As an illustrative example, there may be instances where it is more efficient to operate on a texture coordinate prior to a data fetch, in which case the data flow is arranged to have data fetch stage 830 receive pixel packets after the ALU 850 performing the texture operation. Thus, one benefit of a reconfigurable pipeline is that a software application can reconfigure the programmable graphics processor 205 to increase efficiency. - Referring again to
FIG. 5, as previously discussed, raster stage 310 generates rows 510 of pixel packets for processing. The rows 510 may be further arranged into a group 520 of rows, such as a sequence of four rows 510, that are passed on for processing in successive clock cycles. However, some operations that can be performed on a row 510 of pixel packets may require the result of an arithmetic operation of another row of pixel packets. Consequently, in one embodiment raster stage 310 arranges pixel packets in a group 520 of rows to account for data dependencies. As an illustrative example, if a texture operation on one pixel packet requires the result of another pixel packet in one row, the group 520 is arranged so that the pixel packet having the dependent texture operation is placed in a later row. - Referring to
FIG. 9, in one embodiment, pixels are alternately assigned by raster stage 310 as either odd or even. Corresponding registers (R0, R1, R2, and R3) for each row of a pixel are correspondingly assigned as even or odd. Even rows 905 of pixel packets for even pixels and odd rows 910 for odd pixels are then interleaved utilizing one or more rules to avoid data dependencies. Interleaving every other row provides an additional clock cycle to account for ALU latency. Thus, if Row 0 for the even pixel requires two clock cycles to generate a resultant required by Row 1 of the even pixel, the interleaving of Row 0 for the odd pixel provides the additional clock cycle of time required by the ALU latency. As an illustrative example, consider a multitexture operation where Row 0 for the even pixel is a blending operation and Row 1 for the same pixel corresponds to a blend with a second texture requiring the result of the first blending operation. If the ALU latency for the first operation is two clock cycles, then interleaving permits the results of the blending operation to be available for the texture with blend operation. - In an interleaved embodiment, sideband information is preferably included to coordinate the interleaved data flow. For example, in one embodiment sideband information in each pixel packet includes an even/odd field to distinguish even and odd rows. Each
ALU 350 may also include two sets of temporary registers corresponding to temporary registers for even pixels and odd pixels to provide an appropriate temporary value for even/odd pixel packets. The even/odd field is used to select the appropriate set of temporary registers, e.g., even temporary registers are selected for even pixels whereas an odd set of temporary registers is selected for odd pixels. In one embodiment, constant registers are shared by both even and odd pixels to reduce the total amount of storage needed for constant values used for both even and odd pixels. In one embodiment, the software host may set the temporary registers at a constant value for an extended period of time to emulate constant registers. While an interleaving of two pixels is one implementation, it will be understood that the interleaving may be further extended to interleave more than two pixels if, for example, ALU latency corresponds to more than two clock cycles. One benefit of having raster stage 310 interleave pixel packets is that ALU latency is taken into account by hardware, reducing the burden on software to account for ALU latency that would otherwise occur if, for example, raster stage 310 did not interleave pixels. - As previously discussed, in a configurable pipeline, the data flow within the ALUs 350 may be configured. For example, in hardware, each
ALU 350 may be substantially identical. However, a particular ALU may be configured to have more than one possible place in the data flow, e.g., a different execution order. Consequently, an identifier needs to be provided in each ALU 350 to indicate its place within the data flow. The identifier may, for example, be provided to each ALU 350 by a direct register write technique of each ALU 350. However, this approach has the disadvantage of requiring significant software overhead. Consequently, in one embodiment a packet technique is utilized to trigger elements requiring configuration information to discover their relative location within the process flow and write a corresponding identifier in a local register. - Referring to
FIG. 10, in one embodiment the register address space of the ALUs 350 is software configurable using a packet initialization technique to communicate an identification (ID) to each ALU 350 using data packets. Each ALU 350 may, for example, include conventional network modules for receiving and forwarding data packets. In one embodiment, an ID packet 1010 is initiated by a software application. The ID packet 1010 contains an initial ID code, such as a number. The ID packet 1010 is injected in the graphics pipeline at a point before elements requiring an ID code and then is passed on to subsequent elements of the process flow defined by the current pipeline configuration. In one embodiment, a configuration register 1020 in a first ALU 350 receives the ID packet, writes the current value of the ID code into the configuration register and then increments the ID code of the ID packet before passing the ID packet on to the next ALU. This process is continued, with each subsequent ALU 350 writing the current value of the ID code into its configuration register, and then passing on the ID packet with an incremented ID code to the next ALU. It will be understood that other stages along the data flow path may also have configuration registers set in a similar manner. For example, the elements in a configuration flow may also include a data fetch stage or a data write stage that also have configuration registers set by reading an ID packet and which increment the ID code before passing the ID packet with the incremented ID to the next element in the configuration flow. One benefit of this form of register configuration is that it requires no hardware differences between ALU 350 units, permitting software reconfiguration of the data flow through the pipeline. 
Thus, for example, in one embodiment graphics processor management application 280 need only generate an initial ID packet 1010, such as by issuing a command to generate an ID packet 1010 via host interface 220 that is received by an ID packet generator 1030. - In an alternate embodiment, ID codes are written into the configuration registers using a broadcast packet technique to trigger elements requiring configuration registers to be written to discover their ID. In this embodiment, the elements (e.g., ALUs 350) may use a network protocol to discover their ID. A broadcast packet technique is useful, for example, in embodiments in which a pipeline is branched to permit branches of the pipeline to process pixels in parallel.
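The ID packet technique described with reference to FIG. 10 may be illustrated by the following minimal software sketch. It is only a behavioral model under stated assumptions, not the hardware of the described embodiments: the names `IdPacket`, `Element`, `receive`, and `inject_id_packet` are hypothetical, and the process flow is modeled as a simple ordered list of elements.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IdPacket:
    """Hypothetical model of ID packet 1010: carries the current ID code."""
    id_code: int


@dataclass
class Element:
    """Hypothetical model of an ALU (or other stage) with a configuration register."""
    name: str
    config_register: Optional[int] = None

    def receive(self, packet: IdPacket) -> IdPacket:
        # Write the current ID code into the local configuration register,
        # then forward a packet carrying the incremented ID code.
        self.config_register = packet.id_code
        return IdPacket(packet.id_code + 1)


def inject_id_packet(flow: List[Element], initial_id: int = 0) -> int:
    """Pass one ID packet along the configured process flow.

    Returns the ID code carried by the packet after the last element.
    """
    packet = IdPacket(initial_id)
    for element in flow:
        packet = element.receive(packet)
    return packet.id_code


# Four identical elements; the execution order is an assumed example
# of a software-selected process flow.
alus = [Element(f"ALU 350-{i}") for i in range(4)]
flow = [alus[2], alus[0], alus[3], alus[1]]
next_id = inject_id_packet(flow)
```

In this sketch, software issues a single packet regardless of the number of elements, and each element learns its position from the packet alone, which is the property that avoids a direct register write of each ALU. Reordering `flow` and injecting a new packet reassigns the identifiers.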
-
FIG. 11 illustrates an embodiment that includes a diagnostic monitoring capability. In one embodiment, there is a sequence of taps along elements of graphics processor 205, such as taps associated with each ALU 350 and data fetch stage 330. Taps may also be included at other stages as well. A configurable test point selector 1105 is adapted to permit selected taps, such as two taps 1120 and 1130, to be monitored under the control of graphics processor management application 280. Configurable test point selector 1105 may, for example, be implemented using multiplexers. In one embodiment, at least one counter 1110 is included for statistics collection of each selected test point. In one embodiment, an instrumentation packet generated by software provides information on the taps to be monitored and enables counting for the selected test points. Additionally, an instrument register may be included to gate statistics collection on and off based on the operation mode of the pipeline (e.g., an instrument register may be provided to permit software to enable counting for specific types of graphics operations, such as enabling statistical counting when alphablending operations occur). One benefit of configurable test point selector 1105 is that it permits software, such as graphics processor management application 280, to have statistical data collected for only test points of interest, reducing the hardware complexity and cost while still allowing software to analyze any portion of the behavior of programmable processor 205. The test points of interest may, for example, be selected to collect statistics associated with those ALUs 350 processing specific kinds of data, such as ALUs 350 processing texture data. Additionally, the statistics collection may be enabled for specific graphics operations, such as alphablending. - In one embodiment, configurable
test point selector 1105 utilizes a three-wire protocol. Each element, such as an ALU 350-0, that has valid payload data generates a valid signal, which may, for example, flow down to the next element (e.g., ALU 350-1). An element that is ready to receive a payload generates a ready signal, which may, for example, flow up to the previous element. However, if an element is not ready to receive a payload, the element generates a not ready signal, which may, for example, correspond to not asserting the ready signal. An enable signal corresponds to an element being enabled for monitoring, such as by software control via a pipelined register write to a monitoring enable control bit stored adjacent to the point being monitored. The signal may be tapped off directly from an element generating the signal or from elements receiving these signals. - The valid, ready, and not-ready signals at selected tap points can be used to determine an operating state. A transfer state corresponds to a clock tick having a valid payload (i.e., the valid bit set) for data flowing downstream and a ready signal from the downstream block indicating that it is ready to receive the data (e.g., at
tap point 1120, a valid signal from ALU-0 and a ready signal from ALU-1 at tap point 1130). A wait state corresponds to a clock tick with a valid payload that is blocked because the block below is not ready to receive data (e.g., at tap point 1120, a valid signal from ALU-0 and a not ready signal from ALU-1 at tap point 1130). In this embodiment, statistics on selected tap points may be collected, such as counting the number of clock cycles that a transfer state and a wait state are detected. - Embodiments of the present invention provide a variety of benefits that are useful in an embedded
graphics processor core 250. In a system that is a compact, low power handheld system 290, power, space, and CPU capabilities may be comparatively limited. In one embodiment, ALUs 350 are clock gated when processing is not required (e.g., by detecting a kill bit), reducing processing power requirements. Additionally, the raster stage 310 need only generate pixel packets for the subset of pixel data that is processed, also reducing power requirements. The programmable ALU stage 340 requires a smaller chip area than a conventional pipeline with dedicated stages for performing dedicated graphics functions, reducing cost. The programmable processor 205 may be implemented as blocks that are configurable by software, providing improved efficiency. Test monitoring may be configured to test a subset of test points, reducing bandwidth and analysis requirements by software. These and other previously described features make the programmable graphics processor 205 of interest for use in an embedded graphics processor core 250. - The foregoing description, for purposes of explanation, used specific nomenclature to provide a thorough understanding of the invention. However, it will be apparent to one skilled in the art that specific details are not required in order to practice the invention. Thus, the foregoing descriptions of specific embodiments of the invention are presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed; obviously, many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. 
It is intended that the following claims and their equivalents define the scope of the invention.
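As a further illustration of the tap point statistics described with reference to FIG. 11, the following minimal sketch counts transfer states and wait states from per-clock samples of the valid and ready signals. It is a behavioral model only; the names `TapCounters` and `count_states`, the representation of a sample as a `(valid, ready)` pair, and the `enabled` flag standing in for the enable signal are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class TapCounters:
    """Hypothetical counters for one selected pair of tap points."""
    transfer: int = 0  # clock ticks with valid payload and downstream ready
    wait: int = 0      # clock ticks with valid payload but downstream not ready


def count_states(samples: Iterable[Tuple[bool, bool]],
                 enabled: bool = True) -> TapCounters:
    """Classify each clock tick from (valid, ready) samples.

    `valid` models the valid signal from the upstream element and
    `ready` models the ready signal from the downstream element.
    When `enabled` is False, nothing is counted.
    """
    counters = TapCounters()
    if not enabled:
        return counters
    for valid, ready in samples:
        if valid and ready:
            counters.transfer += 1
        elif valid:
            counters.wait += 1
        # Ticks without a valid payload are neither transfer nor wait.
    return counters


# One (valid, ready) sample per clock tick; the trace is invented for illustration.
trace = [(True, True), (True, False), (True, False),
         (False, True), (True, True), (False, False)]
stats = count_states(trace)
```

For the trace above, two ticks are transfer states and two are wait states; the two ticks without a valid payload are not counted in either category.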
Claims (15)
1. A method of configuring a plurality of Arithmetic Logic Units (ALUs) within a programmable ALU stage of a configurable graphics pipeline having more than one possible process flow of pixel packets through the ALUs of said programmable ALU stage, the method comprising:
initiating the generation of an identification packet within said programmable ALU stage,
the identification packet flowing between elements of the programmable ALU stage and triggering said ALUs of said graphics pipeline to discover an identifier for each ALU indicative of the location of the ALU within a process flow; and
wherein the generation of the initial identification packet triggers the ALUs to discover their relative location with the process flow without requiring a direct register write of each ALU.
2. The method of claim 1, wherein said identification packet triggering said ALUs is generated in response to a software command from a software entity.
3. The method of claim 1, wherein each ALU in response to the identification packet writes an identifier in a configuration register indicative of an execution order within said process flow.
4. The method of claim 1, wherein each ALU writes a current value of an identifier of said identification packet into a configuration register of the ALU, increments said identifier, and forwards said identification packet to the next ALU.
5. The method of claim 4, wherein a software entity initiates the identification packet to trigger the ALUs to discover their relative location with the process flow without requiring a direct register write of each ALU.
6. The method of claim 1, wherein a software entity initiates the identification packet to trigger the ALUs to discover their relative location with the process flow without requiring a direct register write of each ALU.
7. A graphics processor, comprising:
a programmable Arithmetic Logic Unit (ALU) stage for processing pixel packets, said ALU stage including a plurality of ALUs with each ALU including a configuration register to store an identifier indicative of an execution order of the ALU within a process flow, each ALU having a set of at least one possible arithmetic operation that is performed on an incoming pixel packet having a corresponding current instruction command;
a data fetch stage to fetch data for said pixel packets;
a data write stage to perform a memory write of pixel data of processed pixel packets received from said ALU stage;
a first distributor coupled to respective inputs of said ALU stage, said data fetch stage, and said data write stage; and
a second distributor coupled to respective outputs of said ALU stage, said data fetch stage, and said data write stage;
said first distributor and said second distributor adapted for reconfiguring a process flow of pixel packets through said data fetch stage, said ALU stage, and said data write stage in response to a command from a host;
wherein each ALU of said ALU stage is configured to receive an identification packet initiated by a software entity, each ALU writing a current value of an identifier of said identification packet into the configuration register of the ALU, incrementing said identifier, and forwarding said identification packet to the next ALU; wherein the software entity initiates the identification packet to trigger the ALUs to discover their relative location with the process flow without requiring a direct register write of each ALU.
8. The graphics processor of claim 7, wherein said ALU stage is configured to execute a shader program.
9. The graphics processor of claim 7, wherein the execution order determines a sequence in which a plurality of ALUs are programmed to read operands, generate arithmetic results, and update one or more pixel packets or temporary values before passing on a row of pixel packets to the next ALU.
10. The graphics processor of claim 7, wherein the execution order determines the processing task performed by each ALU.
11. A method of configuring a plurality of Arithmetic Logic Units (ALUs) within a programmable ALU stage of a configurable graphics pipeline having more than one possible process flow of pixel packets through the ALUs of said programmable ALU stage, the method comprising:
a software entity initiating the injection of an identification packet into said programmable ALU stage of said configurable graphics pipeline;
each successive ALU within the programmable ALU stage reading a current value of an identifier in said identification packet, writing said current value to a configuration register within the ALU, incrementing said identifier, and forwarding said identification packet with an incremented identifier to the next ALU of said programmable ALU stage; and
the identifier with each ALU being indicative of an execution order within a process flow with the software entity initiating the identification packet to trigger the ALUs to discover their relative location with the process flow without requiring a direct register write of each ALU.
12. The method of claim 11, wherein said identification packet is initiated in response to a software command from the software entity.
13. The method of claim 11, wherein said ALU stage is configured to execute a shader program.
14. The method of claim 11, wherein the execution order determines a sequence in which a plurality of ALUs are programmed to read operands, generate arithmetic results, and update one or more pixel packets or temporary values before passing on a row of pixel packets to the next ALU.
15. The method of claim 11, wherein the execution order determines the processing task performed by each ALU.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/115,789 US20080204461A1 (en) | 2004-05-14 | 2008-05-06 | Auto Software Configurable Register Address Space For Low Power Programmable Processor |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/846,106 US7389006B2 (en) | 2004-05-14 | 2004-05-14 | Auto software configurable register address space for low power programmable processor |
US12/115,789 US20080204461A1 (en) | 2004-05-14 | 2008-05-06 | Auto Software Configurable Register Address Space For Low Power Programmable Processor |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/846,106 Continuation US7389006B2 (en) | 2004-05-14 | 2004-05-14 | Auto software configurable register address space for low power programmable processor |
Publications (1)
Publication Number | Publication Date |
---|---|
US20080204461A1 (en) | 2008-08-28 |
Family
ID=35308978
Family Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/846,106 Active 2026-03-13 US7389006B2 (en) | 2004-05-14 | 2004-05-14 | Auto software configurable register address space for low power programmable processor |
US12/115,789 Abandoned US20080204461A1 (en) | 2004-05-14 | 2008-05-06 | Auto Software Configurable Register Address Space For Low Power Programmable Processor |
Family Applications Before (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/846,106 Active 2026-03-13 US7389006B2 (en) | 2004-05-14 | 2004-05-14 | Auto software configurable register address space for low power programmable processor |
Country Status (1)
Country | Link |
---|---|
US (2) | US7389006B2 (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070189397A1 (en) * | 2006-02-15 | 2007-08-16 | Samsung Electronics Co., Ltd. | Method and system for bit reorganization and packetization of uncompressed video for transmission over wireless communication channels |
US20070230461A1 (en) * | 2006-03-29 | 2007-10-04 | Samsung Electronics Co., Ltd. | Method and system for video data packetization for transmission over wireless channels |
US20090265744A1 (en) * | 2008-04-22 | 2009-10-22 | Samsung Electronics Co., Ltd. | System and method for wireless communication of video data having partial data compression |
US20120001927A1 (en) * | 2010-07-01 | 2012-01-05 | Advanced Micro Devices, Inc. | Integrated graphics processor data copy elimination method and apparatus when using system memory |
US8175041B2 (en) | 2006-12-14 | 2012-05-08 | Samsung Electronics Co., Ltd. | System and method for wireless communication of audiovisual data having data size adaptation |
US20130141455A1 (en) * | 2011-12-02 | 2013-06-06 | Novatek Microelectronics Corp. | Image dithering module |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7091982B2 (en) * | 2004-05-14 | 2006-08-15 | Nvidia Corporation | Low power programmable processor |
JPWO2006109623A1 (en) * | 2005-04-05 | 2008-11-06 | 松下電器産業株式会社 | Computer system, data structure representing configuration information, and mapping apparatus and method |
US9804995B2 (en) | 2011-01-14 | 2017-10-31 | Qualcomm Incorporated | Computational resource pipelining in general purpose graphics processing unit |
US9978343B2 (en) * | 2016-06-10 | 2018-05-22 | Apple Inc. | Performance-based graphics processing unit power management |
KR20180038793A (en) * | 2016-10-07 | 2018-04-17 | 삼성전자주식회사 | Method and apparatus for processing image data |
Citations (51)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4228497A (en) * | 1977-11-17 | 1980-10-14 | Burroughs Corporation | Template micromemory structure for a pipelined microprogrammable data processing system |
US4658354A (en) * | 1982-05-28 | 1987-04-14 | Nec Corporation | Pipeline processing apparatus having a test function |
US4816913A (en) * | 1987-11-16 | 1989-03-28 | Technology, Inc., 64 | Pixel interpolation circuitry as for a video signal processor |
US4845666A (en) * | 1985-07-01 | 1989-07-04 | Pixar | Computer system for processing binary numbering format and determining the sign of the numbers from their two most significant bits |
US4860248A (en) * | 1985-04-30 | 1989-08-22 | Ibm Corporation | Pixel slice processor with frame buffers grouped according to pixel bit width |
US5230039A (en) * | 1991-02-19 | 1993-07-20 | Silicon Graphics, Inc. | Texture range controls for improved texture mapping |
US5301272A (en) * | 1992-11-25 | 1994-04-05 | Intel Corporation | Method and apparatus for address space aliasing to identify pixel types |
US5526502A (en) * | 1992-03-30 | 1996-06-11 | Sharp Kabushiki Kaisha | Memory interface |
US5669010A (en) * | 1992-05-18 | 1997-09-16 | Silicon Engines | Cascaded two-stage computational SIMD engine having multi-port memory and multiple arithmetic units |
US5710577A (en) * | 1994-10-07 | 1998-01-20 | Lasermaster Corporation | Pixel description packet for a rendering device |
US5778250A (en) * | 1994-05-23 | 1998-07-07 | Cirrus Logic, Inc. | Method and apparatus for dynamically adjusting the number of stages of a multiple stage pipeline |
US5793386A (en) * | 1996-06-28 | 1998-08-11 | S3 Incorporated | Register set reordering for a graphics processor based upon the type of primitive to be rendered |
US5828402A (en) * | 1996-06-19 | 1998-10-27 | Canadian V-Chip Design Inc. | Method and apparatus for selectively blocking audio and video signals |
US5872991A (en) * | 1995-10-18 | 1999-02-16 | Sharp, Kabushiki, Kaisha | Data driven information processor for processing data packet including common identification information and plurality of pieces of data |
US5881077A (en) * | 1995-02-23 | 1999-03-09 | Sony Corporation | Data processing system |
US5929860A (en) * | 1996-01-11 | 1999-07-27 | Microsoft Corporation | Mesh simplification and construction of progressive meshes |
US5982457A (en) * | 1997-01-07 | 1999-11-09 | Samsung Electronics, Co. Ltd. | Radio receiver detecting digital and analog television radio-frequency signals with single first detector |
US6025854A (en) * | 1997-12-31 | 2000-02-15 | Cognex Corporation | Method and apparatus for high speed image acquisition |
US6028613A (en) * | 1997-03-20 | 2000-02-22 | S3 Incorporated | Method and apparatus for programming a graphics subsystem register set |
US6072500A (en) * | 1993-07-09 | 2000-06-06 | Silicon Graphics, Inc. | Antialiased imaging with improved pixel supersampling |
US6072508A (en) * | 1997-03-14 | 2000-06-06 | S3 Incorporated | Method and apparatus for shortening display list instructions |
US6091857A (en) * | 1991-04-17 | 2000-07-18 | Shaw; Venson M. | System for producing a quantized signal |
US6122726A (en) * | 1992-06-30 | 2000-09-19 | Discovision Associates | Data pipeline system and data encoding method |
US6124854A (en) * | 1995-05-08 | 2000-09-26 | The Box Worldwide Llc | Interactive video system |
US6157751A (en) * | 1997-12-30 | 2000-12-05 | Cognex Corporation | Method and apparatus for interleaving a parallel image processing memory |
US6259461B1 (en) * | 1998-10-14 | 2001-07-10 | Hewlett Packard Company | System and method for accelerating the rendering of graphics in a multi-pass rendering environment |
US6333744B1 (en) * | 1999-03-22 | 2001-12-25 | Nvidia Corporation | Graphics pipeline including combiner stages |
US20020107903A1 (en) * | 2000-11-07 | 2002-08-08 | Richter Roger K. | Methods and systems for the order serialization of information in a network processing environment |
US6446155B1 (en) * | 1999-06-30 | 2002-09-03 | Logitech Europe S. A. | Resource bus interface |
US6516032B1 (en) * | 1999-03-08 | 2003-02-04 | Compaq Computer Corporation | First-order difference compression for interleaved image data in a high-speed image compositor |
US6532515B1 (en) * | 2000-08-02 | 2003-03-11 | Ati International Srl | Method and apparatus for performing selective data reads from a memory |
US6597363B1 (en) * | 1998-08-20 | 2003-07-22 | Apple Computer, Inc. | Graphics processor with deferred shading |
US20040012600A1 (en) * | 2002-03-22 | 2004-01-22 | Deering Michael F. | Scalable high performance 3d graphics |
US6725463B1 (en) * | 1997-08-01 | 2004-04-20 | Microtune (Texas), L.P. | Dual mode tuner for co-existing digital and analog television signals |
US6731296B2 (en) * | 1999-05-07 | 2004-05-04 | Broadcom Corporation | Method and system for providing programmable texture processing |
US6734861B1 (en) * | 2000-05-31 | 2004-05-11 | Nvidia Corporation | System, method and article of manufacture for an interlock module in a computer graphics processing pipeline |
US6747660B1 (en) * | 2000-05-12 | 2004-06-08 | Microsoft Corporation | Method and system for accelerating noise |
US6751690B1 (en) * | 1999-08-17 | 2004-06-15 | Eric Swanson | Data converter with statistical domain output |
US6762763B1 (en) * | 1999-07-01 | 2004-07-13 | Microsoft Corporation | Computer system having a distributed texture memory architecture |
US6891544B2 (en) * | 2000-02-11 | 2005-05-10 | Sony Computer Entertainment Inc. | Game system with graphics processor |
US6972803B2 (en) * | 2003-09-10 | 2005-12-06 | Gennum Corporation | Video signal format detector and generator system and method |
US6980209B1 (en) * | 2002-06-14 | 2005-12-27 | Nvidia Corporation | Method and system for scalable, dataflow-based, programmable processing of graphics data |
US20060001779A1 (en) * | 2004-07-01 | 2006-01-05 | Pierre Favrat | Television receiver for digital and analog television signals |
US20060050077A1 (en) * | 2004-09-09 | 2006-03-09 | International Business Machines Corporation | Programmable graphics processing engine |
US20060075433A1 (en) * | 2002-12-04 | 2006-04-06 | Koninklijke Philips Electronics, N.V. | System and method for detecting services which can be provided by at least two different services sources |
US20060152519A1 (en) * | 2004-05-14 | 2006-07-13 | Nvidia Corporation | Method for operating low power programmable processor |
US7142214B2 (en) * | 2004-05-14 | 2006-11-28 | Nvidia Corporation | Data format for low power programmable processor |
US7199799B2 (en) * | 2004-05-14 | 2007-04-03 | Nvidia Corporation | Interleaving of pixels for low power programmable processor |
US7250953B2 (en) * | 2004-05-14 | 2007-07-31 | Nvidia Corporation | Statistics instrumentation for low power programmable processor |
US7268786B2 (en) * | 2004-05-14 | 2007-09-11 | Nvidia Corporation | Reconfigurable pipeline for low power programmable processor |
US7633506B1 (en) * | 2002-11-27 | 2009-12-15 | Ati Technologies Ulc | Parallel pipeline graphics system |
Patent Citations (52)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4228497A (en) * | 1977-11-17 | 1980-10-14 | Burroughs Corporation | Template micromemory structure for a pipelined microprogrammable data processing system |
US4658354A (en) * | 1982-05-28 | 1987-04-14 | Nec Corporation | Pipeline processing apparatus having a test function |
US4860248A (en) * | 1985-04-30 | 1989-08-22 | Ibm Corporation | Pixel slice processor with frame buffers grouped according to pixel bit width |
US4845666A (en) * | 1985-07-01 | 1989-07-04 | Pixar | Computer system for processing binary numbering format and determining the sign of the numbers from their two most significant bits |
US4816913A (en) * | 1987-11-16 | 1989-03-28 | Technology, Inc., 64 | Pixel interpolation circuitry as for a video signal processor |
US5230039A (en) * | 1991-02-19 | 1993-07-20 | Silicon Graphics, Inc. | Texture range controls for improved texture mapping |
US6091857A (en) * | 1991-04-17 | 2000-07-18 | Shaw; Venson M. | System for producing a quantized signal |
US5526502A (en) * | 1992-03-30 | 1996-06-11 | Sharp Kabushiki Kaisha | Memory interface |
US5669010A (en) * | 1992-05-18 | 1997-09-16 | Silicon Engines | Cascaded two-stage computational SIMD engine having multi-port memory and multiple arithmetic units |
US6122726A (en) * | 1992-06-30 | 2000-09-19 | Discovision Associates | Data pipeline system and data encoding method |
US5301272A (en) * | 1992-11-25 | 1994-04-05 | Intel Corporation | Method and apparatus for address space aliasing to identify pixel types |
US6072500A (en) * | 1993-07-09 | 2000-06-06 | Silicon Graphics, Inc. | Antialiased imaging with improved pixel supersampling |
US5778250A (en) * | 1994-05-23 | 1998-07-07 | Cirrus Logic, Inc. | Method and apparatus for dynamically adjusting the number of stages of a multiple stage pipeline |
US5710577A (en) * | 1994-10-07 | 1998-01-20 | Lasermaster Corporation | Pixel description packet for a rendering device |
US5881077A (en) * | 1995-02-23 | 1999-03-09 | Sony Corporation | Data processing system |
US6124854A (en) * | 1995-05-08 | 2000-09-26 | The Box Worldwide Llc | Interactive video system |
US5872991A (en) * | 1995-10-18 | 1999-02-16 | Sharp, Kabushiki, Kaisha | Data driven information processor for processing data packet including common identification information and plurality of pieces of data |
US5929860A (en) * | 1996-01-11 | 1999-07-27 | Microsoft Corporation | Mesh simplification and construction of progressive meshes |
US5828402A (en) * | 1996-06-19 | 1998-10-27 | Canadian V-Chip Design Inc. | Method and apparatus for selectively blocking audio and video signals |
US5793386A (en) * | 1996-06-28 | 1998-08-11 | S3 Incorporated | Register set reordering for a graphics processor based upon the type of primitive to be rendered |
US5982457A (en) * | 1997-01-07 | 1999-11-09 | Samsung Electronics, Co. Ltd. | Radio receiver detecting digital and analog television radio-frequency signals with single first detector |
US6072508A (en) * | 1997-03-14 | 2000-06-06 | S3 Incorporated | Method and apparatus for shortening display list instructions |
US6028613A (en) * | 1997-03-20 | 2000-02-22 | S3 Incorporated | Method and apparatus for programming a graphics subsystem register set |
US6725463B1 (en) * | 1997-08-01 | 2004-04-20 | Microtune (Texas), L.P. | Dual mode tuner for co-existing digital and analog television signals |
US6157751A (en) * | 1997-12-30 | 2000-12-05 | Cognex Corporation | Method and apparatus for interleaving a parallel image processing memory |
US6025854A (en) * | 1997-12-31 | 2000-02-15 | Cognex Corporation | Method and apparatus for high speed image acquisition |
US6597363B1 (en) * | 1998-08-20 | 2003-07-22 | Apple Computer, Inc. | Graphics processor with deferred shading |
US6259461B1 (en) * | 1998-10-14 | 2001-07-10 | Hewlett Packard Company | System and method for accelerating the rendering of graphics in a multi-pass rendering environment |
US6516032B1 (en) * | 1999-03-08 | 2003-02-04 | Compaq Computer Corporation | First-order difference compression for interleaved image data in a high-speed image compositor |
US6333744B1 (en) * | 1999-03-22 | 2001-12-25 | Nvidia Corporation | Graphics pipeline including combiner stages |
US6731296B2 (en) * | 1999-05-07 | 2004-05-04 | Broadcom Corporation | Method and system for providing programmable texture processing |
US6446155B1 (en) * | 1999-06-30 | 2002-09-03 | Logitech Europe S. A. | Resource bus interface |
US6762763B1 (en) * | 1999-07-01 | 2004-07-13 | Microsoft Corporation | Computer system having a distributed texture memory architecture |
US6751690B1 (en) * | 1999-08-17 | 2004-06-15 | Eric Swanson | Data converter with statistical domain output |
US6891544B2 (en) * | 2000-02-11 | 2005-05-10 | Sony Computer Entertainment Inc. | Game system with graphics processor |
US6747660B1 (en) * | 2000-05-12 | 2004-06-08 | Microsoft Corporation | Method and system for accelerating noise |
US6734861B1 (en) * | 2000-05-31 | 2004-05-11 | Nvidia Corporation | System, method and article of manufacture for an interlock module in a computer graphics processing pipeline |
US6532515B1 (en) * | 2000-08-02 | 2003-03-11 | Ati International Srl | Method and apparatus for performing selective data reads from a memory |
US20020107903A1 (en) * | 2000-11-07 | 2002-08-08 | Richter Roger K. | Methods and systems for the order serialization of information in a network processing environment |
US20040012600A1 (en) * | 2002-03-22 | 2004-01-22 | Deering Michael F. | Scalable high performance 3d graphics |
US6980209B1 (en) * | 2002-06-14 | 2005-12-27 | Nvidia Corporation | Method and system for scalable, dataflow-based, programmable processing of graphics data |
US7633506B1 (en) * | 2002-11-27 | 2009-12-15 | Ati Technologies Ulc | Parallel pipeline graphics system |
US20060075433A1 (en) * | 2002-12-04 | 2006-04-06 | Koninklijke Philips Electronics, N.V. | System and method for detecting services which can be provided by at least two different services sources |
US6972803B2 (en) * | 2003-09-10 | 2005-12-06 | Gennum Corporation | Video signal format detector and generator system and method |
US20060152519A1 (en) * | 2004-05-14 | 2006-07-13 | Nvidia Corporation | Method for operating low power programmable processor |
US7091982B2 (en) * | 2004-05-14 | 2006-08-15 | Nvidia Corporation | Low power programmable processor |
US7142214B2 (en) * | 2004-05-14 | 2006-11-28 | Nvidia Corporation | Data format for low power programmable processor |
US7199799B2 (en) * | 2004-05-14 | 2007-04-03 | Nvidia Corporation | Interleaving of pixels for low power programmable processor |
US7250953B2 (en) * | 2004-05-14 | 2007-07-31 | Nvidia Corporation | Statistics instrumentation for low power programmable processor |
US7268786B2 (en) * | 2004-05-14 | 2007-09-11 | Nvidia Corporation | Reconfigurable pipeline for low power programmable processor |
US20060001779A1 (en) * | 2004-07-01 | 2006-01-05 | Pierre Favrat | Television receiver for digital and analog television signals |
US20060050077A1 (en) * | 2004-09-09 | 2006-03-09 | International Business Machines Corporation | Programmable graphics processing engine |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070189397A1 (en) * | 2006-02-15 | 2007-08-16 | Samsung Electronics Co., Ltd. | Method and system for bit reorganization and packetization of uncompressed video for transmission over wireless communication channels |
US8665967B2 (en) | 2006-02-15 | 2014-03-04 | Samsung Electronics Co., Ltd. | Method and system for bit reorganization and packetization of uncompressed video for transmission over wireless communication channels |
US20070230461A1 (en) * | 2006-03-29 | 2007-10-04 | Samsung Electronics Co., Ltd. | Method and system for video data packetization for transmission over wireless channels |
US8175041B2 (en) | 2006-12-14 | 2012-05-08 | Samsung Electronics Co., Ltd. | System and method for wireless communication of audiovisual data having data size adaptation |
US20090265744A1 (en) * | 2008-04-22 | 2009-10-22 | Samsung Electronics Co., Ltd. | System and method for wireless communication of video data having partial data compression |
US8176524B2 (en) | 2008-04-22 | 2012-05-08 | Samsung Electronics Co., Ltd. | System and method for wireless communication of video data having partial data compression |
US20120001927A1 (en) * | 2010-07-01 | 2012-01-05 | Advanced Micro Devices, Inc. | Integrated graphics processor data copy elimination method and apparatus when using system memory |
US8760452B2 (en) * | 2010-07-01 | 2014-06-24 | Advanced Micro Devices, Inc. | Integrated graphics processor data copy elimination method and apparatus when using system memory |
US20130141455A1 (en) * | 2011-12-02 | 2013-06-06 | Novatek Microelectronics Corp. | Image dithering module |
US9041728B2 (en) * | 2011-12-02 | 2015-05-26 | Novatek Microelectronics Corp. | Image dithering module |
Also Published As
Publication number | Publication date |
---|---|
US20050253856A1 (en) | 2005-11-17 |
US7389006B2 (en) | 2008-06-17 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US7091982B2 (en) | Low power programmable processor | |
EP1759380B1 (en) | Low power programmable processor | |
US20080204461A1 (en) | Auto Software Configurable Register Address Space For Low Power Programmable Processor | |
US8248422B2 (en) | Efficient texture processing of pixel groups with SIMD execution unit | |
CN109426519B (en) | Inline data inspection for workload reduction | |
US6807620B1 (en) | Game system with graphics processor | |
KR101012625B1 (en) | Graphics processor with arithmetic and elementary function units | |
US6947047B1 (en) | Method and system for programmable pipelined graphics processing with branching instructions | |
US7941644B2 (en) | Simultaneous multi-thread instructions issue to execution units while substitute injecting sequence of instructions for long latency sequencer instruction via multiplexer | |
US8775777B2 (en) | Techniques for sourcing immediate values from a VLIW | |
US7710427B1 (en) | Arithmetic logic unit and method for processing data in a graphics pipeline | |
US8786618B2 (en) | Shader program headers | |
US7199799B2 (en) | Interleaving of pixels for low power programmable processor | |
US20190286430A1 (en) | Apparatus and method for efficiently accessing memory when performing a horizontal data reduction | |
Kim et al. | Homogeneous stream processors with embedded special function units for high-utilization programmable shaders | |
US7268786B2 (en) | Reconfigurable pipeline for low power programmable processor | |
US7250953B2 (en) | Statistics instrumentation for low power programmable processor | |
US7142214B2 (en) | Data format for low power programmable processor | |
US20180218474A1 (en) | Method and apparatus for adaptive pixel hashing for graphics processors | |
US8599208B2 (en) | Shared readable and writeable global values in a graphics processor unit pipeline | |
US8427490B1 (en) | Validating a graphics pipeline using pre-determined schedules |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: NVIDIA CORPORATION, CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: HUTCHINS, EDWARD A.; ANGELL, BRIAN K.; SIGNING DATES FROM 20040513 TO 20040514; REEL/FRAME: 020905/0684 |
STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |