GB2493425A

GB2493425A - Constructing an acceleration structure

Info

Publication number: GB2493425A
Application number: GB1212642.1A
Authority: GB
Inventors: Kirill Vladimirovich Garanzha; Jacopo Pantaleoni; David Kirk Mcallister
Original assignee: Nvidia Corp
Current assignee: Nvidia Corp
Priority date: 2011-08-04
Filing date: 2012-07-16
Publication date: 2013-02-06
Also published as: CN103106681A; GB201212642D0; JP2013037691A; DE102012213292A8; KR20130016120A; DE102012213292A1; US20130033507A1

Abstract

A system, method, and computer program product are provided for constructing an acceleration structure. In use, a plurality of primitives associated with a scene is identified and an acceleration structure is constructed, utilizing the primitives. The structure may include a hierarchical linearized bounding volume hierarchy having a plurality of nodes, including child nodes representing bounding boxes located within a parent node and leaf nodes representing one or more primitives residing within respective parent bounding boxes. The construction may include sorting the primitives along a space-filling curve that spans a bounding box of the scene and is determined by calculating a Morton code of a centroid of each primitive, and the sorting may be performed using a least-significant-digit radix sorting algorithm. The primitives may be clustered using a run-length encoding compression algorithm, the primitives may be partitioned within each cluster, and the construction may be performed entirely on a GPU.

Description

SYSTEM, METHOD, AND COMPUTER PROGRAM PRODUCT FOR CONSTRUCTING AN ACCELERATION STRUCTURE

FIELD OF THE INVENTION

[0001] The present invention relates to rendering images, and more particularly to performing ray tracing.

BACKGROUND

[0002] Traditionally, ray tracing has been used to generate images within a displayed scene. For example, intersections between a plurality of rays and a plurality of primitives of the displayed scene may be determined in order to render images associated with the primitives. However, current techniques for performing ray tracing have been associated with various limitations.

[0003] For example, current methods for performing ray tracing may inefficiently construct acceleration structures used in association with the ray tracing. This may result in time-intensive construction of acceleration structures that are associated with large numbers of primitives.

[0004] There is thus a need for addressing these and/or other issues associated with the prior art.

SUMMARY

[0005] A system, method, and computer program product are provided for constructing an acceleration structure. In use, a plurality of primitives associated with a scene is identified. Additionally, an acceleration structure is constructed, utilizing the primitives.

BRIEF DESCRIPTION OF THE DRAWINGS

[0006] Figure 1 shows a method for constructing an acceleration structure, in accordance with one embodiment.

[0007] Figure 2 shows a task queue system used in performing partitioning during the construction of an acceleration structure, in accordance with another embodiment.

[0008] Figure 3 shows a sorting of a group of primitives using Morton codes, in accordance with yet another embodiment.

[0009] Figure 4 shows a plurality of middle-split queues corresponding to the sorting performed in Figure 3, in accordance with yet another embodiment.

[0010] Figure 5 shows a data flow visualization of a SAH binning procedure, in accordance with yet another embodiment.

[0011] Figure 6 illustrates an exemplary system in which the various architecture and/or functionality of the various previous embodiments may be implemented.

DETAILED DESCRIPTION

[0012] Figure 1 shows a method 100 for constructing an acceleration structure, in accordance with one embodiment. As shown in operation 102, a plurality of primitives associated with a scene is identified. In one embodiment, the scene may include a scene that is in the process of being rendered. For example, the scene may be in the process of being rendered using ray tracing. In another embodiment, the plurality of primitives may be included within the scene. For example, the scene may be composed of the plurality of primitives. In yet another embodiment, the plurality of primitives may include a plurality of triangles. Of course, however, the plurality of primitives may include any primitives used to perform ray tracing.

[0013] Additionally, as shown in operation 104, an acceleration structure is constructed, utilizing the primitives. In one embodiment, the acceleration structure may include a bounding volume hierarchy (BVH). In another embodiment, the acceleration structure may include a linearized bounding volume hierarchy (LBVH). In yet another embodiment, the acceleration structure may include a hierarchical linearized bounding volume hierarchy (HLBVH).

[0014] In another embodiment, the acceleration structure may include a plurality of nodes. For example, the acceleration structure may include a hierarchy of nodes, where child nodes represent bounding boxes located within respective parent node bounding boxes, and where leaf nodes represent one or more primitives that reside within respective parent bounding boxes. In this way, the acceleration structure may include a bounding volume hierarchy which may organize the primitives into a plurality of hierarchical boxes to be used during ray tracing.

[0015] Further, in one embodiment, constructing the acceleration structure may include sorting the primitives. For example, the primitives may be sorted along a space-filling curve (e.g., a Morton curve, a Hilbert curve, etc.) that spans a bounding box of the scene. In another embodiment, the space-filling curve may be determined by calculating a Morton code of a centroid of each primitive in the scene (e.g., an average location in the middle of the primitive may be transformed from three-dimensional (3D) coordinates into a one-dimensional coordinate associated with a recursively designed Morton curve, etc.).
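By way of illustration only, the Morton code calculation described above may be sketched as follows. The sketch assumes 10 bits per axis (yielding a 30-bit code), an x-y-z bit interleaving order, and triangles given as three (x, y, z) vertices. The helper names quantize, interleave3, and morton_code are hypothetical and do not appear in the present description.

```python
def quantize(value, lo, hi, bits):
    """Map value in [lo, hi] to an integer grid cell in [0, 2**bits - 1]."""
    cells = 1 << bits
    if hi <= lo:  # degenerate extent: everything falls in cell 0
        return 0
    cell = int((value - lo) / (hi - lo) * cells)
    return min(max(cell, 0), cells - 1)


def interleave3(x, y, z, bits):
    """Interleave the low `bits` bits of x, y, z (assumed order: x, then y, then z)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i + 2)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i)
    return code


def morton_code(triangle, scene_min, scene_max, bits=10):
    """Morton code of the centroid of a triangle inside the scene bounding box."""
    centroid = [sum(v[a] for v in triangle) / 3.0 for a in range(3)]
    cells = [quantize(centroid[a], scene_min[a], scene_max[a], bits)
             for a in range(3)]
    return interleave3(cells[0], cells[1], cells[2], bits)
```

Sorting the primitives by the returned integer orders them along the Morton curve, since nearby centroids tend to share high-order bits.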

[0016] In another example, the sorting may be performed utilizing a least-significant-digit radix sorting algorithm. In another embodiment, constructing the acceleration structure may include forming clusters of primitives (e.g., coarse clusters of primitives, etc.) within the scene. For example, the clusters may be formed utilizing a run-length encoding compression algorithm.

[0017] Further still, in one embodiment, constructing the acceleration structure may include partitioning primitives within each formed cluster. For example, constructing the acceleration structure may include partitioning all primitives within each cluster using spatial middle splits (e.g., LBVH-style spatial middle splits, etc.). In another example, constructing the acceleration structure may include creating a tree (e.g., a top-level tree, etc.), utilizing the clusters. For example, constructing the acceleration structure may include creating a top-level tree by partitioning the clusters (e.g., utilizing a binned surface area heuristic (SAH), a SAH-optimized tree construction algorithm, etc.). In another embodiment, the SAH may utilize a parallel binning scheme.

[0018] Also, in one embodiment, partitioning the primitives and the clusters may be performed utilizing one or more task queues. For example, a task queue system may be used to parallelize work during the construction of the acceleration structure (e.g., by creating a pipeline, etc.). In another embodiment, the acceleration structure may be constructed utilizing one or more algorithms. For example, sorting the primitives, forming the clusters of the primitives, partitioning the primitives, and creating the tree may all be performed utilizing one or more algorithms.

[0019] Additionally, in one embodiment, constructing the acceleration structure may be performed utilizing a graphics processing unit (GPU). For example, a GPU may perform the entire construction of the acceleration structure. In this way, the transfer of data between the GPU and system memory associated with a central processing unit (CPU) may be avoided, which may decrease the time necessary to construct the acceleration structure.

[0020] More illustrative information will now be set forth regarding various optional architectures and features with which the foregoing framework may or may not be implemented, per the desires of the user. It should be strongly noted that the following information is set forth for illustrative purposes and should not be construed as limiting in any manner. Any of the following features may be optionally incorporated with or without the exclusion of other features described.

[0021] Figure 2 shows a task queue system 200 used in performing partitioning during the construction of an acceleration structure, in accordance with another embodiment. As an option, the present task queue system 200 may be carried out in the context of the functionality of Figure 1. Of course, however, the task queue system 200 may be implemented in any desired environment. It should also be noted that the aforementioned definitions may apply during the present description.

[0022] As shown, the task queue system 200 includes a plurality of warps 202A and 202B that each fetch sets of tasks to process (e.g., from an input queue, etc.). In one embodiment, each of the plurality of warps 202A and 202B may include a unit of work (e.g., a physical SIMT unit of work on a GPU, etc.). In another embodiment, each individual task may correspond to processing a single node during the construction of an acceleration structure.

[0023] Additionally, in one embodiment, at run time, each of the plurality of warps 202A and 202B may continue to fetch sets of tasks to process from the input queue, where each set may contain one task per thread. Additionally, each of the plurality of warps 202A and 202B may use a single global memory atomic add per warp to update the queue head. Further, each thread in each of the plurality of warps 202A and 202B computes a number of output tasks 204 that it will generate.

[0024] Further still, after each thread in each of the plurality of warps 202A and 202B has computed the number of output tasks 204 that it will generate, all threads in each of the plurality of warps 202A and 202B participate in a warp-wide prefix sum 206 to compute the offset of their output tasks relative to the common base of each of the plurality of warps 202A and 202B. In one embodiment, the first thread in each of the plurality of warps 202A and 202B may perform a single global memory atomic add to compute a base address in an output queue of the plurality of warps 202A and 202B. Also, in one embodiment, a separate queue may be used per level, which may enable all the processing to be performed inside a single kernel call, while at the same time producing a breadth-first tree layout.

[0025] In one embodiment, constructing the acceleration structure may include using one or more algorithms to create both a standard LBVH and a higher quality SAH hybrid. See, for example, "HLBVH: Hierarchical LBVH construction for real-time ray tracing of dynamic geometry," (Pantaleoni et al., High-Performance Graphics 2010, ACM Siggraph / Eurographics Symposium Proceedings, Eurographics, 87-95), which is hereby incorporated by reference in its entirety, and which describes methods for constructing an LBVH and an HLBVH.
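By way of illustration only, the per-warp output allocation described for the task queue system 200 may be sketched as a sequential simulation. The sketch assumes that an exclusive prefix sum over the per-thread output counts gives each thread its offset, and that one atomic add per warp reserves the warp's slots in the output queue. The names OutputQueue and allocate_warp_outputs are hypothetical, and the plain Python loop merely stands in for the warp-wide parallel prefix sum.

```python
class OutputQueue:
    """Minimal stand-in for an output queue with an atomically updated tail."""

    def __init__(self):
        self.tail = 0

    def atomic_add(self, amount):
        """Reserve `amount` slots and return the previous tail (the base address)."""
        base = self.tail
        self.tail += amount
        return base


def allocate_warp_outputs(counts, queue):
    """Return the first output slot of each thread in one warp.

    counts[i] is the number of output tasks thread i will generate.
    """
    offsets = []
    running = 0
    for count in counts:          # exclusive prefix sum over the warp
        offsets.append(running)
        running += count
    base = queue.atomic_add(running)  # single atomic add for the whole warp
    return [base + offset for offset in offsets]
```

For example, a warp whose threads produce 2, 0, 2, and 1 tasks reserves five consecutive slots with one atomic operation, and a later warp starts writing directly after them.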

[0026] Additionally, in another embodiment, constructing the acceleration structure may include sorting primitives along a 30-bit Morton curve that spans a bounding box of a scene. See, for example, "Fast BVH construction on GPUs," (Lauterbach et al., Comput. Graph. Forum 28, 2, 375-384), which is hereby incorporated by reference in its entirety, and which describes methods for sorting primitives and constructing BVHs. In yet another embodiment, the primitives may be sorted utilizing a brute force algorithm (e.g., a least-significant-digit radix sorting algorithm, etc.).
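By way of illustration only, a least-significant-digit radix sort of Morton codes may be sketched as follows. The sketch assumes 30-bit keys processed in 5-bit digits (six passes) and carries a primitive index alongside each key. The function name lsd_radix_sort and both parameter defaults are assumptions made for the example, and the sequential loops stand in for the parallel histogram and scatter steps a GPU sort would use.

```python
def lsd_radix_sort(keys, values, key_bits=30, digit_bits=5):
    """Stable LSD radix sort of (key, value) pairs by non-negative integer key."""
    radix = 1 << digit_bits
    mask = radix - 1
    keys, values = list(keys), list(values)
    for shift in range(0, key_bits, digit_bits):
        # Histogram of the current digit.
        counts = [0] * radix
        for key in keys:
            counts[(key >> shift) & mask] += 1
        # Exclusive prefix sum gives the first output slot of each digit value.
        starts = [0] * radix
        running = 0
        for digit in range(radix):
            starts[digit] = running
            running += counts[digit]
        # Stable scatter: equal digits keep their relative order.
        out_keys = [0] * len(keys)
        out_values = [None] * len(keys)
        for key, value in zip(keys, values):
            digit = (key >> shift) & mask
            out_keys[starts[digit]] = key
            out_values[starts[digit]] = value
            starts[digit] += 1
        keys, values = out_keys, out_values
    return keys, values
```

Because every pass is stable, sorting from the least significant digit upward leaves the keys fully ordered after the final pass.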

[0027] In still another embodiment, utilizing an observation that Morton codes define a hierarchical grid, where each 3n-bit code identifies a unique voxel in a regular grid with 2^n entries per side, and where in one embodiment, the first 3m bits of the code identify the parent voxel in the coarser grid with 2^m subdivisions per side, coarse clusters of objects may be formed falling in each 3m-bit bin. In another embodiment, the grid in which the unique voxel is identified may include different amounts of entries per side. In yet another embodiment, forming the coarse clusters of objects may be performed utilizing an instance of a run-length encoding compression algorithm, and may be implemented with a single compaction operation.
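By way of illustration only, forming coarse clusters from sorted Morton codes may be sketched as a run-length encoding of the high 3m bits. The sketch assumes 3n-bit codes that are already sorted, and it reports each cluster as a (prefix, start, length) tuple. The function name form_clusters and the tuple layout are hypothetical, and the sequential loop stands in for the single parallel compaction operation mentioned above.

```python
def form_clusters(sorted_codes, n_bits, m_bits):
    """Run-length encode the top 3*m_bits bits of sorted 3*n_bits-bit Morton codes.

    Returns a list of (prefix, start, length): the shared coarse voxel prefix,
    the index of the first primitive in the run, and the run length.
    """
    shift = 3 * (n_bits - m_bits)
    clusters = []
    for index, code in enumerate(sorted_codes):
        prefix = code >> shift
        if clusters and clusters[-1][0] == prefix:
            clusters[-1][2] += 1            # extend the current run
        else:
            clusters.append([prefix, index, 1])  # a new run starts here
    return [tuple(cluster) for cluster in clusters]
```

With n = 2 and m = 1, for instance, the 6-bit codes 0, 3, 7 share the coarse prefix 0 and form one cluster of three primitives.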

[0028] Further, in one embodiment, after the clusters are identified, all the primitives may be partitioned inside each cluster (e.g., using LBVH-style spatial middle splits, etc.). In another embodiment, a top-level tree may then be created, where the clusters may be partitioned with a binned SAH builder. See, for example, "On fast Construction of SAH-based Bounding Volume Hierarchies," (Wald, I., In Proceedings of the 2007 Eurographics/IEEE Symposium on Interactive Ray Tracing, Eurographics), which is hereby incorporated by reference in its entirety, and which describes methods for partitioning clusters.

[0029] Further still, in one embodiment, both the spatial middle split partitioning and the SAH builder may rely on an efficient task queue system (e.g., the task queue system 200, etc.), which may parallelize work over the individual nodes of the output hierarchies.

[0030] Also, in one embodiment, middle split hierarchy emission may be performed. For example, it may be noted that each node in the hierarchy may correspond to a consecutive range of primitives sorted by their Morton codes, and that splitting a node may require finding the first element in the range whose code differs from the preceding element. Additionally, in another embodiment, complex machinery may be avoided by reverting to a standard ordering that may be used on a serial device. For example, each node may be mapped to a single thread, and each thread may be allowed to find its own split plane.

[0031] In yet another embodiment, instead of looping through the entire range of primitives in the node, it may be observed that it is possible to reformulate the problem as a simple binary search. For example, it may be determined that if a node is located at a level l, the Morton codes of the primitives of the node may have the exact same set of high l-1 bits. In another embodiment, the first bit p ≥ l by which the first and last Morton code in the node's range differ may be determined. In still another embodiment, a binary search may be performed to locate the first Morton code that contains a 1 at bit p.

[0032] In this way, for a node containing N primitives, the algorithm may find the split plane by touching only O(log2(N)) memory cells, instead of the entire set of N Morton codes.
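By way of illustration only, the binary search for a node's split position may be sketched as follows. The sketch assumes the node covers the half-open range [begin, end) of a sorted list of Morton codes, and it numbers bits from the least significant end (bit 0), which differs from the level-based numbering used above. The function name find_split is hypothetical.

```python
def find_split(codes, begin, end):
    """Return the index of the first code in codes[begin:end] with a 1 at bit p,
    where p is the highest bit in which the first and last codes differ.

    Returns None when all codes in the range are identical (the middle split fails).
    """
    first, last = codes[begin], codes[end - 1]
    diff = first ^ last
    if diff == 0:
        return None
    p = diff.bit_length() - 1  # highest differing bit, counted from bit 0
    # Invariant: codes[lo] has 0 at bit p, codes[hi] has 1 at bit p.
    lo, hi = begin, end - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if (codes[mid] >> p) & 1:
            hi = mid
        else:
            lo = mid
    return hi
```

Because the codes are sorted and share every bit above p, bit p switches from 0 to 1 exactly once inside the range, which is what makes the binary search valid.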

[0033] Additionally, in one embodiment, middle splits may sometimes fail, which may lead to occasional large leaves. In another embodiment, when such a failure is detected, the leaves may be split by the object-median. In yet another embodiment, after the topology of the BVH has been computed, a bottom-up refitting procedure may be run to compute the bounding boxes of each node in the tree. This process may be simplified by the fact that the BVH is stored in breadth-first order. In another embodiment, one kernel launch may be used per tree level, and one thread may be used per node in the level.
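By way of illustration only, the bottom-up refitting over a breadth-first node layout may be sketched as follows. The sketch assumes nodes are stored as dictionaries in breadth-first order, that level_offsets[k] is the index of the first node of level k (with a final entry equal to the node count), and that every leaf references a non-empty range of primitive boxes. The names refit and union and the dictionary keys are hypothetical. The outer loop corresponds to one kernel launch per level, and the inner loop to one thread per node.

```python
from functools import reduce


def union(a, b):
    """Smallest axis-aligned box enclosing a and b; a box is (min_xyz, max_xyz)."""
    return (tuple(map(min, a[0], b[0])), tuple(map(max, a[1], b[1])))


def refit(nodes, level_offsets, prim_boxes):
    """Compute one bounding box per node, processing the deepest level first."""
    boxes = [None] * len(nodes)
    for level in range(len(level_offsets) - 2, -1, -1):
        for i in range(level_offsets[level], level_offsets[level + 1]):
            node = nodes[i]
            if node["leaf"]:
                parts = prim_boxes[node["begin"]:node["end"]]
            else:
                # Children live on a deeper level, so their boxes are ready.
                parts = [boxes[node["left"]], boxes[node["right"]]]
            boxes[i] = reduce(union, parts)
    return boxes
```

Since all nodes of a level are contiguous in memory, each level can be processed independently once the level below it has finished.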

[0034] Figure 3 shows a sorting 300 of a group of primitives using Morton codes, in accordance with another embodiment. As an option, the present sorting 300 may be carried out in the context of the functionality of Figures 1-4. Of course, however, the sorting 300 may be implemented in any desired environment. It should also be noted that the aforementioned definitions may apply during the present description.

[0035] As shown, centroids of a plurality of bounded primitives 302A-J located within a two-dimensional projection are each assigned Morton codes (e.g., four-bit Morton codes, etc.). Additionally, the plurality of bounded primitives 302A-J are sorted into a sequence of rows 306A-J, where the assigned Morton codes are used as keys. For example, for every respective primitive of sequence 306A-J, the Morton code bits are shown in separate rows 308. Additionally, binary search partitions 310 are made to the sequence of rows 306A-J. Further, Figure 4 shows a plurality of middle-split queues 402A-E corresponding to the sorting 300 performed in Figure 3, in accordance with another embodiment.

[0036] Additionally, in one embodiment, a SAH-optimized tree construction algorithm may be run over the coarse clusters defined by the first 3m bits of the Morton curve. In one embodiment, m may be between 5 and 7. Of course, however, m may include any integer. In another embodiment, the construction algorithm may run in a bounded memory footprint. For example, if N clusters are processed, space may be preallocated only for 2N-1 nodes.

[0037] Table 1 illustrates pseudo-code for the SAH binning procedure associated with the optimized tree construction algorithm. Of course, it should be noted that the pseudo-code shown in Table 1 is set forth for illustrative purposes only, and thus should not be construed as limiting in any manner.

Table 1

    int qin = 0;
    int numQElems = 1;
    hltop_queue_init(queue[qin], Clusters, numClusters);
    while (numQElems > 0)
    {
        // init all bins (empty bounding boxes, reset counters)
        bins_init(queue[qin], numQElems);
        // compute bin statistics
        accumulate_bins(queue[qin], Clusters, numClusters);
        int output_counter = 0;
        // compute best splits
        sah_split(queue[qin], numQElems, queue[1 - qin], &output_counter,
                  BvhReferences, numBvhNodes);
        // distribute clusters to their new split task
        distribute_clusters(queue[qin], Clusters, numClusters);
        numQElems = output_counter;
        numBvhNodes += output_counter;
        qin = 1 - qin;
        BvhLevelOffset[numBvhLevels++] = numBvhNodes;
    }

[0038] In one embodiment, in a pass, a cluster from the prior pass (with its aggregate bounding box) may be treated as a primitive. In another embodiment, the computation may be split into split tasks organized in a single input queue and a single output queue. In yet another embodiment, each task may correspond to a node that needs to be split, and may be described by three input fields (e.g., the node's bounding box, the number of clusters inside the node, and the node ID).

[0039] Additionally, in one embodiment, two additional fields may be computed on the fly (e.g., the best split plane and the ID of the first child split task). In another embodiment, these fields may be stored in a structure of arrays (SOA) format, which may keep a number (e.g., five, etc.) of separate arrays indexed by a task ID. In yet another embodiment, an array (e.g., cluster_split_id, etc.) may be kept that maps each cluster to the current node (i.e. split task, etc.) it belongs to, where the array may be updated with every splitting operation.
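By way of illustration only, a structure-of-arrays layout for the split tasks may be sketched as follows. The sketch assumes five parallel lists indexed by task ID (three input fields plus the two fields computed on the fly) and uses -1 as a placeholder for values not yet computed. The class name SplitTaskQueue, its field names, and the push method are hypothetical; only cluster_split_id is a name taken from the description.

```python
from dataclasses import dataclass, field


@dataclass
class SplitTaskQueue:
    """Structure-of-arrays storage: one list per field, indexed by task ID."""
    bbox: list = field(default_factory=list)          # node bounding box
    num_clusters: list = field(default_factory=list)  # clusters inside the node
    node_id: list = field(default_factory=list)       # ID of the node to split
    best_split: list = field(default_factory=list)    # computed on the fly
    new_task: list = field(default_factory=list)      # ID of first child split task

    def push(self, bbox, num_clusters, node_id):
        """Append a task described by its three input fields; return its task ID."""
        self.bbox.append(bbox)
        self.num_clusters.append(num_clusters)
        self.node_id.append(node_id)
        self.best_split.append(-1)  # placeholder until the SAH split is evaluated
        self.new_task.append(-1)    # placeholder until child tasks are created
        return len(self.node_id) - 1


# Usage: all clusters start in the root split task.
queue = SplitTaskQueue()
root_task = queue.push(((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)), 4, 0)
cluster_split_id = [root_task] * 4  # maps each cluster to its current split task
```

Keeping each field in its own array lets neighbouring threads read neighbouring elements of the same field, which is the usual motivation for an SOA layout on a GPU.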

[0040] Further, in one embodiment, the loop in Table 1 may start by assigning all clusters to the root node, which may form a split-task 0. Then, for each loop iteration, binning, SAH evaluation, and cluster distribution steps may be performed. For example, each node's bounding box may be split into M (e.g., M including an integer such as eight, etc.) slab-shaped bins in each dimension. See, for example, "Ray Tracing Deformable Scenes using Dynamic Bounding Volume Hierarchies," (Wald, et al., ACM Transactions on Graphics 26, 1, 485-493), which is hereby incorporated by reference in its entirety, and which describes methods for splitting node bounding boxes.

[0041] Further still, in another embodiment, a bin may store an initially empty bounding box and a count. In yet another embodiment, each cluster's bounding box may be accumulated into the bin containing its centroid, and the count of the number of clusters falling within the bin may be atomically incremented. In still another embodiment, this procedure may be executed in parallel across the clusters, where each thread may look at a single cluster and may accumulate its bounding box into the corresponding bin within the corresponding split-task, using atomic min/max to grow the bins' bounding boxes.

[0042] Also, in one embodiment, for each split-task in the input queue, the surface area metric may be evaluated for all the split planes in each dimension between the uniformly distributed bins, and the best one may be selected. In another embodiment, if the split-task contains a single cluster, the subdivision may be stopped; otherwise, two output split-tasks may be created, where bounding boxes corresponding to the left and right subspaces may be determined by the SAH split.
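By way of illustration only, the binning and SAH evaluation for a single split-task may be sketched as follows. The sketch assumes a cost of the form (cluster count times bounding box surface area) summed over both sides, which is one common form of the surface area metric and is not stated in the present description. It further assumes M = 8 bins per dimension and that clusters whose bin index is below the chosen split go to the left child. The names best_binned_split, surface_area, and union are hypothetical, and the sequential accumulation stands in for the atomic min/max and atomic increment steps.

```python
from functools import reduce


def union(a, b):
    """Smallest axis-aligned box enclosing a and b; a box is (min_xyz, max_xyz)."""
    return (tuple(map(min, a[0], b[0])), tuple(map(max, a[1], b[1])))


def surface_area(box):
    dx, dy, dz = (hi - lo for lo, hi in zip(box[0], box[1]))
    return 2.0 * (dx * dy + dy * dz + dz * dx)


def best_binned_split(node_box, cluster_boxes, num_bins=8):
    """Return (cost, axis, split_bin) for the cheapest split plane, or None."""
    best = None
    for axis in range(3):
        lo, hi = node_box[0][axis], node_box[1][axis]
        if hi <= lo:
            continue  # flat node along this axis: no usable split plane
        bins = [None] * num_bins  # initially empty bounding boxes
        counts = [0] * num_bins   # number of clusters per bin
        for box in cluster_boxes:
            centroid = 0.5 * (box[0][axis] + box[1][axis])
            b = int((centroid - lo) / (hi - lo) * num_bins)
            b = min(max(b, 0), num_bins - 1)
            bins[b] = box if bins[b] is None else union(bins[b], box)
            counts[b] += 1
        # Plane `split` sends bins [0, split) left and bins [split, M) right.
        for split in range(1, num_bins):
            n_left, n_right = sum(counts[:split]), sum(counts[split:])
            if n_left == 0 or n_right == 0:
                continue
            left = reduce(union, (bb for bb in bins[:split] if bb is not None))
            right = reduce(union, (bb for bb in bins[split:] if bb is not None))
            cost = n_left * surface_area(left) + n_right * surface_area(right)
            if best is None or cost < best[0]:
                best = (cost, axis, split)
    return best
```

A result of None indicates that no plane separates the clusters (for example, when the split-task holds a single cluster), in which case the subdivision would stop.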

[0043] In addition, in one embodiment, the mapping between clusters and split-tasks may be updated, where each cluster may be mapped to one of the two output split-tasks generated by its previous owner. In order to determine the new split-task ID, the i-th cluster's bin ID may be compared to the value stored in the best split field of the corresponding split-task. Table 2 illustrates pseudo-code for a comparison of the i-th cluster's bin ID to the value stored in the best split field of the corresponding split-task. Of course, it should be noted that the pseudo-code shown in Table 2 is set forth for illustrative purposes only, and thus should not be construed as limiting in any manner.

Table 2

    int old_id = cluster_split_id[i];
    int bin_id = cluster_bin_id[i];
    int split_id = queue[in].best_split[old_id];
    int new_id = queue[in].new_task[old_id];
    cluster_split_id[i] = new_id + (bin_id < split_id ? 0 : 1);

[0044] Further, in one embodiment, there may be some flexibility in the order of the algorithm phases. For example, refitting may be performed separately for bottom-level and top-level phases to trade off cluster bounding box precision against parallelism.

[0045] Figure 5 shows a data flow visualization 500 of a SAH binning procedure, in accordance with another embodiment. As an option, the present data flow visualization 500 may be carried out in the context of the functionality of Figures 1-4. Of course, however, the data flow visualization 500 may be implemented in any desired environment. It should also be noted that the aforementioned definitions may apply during the present description.

[0046] As shown, clusters 502A and 502B contribute to forming the bin statistics 504 of their parent node. Additionally, nodes in the input task queue 506 are split, generating two entries 508A and 508B into the output queue 510.

[0047] Additionally, in one embodiment, specialized builders for clusters of fine intricate geometry (e.g., hair, fur, foliage, etc.) may be integrated. In another embodiment, this work may be easily integrated with triangle splitting strategies. See, for example, "Early split clipping for bounding volume hierarchies," (Ernst, et al., Symposium on Interactive Ray Tracing, 0, 73-78), which is hereby incorporated by reference in its entirety, and which describes triangle splitting strategies. In yet another embodiment, compress-sort-decompress techniques may be re-incorporated in order to exploit coherence internal to the mesh.

[0048] In this way, HLBVH may be implemented based on generic task queues, which may include a flexible paradigm of work dispatching that may be used to build simple and fast parallel algorithms. Additionally, in one embodiment, the same mechanism may be used to implement a massively parallel binned SAH builder for the high quality HLBVH variant. In another embodiment, the HLBVH implementation may be performed entirely on the GPU. In this way, synchronization and memory copies between the CPU and GPU may be removed. For example, when including the elimination of these overheads, the resulting builder may be faster (e.g., 5-10 times faster, etc.) than previous techniques. In another example, when considering just the kernel times alone, the resulting builder may also be faster (e.g., up to 3 times faster, etc.) than previous techniques.

[0049] Additionally, in one embodiment, high quality bounding volume hierarchies may be produced in real-time even for moderately complex models. In another embodiment, the algorithms may be faster than previous HLBVH implementations. This may be possible thanks to a general simplification offered by the adoption of work queues, which may allow a significant reduction in the number of high latency kernel launches and may reduce data transformation passes.

[0050] Further, in one embodiment, hierarchical linear bounding volume hierarchies (HLBVHs) may be capable of reconstructing the spatial index needed for ray tracing in real-time, even in the presence of millions of fully dynamic triangles. In another embodiment, the aforementioned algorithms may enable a simpler and faster variant of HLBVH, where all the complex bookkeeping of prefix sums, compaction and partial breadth-first tree traversal needed for spatial partitioning may be replaced with an elegant pipeline built on top of efficient work queues and binary search. In yet another embodiment, the new algorithm may be both faster and more memory efficient, which may remove the need for temporary storage of geometry data for intermediate computations. Also, in one embodiment, the same pipeline may be extended to parallelize the construction of the top-level SAH optimized tree on the GPU, which may eliminate round-trips to the CPU, thereby accelerating the overall construction speed (e.g., by a factor of five to ten times, etc.).

[0051] In another embodiment, a novel variant of hierarchical linear bounding volume hierarchies (HLBVHs) may be provided that is simple, fast and easy to generalize. In one embodiment, an ad-hoc, complex mix of prefix-sums, compaction and partial breadth-first tree traversal primitives used to perform an actual object partitioning step may be replaced with a single, elegant pipeline based on efficient work-queues. In this way, the original HLBVH algorithm may be simplified, and superior speeds may be offered. Additionally, in one embodiment, the new pipeline may also remove the need for all additional temporary storage that may have been previously required.

[0052] Further still, in one embodiment, a surface area heuristic (SAH) optimized HLBVH hybrid may be parallelized. For example, the added flexibility of a task-based pipeline may be combined with the efficiency of a parallel binning scheme. In this way, a speedup factor of up to ten times traditional methods may be obtained. Additionally, by parallelizing the entire pipeline, all acceleration structure construction may be run on the GPU, which may eliminate costly copies between CPU and GPU memory spaces.

[0053] Also, in one embodiment, all algorithms used to construct the acceleration structure may be implemented using CUDA parallel computing architecture. See, for example, "Scalable parallel programming with CUDA," (Nickolls, et al., ACM Queue 6, 2, 40-53), which is hereby incorporated by reference in its entirety, and which describes implementations of parallel computing with CUDA. Additionally, the construction of the acceleration structure may be performed utilizing efficient sorting primitives. See, for example, "Revisiting sorting for GPGPU stream architectures," (Merrill, et al., Tech. Rep. CS2010-03, Department of Computer Science, University of Virginia, February), which is hereby incorporated by reference in its entirety, and which describes efficient sorting primitives.

[0054] Additionally, in one embodiment, constructing the acceleration structure may include constructing a BVH. For example, a 3D extent of a scene may be discretized using n bits per dimension, and each point may be assigned a linear coordinate along a space-filling Morton curve of order n (which may be computed by interleaving the binary digits of the discretized coordinates). In another embodiment, primitives may then be sorted according to the Morton code of their centroid. In still another embodiment, the hierarchy may be built by grouping the primitives in clusters with the same 3n-bit code, then grouping the clusters with the same 3(n-1) high order bits, and so on, until a complete tree is built. In yet another embodiment, the 3m high order bits of a Morton code may identify the parent voxel in a coarse grid with 2^m divisions per side, such that this process may correspond to splitting the primitives recursively in the spatial middle, from top to bottom.

[0055] Further, in one embodiment, HLBVH may improve on the basic algorithm in multiple ways. For example, it may provide a faster construction algorithm applying a compress-sort-decompress strategy to exploit spatial and temporal coherence in the input mesh. In another example, it may introduce a high-quality hybrid builder, in which the top of the hierarchy is built using a Surface Area Heuristic (SAH) sweep builder over the clusters defined by the voxelization at level m. See, for example, "Automatic creation of object hierarchies for ray tracing," (Goldsmith, et al., IEEE Computer Graphics and Applications 7, 5, 14-20), which is hereby incorporated by reference in its entirety, and which describes an exemplary SAH.

[0056] In another embodiment, a custom scheduler may be built based on task-queues to implement a light-weight threading model, which may avoid overheads of built-in hardware threads support. See, for example, "Fast Construction of SAH BVHs on the Intel Many Integrated Core (MIC) Architecture," (Wald, I., IEEE Transactions on Visualization and Computer Graphics), which is hereby incorporated by reference in its entirety, and which describes a parallel binned-SAH BVH builder optimized for a prototype many-core architecture.

[0057] Figure 6 illustrates an exemplary system 600 in which the various architecture and/or functionality of the various previous embodiments may be implemented. As shown, a system 600 is provided including at least one host processor 601 which is connected to a communication bus 602. The system 600 also includes a main memory 604. Control logic (software) and data are stored in the main memory 604 which may take the form of random access memory (RAM).

[0058] The system 600 also includes a graphics processor 606 and a display 608, i.e. a computer monitor. In one embodiment, the graphics processor 606 may include a plurality of shader modules, a rasterization module, etc. Each of the foregoing modules may even be situated on a single semiconductor platform to form a graphics processing unit (GPU).

[0059] In the present description, a single semiconductor platform may refer to a sole unitary semiconductor-based integrated circuit or chip. It should be noted that the term single semiconductor platform may also refer to multi-chip modules with increased connectivity which simulate on-chip operation, and make substantial improvements over utilizing a conventional central processing unit (CPU) and bus implementation. Of course, the various modules may also be situated separately or in various combinations of semiconductor platforms per the desires of the user.

[0060] The system 600 may also include a secondary storage 610. The secondary storage 610 includes, for example, a hard disk drive and/or a removable storage drive, representing a floppy disk drive, a magnetic tape drive, a compact disk drive, etc. The removable storage drive reads from and/or writes to a removable storage unit in a well known manner.

[0061] Computer programs, or computer control logic algorithms, may be stored in the main memory 604 and/or the secondary storage 610. Such computer programs, when executed, enable the system 600 to perform various functions. Memory 604, storage 610 and/or any other storage are possible examples of computer-readable media.

[0062] In one embodiment, the architecture and/or functionality of the various previous figures may be implemented in the context of the host processor 601, graphics processor 606, an integrated circuit (not shown) that is capable of at least a portion of the capabilities of both the host processor 601 and the graphics processor 606, a chipset (i.e. a group of integrated circuits designed to work and sold as a unit for performing related functions, etc.), and/or any other integrated circuit for that matter.

[0063] Still yet, the architecture and/or functionality of the various previous figures may be implemented in the context of a general computer system, a circuit board system, a game console system dedicated for entertainment purposes, an application-specific system, and/or any other desired system. For example, the system 600 may take the form of a desktop computer, lap-top computer, and/or any other type of logic. Still yet, the system 600 may take the form of various other devices including, but not limited to a personal digital assistant (PDA) device, a mobile phone device, a television, etc.

[0064] Further, while not shown, the system 600 may be coupled to a network (e.g. a telecommunications network, local area network (LAN), wireless network, wide area network (WAN) such as the Internet, peer-to-peer network, cable network, etc.) for communication purposes.

[0065] While various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of a preferred embodiment should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

CLAIMS