
Shader interfaces


Publication number
WO2009158679A2 (PCT/US2009/048960)
Other languages
French (fr)
Other versions
WO2009158679A3 (en)
WO2009158679A8 (en)
Michael V. Oneppo
Craig C. Peeper
Andrew L. Bliss
John L. Rapp
Mark M. Lacey
Original Assignee
Microsoft Corporation



    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/005 General purpose rendering architectures
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformations of program code
    • G06F8/41 Compilation
    • G06F8/44 Encoding
    • G06F8/443 Optimisation
    • G06F8/4441 Reducing the execution time required by the program code
    • G06F9/00 Arrangements for programme control, e.g. control unit
    • G06F9/06 Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
    • G06F9/44 Arrangements for executing specific programmes
    • G06F9/4421 Execution paradigms
    • G06F9/4428 Object-oriented
    • G06F9/443 Object-oriented method invocation or resolution
    • G06F9/00 Arrangements for programme control, e.g. control unit
    • G06F9/06 Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogramme communication; Intertask communication
    • G06F9/541 Interprogramme communication; Intertask communication via adapters, e.g. between incompatible applications
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/50 Lighting effects
    • G06T15/80 Shading
    • G06T2200/00 Indexing scheme for image data processing or generation, in general
    • G06T2200/28 Indexing scheme for image data processing or generation, in general involving image processing hardware


Allocation of memory registers for shaders by a processor is described herein. For each shader, registers are allocated based on the shader's level of complexity. Simpler shader instances are restricted to a smaller number of memory registers. More complex shader instances are allotted more registers. To do so, a developer's high level shading language (HLSL) includes template classes of shaders that can later be replaced by complex or simple versions of the shader. The HLSL is converted to bytecode that can be used to rasterize pixels on a computing device.




[0001] Today's graphic processing units (GPUs) host all of the computations necessary to generate high-quality graphics on computer screens, leaving a computing device's central processing unit (CPU) available for other tasks. Specifically, GPUs render graphics on computer screens by processing numerous programs called "shaders." In short, a shader is a specialized computer program that performs an operation for rendering a two-dimensional (2D) or three-dimensional (3D) graphic. In modern GPUs, realistic scenes are generated by rendering geometry with various virtual materials that are controlled by the shaders. These materials are represented in shader program code, which processes a variety of inputs (including texture maps, light locations, and other data) to generate the visual result. Using shaders, developers can control virtually any graphics or graphic effect by incorporating different vertex shading, primitive shading, and pixel shading.

[0002] The current methodology for rendering complex 3D graphic scenes in real time consists of supporting parallel-architecture processors in conjunction with customized logic units to hide latency by distributing the overhead across multiple parallel units. The pipelines utilized are designed around a primitive rasterization pipeline that, when provided a high level 3D description of a collection of linear primitives like points, line segments, or triangles, will convert, or rasterize, the collection to the projected pixel representations. In existing 3D hardware technologies, small programs called "shaders" are used to define the operation of certain stages of the rendering algorithm, like the transformations of the vertices of the primitives or computing the color of a single pixel on the screen. The shaders define a small amount of work to be performed in large parallel execution batches, often distributed across many specialized processors on a graphics processing unit (GPU).

[0003] Creation of shaders is done through a highly specialized programming language designed to target the hardware architectures available, and an equivalent compiler is available to take the code and reduce it down to instructions the hardware and associated device driver can use. Developers use this technology in order to customize the rendering pipeline to only the behavior desired for a specific application. For example, if the developer is creating an application that performs a non-photorealistic 3D rendering of very complex themes, the developer can optimize the shaders to be very simple in order to maximize the complexity of the scene. Conversely, if the developer wishes to have very high-fidelity material properties and lighting applied to less complex scenes, the developer may create highly-customized shaders to create very realistic effects that may be very complex. Furthermore, shaders are compiled into an abstract binary form, which a device driver maps for hardware to run.

[0004] To illustrate this point, consider a game scene in which a character is exposed to multiple light sources. One of the light sources may be simply ambient light from a moon at night. Another light source may be extending from a lamp post down a street. With the first light source, that being from the moon, a shader can be written to control the light emitting from the moon. In this case, the light is constant and only needs to be represented by a simple program to disperse the light throughout the scene. The lamp post, however, may be more complex. With the lamp post, the light may only be configured to shine in specific directions; however, the light from the lamp post may not bend around corners. Therefore, a shader written to govern the light from the lamp post may require a more complex computation than the shader written to govern the light coming from the moon. In either scenario, a GPU must rasterize pixels according to the underlying computations from each shader.

[0005] The common architecture for GPUs provides the trade-off between scene complexity and shader complexity by making resources on the system flexible. To execute, a shader typically requires a processing unit, shader-instanced data, global resources (e.g., texture images), intermediary register banks to perform computations, and a set of output registers. For simple shaders, meaning shaders that require relatively few registers to compute, many more shaders can be run simultaneously, resulting in an underlying application or game getting higher frame rates because more work can be done in parallel. For more complex shaders, meaning shaders that require more registers to compute, fewer instances can be executed in parallel because more registers are being used. In other words, the allocation of registers directly determines the number of shaders that can be processed in parallel. Because the time required to render graphics depends on parallel processing of shaders, it is advantageous to process as many shaders as possible, and thus the allocation of registers is crucial to performance.


[0006] This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

[0007] One aspect of the invention allows a single shader to have alternate paths of varying complexity within it. In this aspect, a selection of a particular path through the shader is provided in a way that allows efficient register allocation. Memory registers are allocated based on the level of complexity of a shader path or instance. In one embodiment, shader instances are shader programs, or portions thereof, developed by shader developers. Simpler shader paths or instances may be restricted to a smaller number of memory registers. More complex shader paths or instances may be allotted more registers. Another aspect of the invention is directed to a GPU that is configured to allocate memory registers in such a manner.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

[0008] The present invention is described in detail below with reference to the attached drawing figures, wherein:

[0009] FIG. 1 is a block diagram of an exemplary operating environment for use in implementing an embodiment of the present invention;

[0010] FIG. 2 is a diagram illustrating the allocation of memory registers of a computing device in accordance with an embodiment of the invention; and

[0011] FIG. 3 is a flow chart for allocating registers based on shader complexity in accordance with an embodiment of the invention.


[0012] The subject matter described herein is presented with specificity to meet statutory requirements. The description herein, however, is not intended to limit the scope of this patent. Rather, it is contemplated that the claimed subject matter might also be embodied in other ways, to include different steps or combinations of steps similar to the ones described in this document, in conjunction with other present or future technologies. Moreover, although the term "block" may be used herein to connote different elements of methods employed, the term should not be interpreted as implying any particular order among or between various steps herein disclosed.

[0013] Further, various technical terms are used throughout this description. A definition of such terms can be found in Newton's Telecom Dictionary by H. Newton, 21st Edition (2005). These definitions are intended to provide a clearer understanding of the ideas disclosed herein but are not intended to limit the scope of the present invention. The definitions and terms should be interpreted broadly and liberally to the extent allowed by the meaning of the words offered in the above-cited reference.

[0014] The invention can generally be described as one or more systems for, methods to, and computer-storage media for providing dynamic code linkage to optimally allocate registers for shaders based on their level of complexity. In one embodiment, developers can create their own shader classes and registers are allocated for each shader based on complexity. Simpler shaders are allocated fewer registers, and complex shaders are allocated additional registers.

[0015] As one skilled in the art will appreciate, embodiments of the present invention may be embodied as, among other things: a method, system, or computer-program product that is embodied on one or more tangible computer-readable media. Accordingly, the embodiments may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware.

[0016] Dynamic code binding provides a mechanism for abstracting the implementation of a function from the consumers of the function by providing a level of indirection between a function call and its implementation. Traditionally, this indirection is performed by first looking into a virtual function table to find the location of the function to execute. When the application is executed, the table, which was previously empty, is filled with the locations of the function implementations (the actual act of "linking"), thereby allowing the application to execute the functions as needed. In one embodiment, a subset of dynamic linkage is provided that reduces the number of permutations of specialized shaders while still offering global optimizations across the abstraction boundary.
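The traditional indirection described in paragraph [0016] can be illustrated with a minimal Python sketch. The function names, the slot name "calculate", and the numeric values are hypothetical and chosen only for illustration; the sketch shows the general idea of a table that is empty until link time, not the mechanism of any particular embodiment.

```python
from typing import Callable, Dict, Optional

# Hypothetical implementations of one abstract function.
def ambient(normal_dot_light: float) -> float:
    return 0.2  # constant contribution; the input is ignored

def directional(normal_dot_light: float) -> float:
    return max(0.0, min(1.0, normal_dot_light))  # clamp to [0, 1]

# The table starts out empty: call sites know the slot name, not the target.
function_table: Dict[str, Optional[Callable[[float], float]]] = {"calculate": None}

def link(slot: str, implementation: Callable[[float], float]) -> None:
    """Fill a table slot with the location of an implementation ("linking")."""
    function_table[slot] = implementation

def call(slot: str, argument: float) -> float:
    """Indirect call: look up the slot first, then execute what it points to."""
    target = function_table[slot]
    if target is None:
        raise RuntimeError(f"slot {slot!r} has not been linked")
    return target(argument)

link("calculate", directional)
lit = call("calculate", 0.75)

link("calculate", ambient)  # rebinding changes behavior without touching the caller
flat = call("calculate", 0.75)
```

The caller never names an implementation, which is the abstraction the paragraph describes. The cost is that every call goes through the table and the compiler cannot optimize across the call boundary.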

[0017] Instead of providing a single compiled implementation for each abstract function, embodiments generate compiled code in such a way that each use of a specific instance of a shader is compiled as if it were inlined in the code and then stored in a table sorted by function type and call location. It is important to understand that embodiments described herein differ from typical linkage in that at runtime no calling convention is used. Instead, each time a function should be called, a version of the function is emitted to match the call site's register state and other state. Because a new version of the function is emitted for each location in the shader code that the function is called from, all optimizations used when inlining apply, except that the function code must remain functionally separate from main shader code. Because embodiments described herein differ from "real" linkage, the amount of code generated by embodiments described herein can become quite large. No code sharing occurs between multiple call sites. If the code is larger than the code cache, the penalty from the latency of the cache miss is not hidden.
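The per-call-site emission described in paragraph [0017] can be sketched as a table keyed by function type and call location. The instruction counts, call-site labels, and register numbers below are hypothetical, and the emitted "instructions" are placeholder strings rather than real bytecode.

```python
from typing import Dict, List, Tuple

# Hypothetical instruction counts for two implementations of one interface method.
IMPLEMENTATION_SIZES: Dict[str, int] = {"AmbientLight": 1, "DirectionalLight": 6}

# Hypothetical call sites, each described by the first free register at that point.
CALL_SITES: Dict[str, int] = {"main:27": 0, "main:41": 3}

def emit_bodies(
    sizes: Dict[str, int], call_sites: Dict[str, int]
) -> Dict[Tuple[str, str], List[str]]:
    """Emit one body per (function type, call location) pair.

    Each body is specialized to the register state at its call site, so no
    calling convention is needed, but no body is shared between call sites.
    """
    table: Dict[Tuple[str, str], List[str]] = {}
    for name, size in sorted(sizes.items()):
        for site, first_free_register in sorted(call_sites.items()):
            table[(name, site)] = [
                f"{name}.op{i} -> r{first_free_register}" for i in range(size)
            ]
    return table

table = emit_bodies(IMPLEMENTATION_SIZES, CALL_SITES)
total_instructions = sum(len(body) for body in table.values())
shared_instructions = sum(IMPLEMENTATION_SIZES.values())  # size under ordinary linkage
```

With two implementations and two call sites, the table holds four bodies and twice as many instructions as ordinary linkage would. This is the code-size growth the paragraph warns about.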

[0018] Selectable inlining is used by some of the embodiments described herein. Selectable inlining allows a system to generate a shader instantiation that is not only close to the optimal instruction usage, but also utilizes the minimum needed registers per invocation for a given task. In this embodiment, the total number of registers needed by a specific shader invocation can be calculated quickly by the device driver and allocated accordingly. This keeps very complex calculations from affecting register usage unless those calculations are actually being performed.

[0019] In order to maintain optimization, embodiments are configured to emit a different version of each used method per call site, acting as if the method were inline to allow for optimizations across the method-call boundary. This has a trade-off of space: unlike standard linkage, which creates only one compiled version of each method, embodiments described herein create many, potentially causing larger binary files.

[0020] Embodiments described herein generally reference the Direct3D APIs included within various versions of the Windows® operating system (OS). Embodiments are not limited to Direct3D APIs. One skilled in the art will understand that various APIs in different OSs provide similar functionality to the calls and routines described herein. For clarity's sake, however, reference is made herein to Direct3D.
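The register calculation mentioned in paragraph [0018] can be sketched as follows. The register counts are hypothetical, and the rule used here (main-body registers plus the most demanding selected implementation) is an assumption made for illustration; the description does not specify how a driver performs the calculation.

```python
from typing import Dict

# Hypothetical temporary-register needs; real values would come from the compiler.
MAIN_BODY_REGISTERS = 2
IMPLEMENTATION_REGISTERS: Dict[str, int] = {"AmbientLight": 0, "DirectionalLight": 1}

def registers_for_invocation(selection: Dict[str, str]) -> int:
    """Registers needed when each call site is bound to one implementation.

    Assumes temporaries are free again after each call returns, so the peak
    is the main body plus the most demanding selected implementation.
    """
    peak = max(
        (IMPLEMENTATION_REGISTERS[name] for name in selection.values()),
        default=0,
    )
    return MAIN_BODY_REGISTERS + peak

ambient_only = registers_for_invocation({"main:27": "AmbientLight"})
with_directional = registers_for_invocation({"main:27": "DirectionalLight"})
```

Under these assumptions, an invocation that selects only the simple implementation is never charged for the registers of the complex one.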

[0021] Before proceeding further, a number of key terms should be defined. While the below definitions should aid the reader in understanding the embodiments described herein, the definitions are provided merely for explanatory purposes.

[0022] A "primitive" is a basic unit for describing a shape. In computer graphics, the triangle is typically considered the fundamental primitive because all possible 2D and 3D shapes can be composed of triangles. As one skilled in the art will appreciate, other shapes may alternatively be used as primitives in rendering graphics.

[0023] A "shader" is a small, specialized computer program that performs some aspect of a rendering computation. Shaders are responsible for a number of major aspects in the typical rendering pipeline. These aspects include, inter alia, vertex shading, primitive (or geometry) shading, and pixel shading. One skilled in the art will understand that vertex shading refers to determining the position and orientation of the vertices of a primitive (e.g., where to place the vertices of a triangle in 2D so the triangle appears to be in 3D). Primitive shading describes surface operations for a single primitive. Pixel shading colors each pixel based on a rendered primitive, in order to draw the primitive to the screen.

[0024] Shaders may be developed, or programmed, to handle virtually any aspect of a gaming experience. For example, a shader may be written to govern the reflection of light off of a character's skin, based on the color of the character, time, lighting, or other relevant variable. As previously mentioned, some shaders are relatively simple; whereas, others may require more complex computations. Simple shaders may, in some embodiments, require fewer memory registers to process than complex shaders.

[0025] A "rasterizer" is a component that takes an image made up of high-order primitives, such as lines, points, and triangles, and converts the image into a raster image (i.e., pixels) for output on a video display. A raster image is a bitmap representation of the primitives with color. In one embodiment, a rasterizer is software, executed by a GPU, that is configured to color pixels according to primitives produced by shaders.

[0026] Direct3D is an application program interface (API) provided within Windows® for rendering 2D and 3D scenes. Direct3D includes a primitive rasterizer with programmable stages that allows developers within the Windows® platform to load customized programs onto a GPU for rendering. Numerous versions of Direct3D are currently in use, and therefore should be understood by one of skill in the art.

[0027] A High Level Shading Language (HLSL) is a variant on a programming language (e.g., C, C++, C#, Java, or the like) designed for developing shaders. Specifically, developers program shaders in HLSL. In operation, HLSL is compiled into an intermediary language (IL) for use with a graphics application.

[0028] An IL is a low-level, instruction-based, binary representation of the operations a shader stage should perform. It acts as an intermediate optimization step between the compiler and the native instruction set of the graphics hardware. Thus, the IL translates the instructions designated by a developer in an HLSL into the byte code needed by the graphics hardware (i.e., the GPU) in order to render graphics.

[0029] Just-in-time (JIT) refers to a fast-process compilation performed on the shader IL to convert the shader into the native instruction set of the graphics hardware. JIT simply refers to the point in time when actions occur in programming and processing. One skilled in the art will understand how JIT compilation works in other programming languages, such as Python and C#.

[0030] In one embodiment, the present invention takes the form of a computer-program product that includes computer-useable instructions embodied on one or more computer-readable media. Computer-readable media include both volatile and nonvolatile media as well as removable and nonremovable media. By way of example, and not limitation, computer-readable media comprise computer-storage media. Computer-storage media, or machine-readable media, include media implemented in any method or technology for storing information.

[0031] Examples of stored information include computer-useable instructions, data structures, program modules, and other data representations. Computer-storage media include, but are not limited to, random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory used independently from or in conjunction with different storage media, such as, for example, compact-disc read-only memory (CD-ROM), digital versatile discs (DVD), holographic media or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. These memory components can store data momentarily, temporarily, or permanently.

[0032] Having briefly described a general overview of the embodiments described herein, an exemplary computing device is described below. Referring initially to FIG. 1 in particular, an exemplary operating environment for implementing the present invention is shown and designated generally as computing device 100. Computing device 100 is but one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should computing device 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated. In one embodiment, computing device 100 is a conventional computer (e.g., a personal computer or laptop).

[0033] One embodiment of the invention may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program modules, being executed by a computer or other machine. Generally, program modules including routines, programs, objects, components, data structures, and the like refer to code that performs particular tasks or implements particular abstract data types. Embodiments described herein may be practiced in a variety of system configurations, including hand-held devices, consumer electronics, general-purpose computers, more specialty computing devices, etc. Embodiments described herein may also be practiced in distributed computing environments where tasks are performed by remote-processing devices that are linked through a communications network.

[0034] With continued reference to FIG. 1, computing device 100 includes a bus 110 that directly or indirectly couples the following devices: memory 112, one or more processors 114, one or more presentation components 116, input/output ports 118, input/output components 120, and an illustrative power supply 122. Bus 110 represents what may be one or more busses (such as an address bus, data bus, or combination thereof). Although the various blocks of FIG. 1 are shown with lines for the sake of clarity, in reality, delineating various components is not so clear, and metaphorically, the lines would more accurately be gray and fuzzy. For example, one may consider a presentation component such as a display device to be an I/O component. Also, processors have memory. It will be understood by those skilled in the art that such is the nature of the art, and, as previously mentioned, the diagram of FIG. 1 is merely illustrative of an exemplary computing device that can be used in connection with one or more embodiments of the present invention. Distinction is not made between such categories as "workstation," "server," "laptop," "hand-held device," etc., as all are contemplated within the scope of FIG. 1 and reference to "computing device."

[0035] Computing device 100 typically includes a variety of computer-readable media. By way of example, and not limitation, computer-readable media may comprise RAM; ROM; EEPROM; flash memory or other memory technologies; CD-ROM, DVD or other optical or holographic media; magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices; or similar tangible media that are configurable to store data and/or instructions relevant to the embodiments described herein.

[0036] Memory 112 includes computer-storage media in the form of volatile and/or nonvolatile memory. The memory may be removable, nonremovable, or a combination thereof. Exemplary hardware devices include solid-state memory, hard drives, cache, optical-disc drives, etc. Computing device 100 includes one or more processors that read data from various entities such as memory 112 or I/O components 120. Presentation component(s) 116 present data indications to a user or other device. Exemplary presentation components include a display device, speaker, printing component, vibrating component, etc.

[0037] I/O ports 118 allow computing device 100 to be logically coupled to other devices including I/O components 120, some of which may be built in. Illustrative components include a microphone, joystick, game pad, satellite dish, scanner, printer, wireless device, etc.

[0038] Computing device 100 also includes a GPU 124 capable of simultaneously processing multiple shaders in parallel threads. To do so, the GPU 124 may be equipped with various device drivers and, in actuality, comprise multiple processors.

[0039] FIG. 2 is a diagram illustrating the allocation of memory registers of a computing device in accordance with an embodiment of the invention. As shown in FIG. 2, a plethora of memory registers 200 are available on the computing device. The same block of registers 200 is presented side-by-side to illustrate how the memory is allocated for both simple shader instances 202, 204, 206, and 208 and complex shader instances 210 and 212. Instances may include either programmatic representations of the shaders or bytecode renditions of the shaders. Moreover, the allocation of registers 200 may be performed by a GPU or CPU on the computing device.

[0040] Simple shader instances 202, 204, 206, and 208 are allocated to available blocks 214, 216, 218, and 220. Complex shader instances 210 and 212 are allocated to available blocks 222 and 224. Available blocks allocated to simple shaders (i.e., blocks 214, 216, 218, and 220) contain fewer registers than the available blocks allocated to complex shaders (i.e., blocks 222 and 224).
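The side-by-side layout of FIG. 2 can be sketched in Python by dividing the same register file into blocks of two different sizes. The register file size of 64 and the block sizes of 16 and 32 are hypothetical values, chosen only so that the result mirrors the four simple and two complex instances in the figure.

```python
from typing import List, Tuple

def allocate_blocks(
    total_registers: int, registers_per_instance: int
) -> List[Tuple[int, int]]:
    """Split a register file into equal (start, end) blocks, end exclusive."""
    if registers_per_instance <= 0:
        raise ValueError("a shader instance must use at least one register")
    count = total_registers // registers_per_instance
    return [
        (i * registers_per_instance, (i + 1) * registers_per_instance)
        for i in range(count)
    ]

# Hypothetical sizes: the same 64 registers hold four simple or two complex instances.
TOTAL_REGISTERS = 64
simple_blocks = allocate_blocks(TOTAL_REGISTERS, 16)
complex_blocks = allocate_blocks(TOTAL_REGISTERS, 32)
```

The number of blocks is also the number of instances that can run in parallel, so halving the registers per instance doubles the available parallelism in this simplified model.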

[0041] In one embodiment, the blocks of registers are allocated in code by designating two separate shaders for scenarios (simple and complex) and loading the appropriate scenario whenever necessary. To do so, pointers may be set for a given shader.

[0042] FIG. 3 is a flow chart for allocating registers based on shader complexity in accordance with an embodiment of the invention. Initially, interfaces are declared in HLSL defining a template from which multiple shader classes can be instantiated, as indicated at 302. A variable for inlining a shader implementation is also defined in a shader program, as indicated at 304. In one embodiment, a shader implementation is a routine or sub-routine. Actual instances are designated, as indicated at 306, by pointing to a table storing data about the shader implementation. Method calls in the shader program are then replaced with the actual class instances (routines or subroutines), as indicated at 308. To illustrate the aforesaid steps, code that can integrate with Direct3D is presented below and discussed in detail.

[0043] Direct3D contains a number of discrete shader stages, each meant for a separate purpose in the rendering pipeline. These six stages create a rendering pipeline where the developer writes code to control the operation of each shader stage. To target these stages, the developer uses an HLSL and the associated HLSL compiler, which converts HLSL code into optimized shader byte code. As previously mentioned, byte code is a low-level representation of compiled HLSL code for use by a graphics device driver and the graphics hardware.

[0044] In one embodiment, HLSL contains object-oriented constructs that allow for the grouping of functions and independent resources like variables and textures into classes. In this paradigm, interfaces can be declared to define a template from which multiple classes can be instantiated. When defined in this way, the classes that inherit from an interface define the implementations that are to be linked using dynamic linkage. In order to define a point in a shader program into which one of the implementations can be inserted, the developer creates an interface variable, and the defined methods of that interface can be used without reference to the possible class instances. In one embodiment, a single shader implementation is selected and built into the shader. In an alternative embodiment, a particular point in the HLSL code where an interface is used defines a place where all implementations are inlined, and all implementation bodies are inserted into the shader. Then, when the shader is actually running, long after compilation, a particular implementation is chosen to execute.

[0045] The following code shows how an actual class instance can be selected and method calls replaced with shader code.

 1  interface Light
 2  {
 3      float3 Calculate(float3 Position, float3 Normal);
 4  }
 5
 6  class AmbientLight : Light
 7  {
 8      float3 Calculate(float3 Position, float3 Normal)
 9      {
10          return AmbientValue;
11      }
12
13      float3 AmbientValue;
14  }
15
16  class DirectionalLight : Light
17  {
18      float3 Calculate(float3 Position, float3 Normal)
19      {
20          float3 LightDir = normalize(Position - LightPosition);
21          float LightContrib = saturate( dot( Normal, -LightDir) );
22          return LightColor * LightContrib;
23      }
24
25      float3 LightPosition;
26      float3 LightColor;
27  }
28
29  AmbientLight MyAmbient;
30  DirectionalLight MyDirectional;
31
32  float4 main(Light MyInstance, float3 CurPos : CurPosition,
33              float3 Normal : Normal) : SV_Target
34  {
35      float4 Ret;
36      Ret.xyz = MyInstance.Calculate(CurPos, Normal);
37      Ret.w = 1.0;
38
39      return Ret;
40  }

The above example is written in HLSL, and the actual representation of bytecode is in binary. An explanation of the above code is presented in the following paragraphs.

[0046] Lines 1-4 define an interface called Light, which is the parent interface of classes defined in the example. In line 3, a prototype for the Calculate method is defined. Calculate must be implemented by any subclass of Light. Lines 6-14 define AmbientLight, which is a simple implementation of the Light interface (i.e., a simple shader definition). Lines 18-23 show the DirectionalLight implementation of the Calculate method, with a signature that is identical to Light::Calculate but with code more complex than AmbientLight::Calculate (i.e., a complex shader definition).

[0047] Lines 25-26 depict local class variables needed for the operation of DirectionalLight. Lines 29-30 show class instance definitions for two variables: MyAmbient (a simple shader definition) and MyDirectional (a complex shader definition). These variables act as binding points to a rendering pipeline and identify the possible implementations that can be selected for use in place of the MyInstance variable described below.

[0048] Lines 32-40 show the main shader portion of the program. The first argument is a generic interface variable of type Light. At the point that Light is used, a special invocation instruction is inserted (in one embodiment). As a result, all implementation bodies will appear in the shader bytecode. Tables may then link the invocation to the bodies it might execute. The remaining parameters are standard rendering pipeline variables used in a standard Direct3D shader.
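For readers more familiar with general-purpose languages, the structure of the HLSL listing can be mirrored in Python. This is only an analogy under stated assumptions: the helper functions and the sample vectors are hypothetical, and Python resolves the method call dynamically at run time, whereas the embodiments describe inserting implementation bodies into the shader bytecode.

```python
import math
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]

def normalize(v: Vec3) -> Vec3:
    length = math.sqrt(sum(c * c for c in v))
    return (v[0] / length, v[1] / length, v[2] / length)

def dot(a: Vec3, b: Vec3) -> float:
    return sum(x * y for x, y in zip(a, b))

def saturate(x: float) -> float:
    return max(0.0, min(1.0, x))  # clamp to [0, 1], like the HLSL intrinsic

class Light(ABC):
    """Counterpart of the Light interface (listing lines 1-4)."""
    @abstractmethod
    def calculate(self, position: Vec3, normal: Vec3) -> Vec3: ...

@dataclass
class AmbientLight(Light):
    ambient_value: Vec3
    def calculate(self, position: Vec3, normal: Vec3) -> Vec3:
        return self.ambient_value

@dataclass
class DirectionalLight(Light):
    light_position: Vec3
    light_color: Vec3
    def calculate(self, position: Vec3, normal: Vec3) -> Vec3:
        delta = (position[0] - self.light_position[0],
                 position[1] - self.light_position[1],
                 position[2] - self.light_position[2])
        light_dir = normalize(delta)
        reversed_dir = (-light_dir[0], -light_dir[1], -light_dir[2])
        contrib = saturate(dot(normal, reversed_dir))
        r, g, b = self.light_color
        return (r * contrib, g * contrib, b * contrib)

def main(my_instance: Light, cur_pos: Vec3, normal: Vec3) -> Tuple[float, float, float, float]:
    """Counterpart of the main shader: the caller decides which Light is bound."""
    r, g, b = my_instance.calculate(cur_pos, normal)
    return (r, g, b, 1.0)

# Hypothetical sample inputs: a surface at the origin facing up, lit from above.
up: Vec3 = (0.0, 1.0, 0.0)
origin: Vec3 = (0.0, 0.0, 0.0)
ambient_pixel = main(AmbientLight((0.1, 0.1, 0.1)), origin, up)
directional_pixel = main(DirectionalLight((0.0, 2.0, 0.0), (1.0, 0.5, 0.25)), origin, up)
```

The main function is written once against the interface, and the choice between the simple and the complex implementation is made by whoever supplies the instance, which parallels the role of the MyInstance argument in the listing.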

[0049] In operation, the above code may be written in HLSL code and sent to an HLSL compiler for conversion into bytecode. The bytecode will, in turn, be provided to a driver on the GPU to be set as the program for the shader stage described in the HLSL code. In previous versions of Direct3D, this bytecode consisted of a low-level description of the inputs, outputs, and dependent resources needed by the shader and assembly-style instructions that define the operation of the shader stage. With respect to embodiments described herein, the Direct3D bytecode further includes sub-routines that define inputs, outputs, and operational instructions for the sub-routine, and tables that define the usage points of abstract interface methods and which of the method definitions can be inlined into various points in the shader. It is important to note that the example below is written in semi-readable text called disassembly, and the actual representation of bytecode is in binary.

[0050] The bytecode is meant to represent a highly optimized, expressive definition of the state and expected execution of the shader stage. In versions of Direct3D prior to Direct3D 10, this bytecode matched the exact instructions that could be executed on the graphics hardware. Because of divergent architectures in Direct3D 10 (e.g., pipeline emulation on the CPU), the bytecode was revised to instead provide an intermediate representation, the IL. In one embodiment, device drivers for a GPU are provided the IL and convert the code to the proper native instructions in a JIT process. In another embodiment, the IL is designed in such a way that the JIT operation can be performed with minimal need for optimization or reformatting. Additionally, separate class instances may have the JIT operation performed on them at the creation of the shader definition, rather than when linkage occurs, in order to assure that at link time the linkage can be performed as a trivial inline. The aforesaid can be seen in the exemplary code below. It should be noted that the code presented below is not meant to limit embodiments of the present invention. Other code may alternatively be used.

1  dcl_func_table ft0 { fb0 }
2
3  dcl_func_table ft1 { fb1 }
4
5  dcl_func_ptr fp0[1][1] = { ft0, ft1 };
6
7  fb0: in:(const iv0.xyzw,
8           const iv1.xyzw,
9           const iv2.xyzw),
10      out:(oo0.xyzw)
11   mov oo0, cb[iv0.x][iv0.y]
12   ret
13
14 fb1: in:(const iv0.xyzw,
15          const iv1.xyzw,
16          const iv2.xyzw),
17      out:(oo0.xyzw)
18   add, iv1.xyzx, -cb[iv0.x][iv0.y].xyzx
19   dp3 r0.w, r0.xyzx, r0.xyzx
20   rsq r0.w, r0.w
21   mul, r0.xyzx, r0.wwww
22   dp3_sat r0.x, iv2.xyzx, -r0.xyzx
23   mul,, cb[iv0.x][iv0.y+1].xyz
24   ret
25
26 main:
27   fcall fp0[0][0], in:(cb14[0], v0, v1), out:(o0)
28   ret

[0051] Line 1 depicts a class instance table for AmbientLight and lists all function implementations for a specific class instance. In the above code, there is only one function, Calculate, which is called once in the main shader code. Therefore, only one implementation exists, fb0. Additional methods, functions, routines, or calls to existing methods in the class AmbientLight could be referenced as well. Furthermore, line 3 shows a class instance table for the variable DirectionalLight discussed in reference to the previous code.

[0052] The interface table used to dispatch via MyInstance is on line 5. The first array dimension indicates if the interface variable is an array. In this case there is only one element, so the dimension is given a value of 1. The second array dimension is the number of call sites for the interface. In this case there is only one call site, for the Calculate method, so the dimension is one. Finally, the list in braces is the list of class instance tables that can be used by this interface variable. Since both ft0 (AmbientLight) and ft1 (DirectionalLight) inherit from Light, these are the two tables that are listed.

[0053] Lines 7-12 show IL for AmbientLight's implementation of Calculate, optimized for the call site at line 27. If there were additional call sites that used the Calculate function for the AmbientLight class, there would be multiple blocks like this one, each optimized for its specific call site. Note that registers labeled as "iv" and "ov" are used instead of standard HLSL registers like x, s, or cb. This means that the call site enumerates the registers, so a substitution of registers needs to occur as part of the inlining process. If multiple call sites emit the same set of instructions, the redundant blocks can be removed and the various call sites will point to a single block. Additionally, lines 14-24 show IL for DirectionalLight's implementation of Calculate, optimized for the call site at line 27.
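The removal of redundant blocks can be pictured with a minimal C++ sketch. It assumes that each call site's block is available as a string of disassembly text and that two blocks are redundant when their text is identical; the names DedupResult and deduplicateBlocks are hypothetical and do not come from Direct3D or from the code above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical result type: the surviving blocks plus, for every call site,
// the index of the block that call site should point to.
struct DedupResult {
    std::vector<std::string> uniqueBlocks;
    std::vector<std::size_t> blockForCallSite;
};

// Each input string stands in for the instructions emitted for one call site.
// Assumption: blocks are redundant exactly when their text is identical.
DedupResult deduplicateBlocks(const std::vector<std::string>& blockPerCallSite) {
    DedupResult result;
    std::unordered_map<std::string, std::size_t> firstIndex;
    for (const std::string& text : blockPerCallSite) {
        // emplace inserts only if this instruction sequence has not been seen yet.
        auto entry = firstIndex.emplace(text, result.uniqueBlocks.size());
        if (entry.second) {
            result.uniqueBlocks.push_back(text);
        }
        result.blockForCallSite.push_back(entry.first->second);
    }
    // Every call site ends up with exactly one block index.
    assert(result.blockForCallSite.size() == blockPerCallSite.size());
    return result;
}
```

With three call sites of which the first two emit the same instructions, the sketch keeps two blocks and points the first two call sites at the same one.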

[0054] The main shader code block is presented in lines 26-28. In line 27, an fcall instruction indicates an array element and defines the call site for the Calculate routine for the variable MyInstance. The first parameter indicates the interface table that is to be used (fp0). The first bracketed index defines the method index. In this case, there is only one call site, so the index is zero. The second bracketed index defines the index of the call site being executed. In this case there is only one invocation of the Calculate method, so this index is zero. "in" and "out" indicate the registers that are utilized by this call. The first "in" parameter always refers to the place in constant memory where the class instance variables are stored, in this case cb14, element zero.
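The relationship between the class instance tables, the interface table, and the fcall indices can be sketched in C++ as follows. This is only an illustrative model of the disassembly, not the driver's data layout: the types FunctionBody, ClassInstanceTable, and InterfaceTable and the function resolveFcall are hypothetical names, and the class instance chosen at bind time is passed in directly as an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a function body such as fb0 or fb1.
struct FunctionBody {
    std::string label;
};

// Models "dcl_func_table ft0 { fb0 }": one body per call site.
struct ClassInstanceTable {
    std::string label;
    std::vector<const FunctionBody*> bodyPerCallSite;
};

// Models "dcl_func_ptr fp0[1][1] = { ft0, ft1 }".
struct InterfaceTable {
    std::size_t elementCount;   // first dimension: array elements of the interface variable
    std::size_t callSiteCount;  // second dimension: call sites for the interface
    std::vector<const ClassInstanceTable*> candidates;
};

// Resolves "fcall fp[element][callSite]" for the class instance that was bound.
// Returns nullptr when an index is out of range or the instance is not a candidate.
const FunctionBody* resolveFcall(const InterfaceTable& fp,
                                 std::size_t element,
                                 std::size_t callSite,
                                 const ClassInstanceTable& bound) {
    if (element >= fp.elementCount || callSite >= fp.callSiteCount) {
        return nullptr;
    }
    for (const ClassInstanceTable* candidate : fp.candidates) {
        if (candidate == &bound) {
            // A well-formed table carries one body for every call site.
            assert(bound.bodyPerCallSite.size() == fp.callSiteCount);
            return bound.bodyPerCallSite[callSite];
        }
    }
    return nullptr;
}
```

Under this model, binding the table for AmbientLight makes fp0[0][0] resolve to fb0, and binding the table for DirectionalLight makes the same call site resolve to fb1, which is the selection the fcall instruction leaves open.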

[0055] In one embodiment, the fcall instruction refers to the method to be called by providing an index, but does not define the exact implementation to call. When generating the IL and then later the native hardware instructions for program execution, code is emitted up to the fcall instruction, and the current state of the registers and other shader states is partially cached and restored around implementation generation. The code for the first implementation is generated starting with the current state of register allocation, scratch registers, etc. Once this generation step is complete, the state is restored to the cached state, and the generation is repeated for the next possible implementation. This cycle repeats until all implementations have code generated. Finally, the current state is restored to the cached state, the impacts of the outputs of the fcall are applied to the current state, and code generation continues after the fcall. The resulting methods generated are defined in the IL and referenced in class instance tables, which match the structure of the interface definitions and have references to each compiled function version created.
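The cache-and-restore cycle around an fcall can be sketched in C++ under a strong simplifying assumption: the whole compiler state is reduced to a single scratch-register counter. The names CodegenState, Implementation, GeneratedBody, and generateAtFcall are hypothetical and serve only to illustrate the order of the steps.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Assumption: compiler state is modeled as one counter of allocated scratch registers.
struct CodegenState {
    int nextScratchRegister;
};

// One possible implementation of the interface method (hypothetical description).
struct Implementation {
    std::string name;
    int scratchRegistersNeeded;
};

// Records where each generated body started allocating registers.
struct GeneratedBody {
    std::string name;
    int firstScratchRegister;
    int scratchRegisterCount;
};

std::vector<GeneratedBody> generateAtFcall(CodegenState& state,
                                           const std::vector<Implementation>& implementations,
                                           int fcallOutputRegisters) {
    assert(fcallOutputRegisters >= 0);
    const CodegenState cached = state;  // cache the state reached just before the fcall
    std::vector<GeneratedBody> bodies;
    for (const Implementation& impl : implementations) {
        // Generate this implementation starting from the current (cached) state.
        GeneratedBody body{impl.name, state.nextScratchRegister, impl.scratchRegistersNeeded};
        state.nextScratchRegister += impl.scratchRegistersNeeded;
        bodies.push_back(body);
        state = cached;  // restore before generating the next implementation
    }
    state = cached;                                     // final restore
    state.nextScratchRegister += fcallOutputRegisters;  // apply the impact of the fcall outputs
    return bodies;
}
```

Because the state is restored after every implementation, each body starts from the same register allocation, so whichever body is later selected can be dropped into the call site without disturbing the code generated after the fcall.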

[0056] To implement the above in a C API, minimal changes are made to Direct3D. In one embodiment, a new API is added for referencing the class instances provided by a shader. Additionally, another API is created to reference a class instance.

[0057] In operation, when this shader object is bound to the pipeline, the application has the opportunity to provide a listing of the specific class instances it wishes to utilize for the available bind points in the shader. To do so, a list of objects is obtained by providing the names of the HLSL class instances for the shader to the API that references class instances.

[0058] The class-referencing API may only allow interaction in a batched mechanism, meaning that the application can only change the state of the rendering pipeline between sets of draw calls rather than more granularly, like between the rendering of pixels. Yet, some embodiments may change states in between sets of primitives, making the class instances only changeable between batches of primitives. This can be seen in the following code.

1  pDevice->CreatePixelShader(pShaderCode, pMyClassLibrary, &pMyPS);
2
3  pMyDirectionalLight = pMyClassLibrary->GetClassInstance("MyDirectional");
4  pMyAmbientLight = pMyClassLibrary->GetClassInstance("MyAmbient");
5
6  while (true)
7  {
8      if (DirectionalLighting)
9          pDevice->PSSetShader(pMyPS, &pMyDirectionalLight, 1);
10     else
11         pDevice->PSSetShader(pMyPS, &pMyAmbientLight, 1);
12
13     RenderScene();
14 }

[0059] Line 1 illustrates a routine, CreatePixelShader, that is provided a string parameter that contains the compiled shader bytecode in pShaderCode, a pointer to an API that references class instances (pMyClassLibrary), and a pointer to a pointer to a pixel shader (pMyPS). The routine examines the bytecode and populates pMyClassLibrary with information on what interfaces and class instances are available in the shader. Additionally, the bytecode is also provided to a device driver of a GPU, which performs a JIT conversion of the code to a native representation and stores the converted code internally. Finally, a reference to the shader is returned in pMyPS for later API use.

[0060] In line 3, a pointer (pMyDirectionalLight) to the API that references class instances is set to reference the MyDirectional class instance in the shader. In line 4, a pointer (pMyAmbientLight) to the API that references class instances is set to reference the MyAmbient class instance in the shader. Lines 6-14 depict a loop that will render the scene repeatedly until the program exits.

[0061] Line 8 shows code for selecting which class instance to use based on the global input DirectionalLighting. Based on the selection made in line 8, a call is made with the shader object in pMyPS along with one of the two possible class instances that can be applied to the HLSL variable MyInstance. The final argument indicates the length of the array provided in the second argument, as there might be more than one interface to resolve in any one shader. Finally, a call is shown in line 13 to a function for rendering the geometry of a scene.
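The binding flow described in the preceding paragraphs can be exercised without a GPU by using a small self-contained C++ mock. Everything here is an assumption made for illustration: ClassInstance, ClassLibrary, PixelShader, and Device are hypothetical stand-ins that mirror the simplified signatures in the listing above, not the actual Direct3D interfaces, and the check that the instance count matches the number of interfaces in the shader is one plausible validation rule rather than documented behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical handle for a named HLSL class instance such as "MyAmbient".
struct ClassInstance {
    std::string name;
};

// Hypothetical stand-in for the API that references class instances.
class ClassLibrary {
public:
    explicit ClassLibrary(const std::vector<std::string>& names) {
        for (const std::string& n : names) {
            instances_.push_back(ClassInstance{n});
        }
    }
    // Returns nullptr when the shader exposes no class instance with this name.
    const ClassInstance* GetClassInstance(const std::string& name) const {
        for (const ClassInstance& instance : instances_) {
            if (instance.name == name) {
                return &instance;
            }
        }
        return nullptr;
    }
private:
    std::vector<ClassInstance> instances_;
};

// Hypothetical shader object: only records how many interfaces must be resolved.
struct PixelShader {
    std::size_t interfaceCount;
};

class Device {
public:
    // Binds the shader together with one class instance per interface.
    // The previous binding is left untouched when the arguments are rejected.
    bool PSSetShader(const PixelShader* shader,
                     const ClassInstance* const* instances,
                     std::size_t count) {
        assert(count == 0 || instances != nullptr);
        if (shader == nullptr || count != shader->interfaceCount) {
            return false;
        }
        std::vector<const ClassInstance*> pending;
        for (std::size_t i = 0; i < count; ++i) {
            if (instances[i] == nullptr) {
                return false;
            }
            pending.push_back(instances[i]);
        }
        shader_ = shader;
        bound_ = pending;
        return true;
    }
    const std::vector<const ClassInstance*>& BoundInstances() const { return bound_; }
private:
    const PixelShader* shader_ = nullptr;
    std::vector<const ClassInstance*> bound_;
};
```

In this mock, switching between the two lighting models amounts to calling PSSetShader again with a different class instance pointer between batches of draw calls; the shader object itself is created only once.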

[0062] Although the subject matter has been described in language specific to structural features and methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. For example, sampling rates and sampling periods other than those described herein may also be captured by the breadth of the claims.


1. One or more computer-readable media having computer-executable instructions embodied thereon for allocating memory registers to shader instances, the method comprising: declaring one or more interfaces to define a shader template, wherein one or more classes of a shader depend from the one or more interfaces (302); defining a variable in a shader program for inlining an actual shader instance (304); and replacing a call in the shader program with the actual shader instance (308).
2. The media of claim 1, further comprising allocating memory for the shader instance.
3. The media of claim 2, wherein a number of registers associated with the memory and allocated for the shader instance depends on a level of complexity for the shader instance.
4. The media of claim 1, further comprising: translating the shader instance into bytecode; and rasterizing pixels on a computing device based on the bytecode.
5. The media of claim 4, wherein the shader instance is compiled into bytecode by a high level shading language (HLSL) compiler.
6. The media of claim 1, wherein the shader instance is programmed in a high level shading language (HLSL).
7. The media of claim 1, wherein the shader instance is used to determine an operation in a three-dimensional (3D) graphic.
8. The media of claim 1, further comprising: providing compiled shader bytecode; providing a pointer to an application program interface; providing a pointer to a pixel-shading shader; providing the shader bytecode to a device driver for a graphics processing unit (GPU); and returning a reference to the shader instance.
9. The media of claim 8, further comprising: referencing the one or more classes of the shader; storing the one or more classes in a memory location; and designating a pointer to the memory location.
10. The media of claim 1, further comprising binding the shader instance to a pipeline that is executed by a graphics processing unit (GPU).
11. The media of claim 10, wherein the pipeline simultaneously executes numerous shader instances in parallel.
12. The media of claim 1, wherein replacing a call in the shader program further comprises: receiving one of two or more shader instances of the shader; and replacing the call in the shader program with the one of two or more shader instances of the shader.
13. A method for processing one or more shaders on a computing device, comprising: declaring one or more interfaces to define a shader template, wherein one or more classes of a shader depend from the one or more interfaces (302); defining a variable in a shader program for inlining an actual shader instance (304); and replacing a call in the shader program with the actual shader instance (308).
14. The media of claim 13, further comprising: providing compiled shader bytecode; providing a pointer to an application program interface; providing a pointer to a pixel-shading shader; providing the shader bytecode to a device driver for a graphics processing unit (GPU); and returning a reference to the shader instance.
15. The media of claim 13, determining a number of registers to allocate to the shader instance based on a complexity associated with the shader instance.
16. The media of claim 13, wherein the shader instance is used to determine an operation in a three-dimensional (3D) graphic.
17. A computing device configured to render likenesses of three-dimensional graphics in two-dimensions, comprising: a memory unit with one or more memory registers (112); a graphics processing unit (GPU) configured to allocate the one or more memory registers associated with the shader based on a previously defined shader class (124).
18. The computing device of claim 17, wherein the GPU is further configured to: declare one or more interfaces to define a shader template, wherein one or more classes of a shader depend from the one or more interfaces; define a variable in a shader program for inlining an actual shader instance; and replace a call in the shader program with the actual shader instance.
19. The computing device of claim 17, wherein the GPU executes a rasterizer for coloring pixels based on the actual shader instance.
20. The computing device of claim 18, wherein the actual shader instance determines the position of one or more primitives.
PCT/US2009/048960 2008-06-27 2009-06-26 Shader interfaces WO2009158679A8 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US12/163,734 2008-06-27
US12163734 US8581912B2 (en) 2008-06-27 2008-06-27 Dynamic subroutine linkage optimizing shader performance

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN 200980124880 CN102077251B (en) 2008-06-27 2009-06-26 Shader interfaces
EP20090771210 EP2289050A4 (en) 2008-06-27 2009-06-26 Shader interfaces

Publications (3)

Publication Number Publication Date
WO2009158679A2 (en) 2009-12-30
WO2009158679A3 (en) 2010-05-06
WO2009158679A8 (en) 2010-11-18



Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2009/048960 WO2009158679A8 (en) 2008-06-27 2009-06-26 Shader interfaces

Country Status (4)

Country Link
US (3) US8581912B2 (en)
CN (1) CN102077251B (en)
EP (1) EP2289050A4 (en)
WO (1) WO2009158679A8 (en)

Also Published As

Publication number Publication date Type
US8581912B2 (en) 2013-11-12 grant
US20170039754A1 (en) 2017-02-09 application
US20140063029A1 (en) 2014-03-06 application
CN102077251A (en) 2011-05-25 application
EP2289050A4 (en) 2012-01-11 application
EP2289050A2 (en) 2011-03-02 application
US9824484B2 (en) 2017-11-21 grant
US9390542B2 (en) 2016-07-12 grant
CN102077251B (en) 2014-01-08 grant
US20090322751A1 (en) 2009-12-31 application
WO2009158679A3 (en) 2010-05-06 application
WO2009158679A8 (en) 2010-11-18 application

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09771210

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase in:

Ref country code: DE