US11748933B2 - Method for performing shader occupancy for small primitives - Google Patents
- Publication number
- US11748933B2 US17/168,168
- Authority
- US
- United States
- Prior art keywords
- quad
- shader
- partially covered
- warp
- primitive
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/50—Lighting effects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/005—General purpose rendering architectures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/50—Lighting effects
- G06T15/80—Shading
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2200/00—Indexing scheme for image data processing or generation, in general
- G06T2200/28—Indexing scheme for image data processing or generation, in general involving image processing hardware
Definitions
- the present disclosure relates to graphics processing units (GPUs), and more particularly, to a method for performing shader occupancy for small primitives.
- Modern GPUs include a programmable, highly parallel, set of computation engines, and a collection of various fixed-function units.
- the fixed-function units may include a texture address generation and filtering unit, a primitive clipping unit, a culling unit, a viewport transforming unit, a binning unit, a rasterization setup and rasterization unit, a depth comparison unit, a blending unit, and/or other units.
- GPUs may be used for graphics-intensive operations and/or compute-intensive workloads.
- Graphics data may flow through a GPU in a pipeline fashion, performing steps outlined in one or more Application Programming Interfaces (APIs), such as OpenGL-ES, Vulkan, DirectX, or the like.
- the GPUs may conform to the standards specified, which may be directed to texture coordinates and texture address generation. More specifically, during a pixel shading stage in the pipeline, a shader program may make texture requests and receive filtered texture data.
- a directional derivative calculation may be performed in each of the X and Y dimensions to determine the minification or magnification of the texture being accessed with respect to the pixel (or sample) spacing of the coverage.
- sample and the term “pixel” may be used interchangeably insomuch as it is understood that the same operations are performed at either the pixel level or the sub-pixel sample level.
- Calculating a directional derivative may use at least two data values in each of the two dimensions.
- pixel shaders may operate on 2×2 quads (i.e., blocks of four pixels) as their minimum quantum of work.
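The reason a 2×2 quad is the minimum unit of work can be illustrated with a minimal sketch of finite-difference derivatives over a quad. The function name `quad_derivatives` and the row-major quad layout are assumptions made for illustration only; they do not come from the disclosure.

```python
def quad_derivatives(quad):
    """Finite-difference derivatives over one 2x2 quad.

    `quad` is [[top_left, top_right], [bottom_left, bottom_right]], holding
    one scalar value (for example a texture coordinate) per pixel.
    The layout is an assumption made for this sketch.
    """
    (tl, tr), (bl, br) = quad
    # Horizontal differences: one per row, right pixel minus left pixel.
    ddx = (tr - tl, br - bl)
    # Vertical differences: one per column, bottom pixel minus top pixel.
    ddy = (bl - tl, br - tr)
    return ddx, ddy


if __name__ == "__main__":
    print(quad_derivatives([[0.0, 0.5], [0.25, 0.75]]))
```

Each difference needs two values along its axis, which is why a lone pixel cannot produce a derivative without a neighbor or equivalent extra data.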
- An input primitive may be a projection of a three-dimensional (3D) primitive onto a two-dimensional (2D) image-space, and rasterized to determine pixel coverage.
- a primitive may be a triangle defined by a triplet of (x,y) coordinate pairs. Regardless of the actual coverage formed by a given input primitive, work supplied to a parallel processor shader subsystem may be a collection of these 2×2 quads, which may result in a large inefficiency if many of the quads are only partially filled (i.e., partially covered).
- One approach for reducing this inefficiency may involve recognizing cases of partial coverage, and transferring the coverage from one adjacent primitive to the quad of another. While this approach may reduce the total number of quads sent to the shader, and thus may help to reduce total energy consumption, such an approach comes at the expense of losing some image quality.
- the merging of quads may use certain heuristic thresholds applied and set to control its application, thereby attempting to avoid unwanted visual artifacts due to ascribing coverage from one primitive to an adjacent primitive, and as an approximation, using that adjacent primitive's attribute data. Nevertheless, such a quad merge approach remains lossy.
- Various embodiments of the disclosure include a GPU, comprising one or more shader cores and a shader warp packer unit.
- the shader warp packer unit may be configured to receive a first primitive associated with a first partially covered quad, and a second primitive associated with a second partially covered quad.
- the shader warp packer unit may be configured to determine that the first partially covered quad and the second partially covered quad have non-overlapping coverage.
- the shader warp packer unit may be configured to pack the first partially covered quad and the second partially covered quad into a packed quad.
- the shader warp packer unit may be configured to send the packed quad to the one or more shader cores.
- the first partially covered quad and the second partially covered quad are spatially disjoint from each other. The term disjoint may imply non-overlapping.
- the one or more shader cores may be configured to receive and process the packed quad with no loss of information relative to the one or more shader cores individually processing the first partially covered quad and the second partially covered quad.
- a method for performing shader occupancy for small primitives using a GPU may include receiving, by a shader warp packer unit, a first primitive associated with a first partially covered quad, and a second primitive associated with a second partially covered quad.
- the method may include determining, by the shader warp packer unit, that the first partially covered quad and the second partially covered quad have non-overlapping coverage.
- the method may include packing, by the shader warp packer unit, the first partially covered quad and the second partially covered quad into a packed quad.
- the method may include sending, by the shader warp packer unit, the packed quad to one or more shader cores.
- the first partially covered quad and the second partially covered quad are spatially disjoint from each other. The term disjoint may imply non-overlapping.
- the method may include receiving and processing, by the one or more shader cores, the packed quad with no loss of information relative to the one or more shader cores individually processing the first partially covered quad and the second partially covered quad.
- FIG. 1 A illustrates a block diagram of a GPU including a dynamic branching pixel shader warp packer unit in accordance with some embodiments.
- FIG. 1 B illustrates a GPU including the dynamic branching pixel shader warp packer unit of FIG. 1 A in accordance with some embodiments.
- FIG. 1 C illustrates a mobile personal computer including a GPU having the dynamic branching pixel shader warp packer unit of FIG. 1 A in accordance with some embodiments.
- FIG. 1 D illustrates a tablet computer including a GPU having the dynamic branching pixel shader warp packer unit of FIG. 1 A in accordance with some embodiments.
- FIG. 1 E illustrates a smart phone including a GPU having the dynamic branching pixel shader warp packer unit of FIG. 1 A in accordance with some embodiments.
- FIG. 2 is a block diagram showing a 2×2 quad in accordance with some embodiments.
- FIG. 3 is a block diagram showing a primitive in a partially-covered 2×2 quad in accordance with some embodiments.
- FIG. 4 is a block diagram showing another primitive in another partially-covered 2×2 quad in accordance with some embodiments.
- FIG. 5 is a block diagram showing a packed 2×2 quad in accordance with some embodiments.
- FIG. 6 is a block diagram showing another packed 2×2 quad in accordance with some embodiments.
- FIG. 7 is a diagram associated with barycentric factor computation of neighboring pixels expressed relative to a pixel at (x,y).
- FIG. 8 is a flow diagram illustrating a technique for performing shader occupancy for small primitives in accordance with some embodiments.
- FIG. 9 is a flow diagram illustrating another technique for performing shader occupancy for small primitives in accordance with some embodiments.
- although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first pixel could be termed a second pixel, and, similarly, a second pixel could be termed a first pixel, without departing from the scope of the inventive concept.
- Some embodiments disclosed herein may include a technique for performing shader occupancy for relatively small primitives.
- additional information may be packaged along with each pixel in the quad, thereby allowing for calculations that may be needed for directional derivative calculations.
- the technique may include full packing of quads from potentially separate primitives, along with auxiliary information that can be used to produce information that would have otherwise been produced by missing “helper” pixels or “H” pixels in the 2×2 quads, thereby increasing processing efficiency.
- Some embodiments disclosed herein improve the efficiency of graphics-intensive operations within a GPU, which may involve the use of programmable units and/or fixed-function units within the GPU.
- Embodiments disclosed herein may not transfer coverage from one primitive to another, but instead, may provide mechanisms in which coverage from two or more primitives may exist in the same quad, without losing any precision.
- no associated heuristic thresholds are needed to maintain image quality. Additional information may be present, and calculations may occur, for the pixels within a quad that has coverage from more than one incoming primitive. Accordingly, small primitive processing efficiency may be improved.
- the use of small and micro-polygons may be increased, thereby resulting in higher geometric complexity and fidelity, such as when used with graphic-intensive gaming applications.
- FIG. 1 A illustrates a block diagram of a GPU 100 including a dynamic branching pixel shader warp packer unit 105 in accordance with some embodiments.
- FIG. 1 B illustrates a GPU 100 including the dynamic branching pixel shader warp packer unit 105 of FIG. 1 A in accordance with some embodiments.
- FIG. 1 C illustrates a mobile personal computer 180 a including a GPU 100 having the dynamic branching pixel shader warp packer unit 105 of FIG. 1 A in accordance with some embodiments.
- FIG. 1 D illustrates a tablet computer 180 b including a GPU 100 having the dynamic branching pixel shader warp packer unit 105 of FIG. 1 A in accordance with some embodiments.
- FIG. 1 E illustrates a smart phone 180 c including a GPU 100 having the dynamic branching pixel shader warp packer unit 105 of FIG. 1 A in accordance with some embodiments. Reference is now made to FIGS. 1 A through 1 E .
- the dynamic branching pixel shader warp packer unit 105 may perform shader occupancy for relatively small primitives (e.g., 130, 135).
- 2×2 quads such as quad 115 and quad 120 may each be only partially filled (i.e., partially covered). It will be understood that while reference is generally made herein to 2×2 quads, other sized quads can be used without departing from the inventive concept described.
- the small primitive 130 only partially fills the 2×2 quad 115.
- the small primitive 135 only partially fills the 2×2 quad 120.
- the dynamic branching pixel shader warp packer unit 105 may pack two or more primitives (e.g., 130, 135) into a same 2×2 quad 140.
- additional attribute information may be packaged and/or stored along with each pixel (e.g., 150) in the 2×2 quad 140, thereby allowing for information that may be needed to calculate directional derivatives (e.g., 160, 165, 192).
- the technique may include full packing of 2×2 quads from potentially separate primitives (e.g., 130, 135), along with the attribute information (e.g., 145), which can be used to produce information that would have otherwise been produced by missing H pixels in the 2×2 quads, thereby increasing processing efficiency.
- Embodiments disclosed herein improve the efficiency of graphics-intensive operations within the GPU 100 , which may involve the use of programmable units and/or fixed-function units such as one or more shader core(s) 110 within the GPU 100 .
- Embodiments disclosed herein may not transfer coverage from one primitive (e.g., 130) to another (e.g., 135), but instead, may provide mechanisms in which coverage from two or more primitives (e.g., 130, 135) may exist in the same 2×2 quad 140, without losing any precision.
- Additional attribute information may be present, and calculations may occur, for the pixels (e.g., 150) within the 2×2 quad 140, which may have coverage from more than one incoming primitive (e.g., 130, 135). Accordingly, small primitive processing efficiency may be improved. Moreover, the use of small and micro-polygons may be increased, thereby resulting in higher geometric complexity and fidelity, such as when used with graphic-intensive gaming applications.
- the dynamic branching pixel shader warp packer unit 105 may merge the 2×2 quads (e.g., 115, 120) from two different primitives (e.g., 130, 135) and place them within the same 2×2 quad 140. This leads to better warp occupancy.
- the dynamic branching pixel shader warp packer unit 105 may include a hysteresis window 155 , which may collect non-overlapping quads (e.g., 115 , 120 ) from various primitives (e.g., 130 , 135 ).
- various primitives encountered may be opportunistically combined into a single same 2×2 quad 140 for processing in the one or more shader cores 110, as further described below.
- Primitives that are candidates for shared quad processing but that fall beyond the hysteresis window 155 may be processed using conventional techniques, but would not get the benefit of improved shader efficiency.
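One way to picture the hysteresis window is as a bounded queue of open quads that later arrivals may join. The sketch below is only an illustration under assumptions: the function name `pack_stream`, the 4-bit coverage masks, the first-fit policy, and the eviction rule are all hypothetical and are not taken from the disclosure.

```python
from collections import deque


def pack_stream(quads, window_size):
    """Greedy first-fit packing of partially covered quads (hypothetical policy).

    `quads` is an iterable of (primitive_id, coverage_mask) pairs, where
    coverage_mask is a 4-bit integer with one bit per pixel of a 2x2 quad.
    Returns a list of packed quads, each a list of the pairs that share it.
    """
    window = deque()  # open packed quads still waiting for partners
    emitted = []
    for prim, mask in quads:
        for packed in window:
            # Join the first open quad whose coverage does not overlap ours.
            if all((mask & other) == 0 for _, other in packed):
                packed.append((prim, mask))
                break
        else:
            # No partner inside the window: open a new quad.
            window.append([(prim, mask)])
            if len(window) > window_size:
                # The oldest quad falls outside the window and is sent as-is.
                emitted.append(window.popleft())
    emitted.extend(window)
    return emitted


if __name__ == "__main__":
    stream = [(130, 0b0011), (135, 0b1000), (615, 0b0011)]
    print(pack_stream(stream, window_size=2))
```

A larger window finds more partners at the cost of more buffering; quads evicted without a partner are simply processed the conventional way.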
- the non-overlapping coverage quality may not be required.
- the first partially covered quad 115 and the second partially covered quad 120 may have overlapping coverage, although at the cost of some additional buffering of data while pixels are processed in the one or more shader cores 110 .
- the dynamic branching shader warp packer unit 105 may receive one or more primitives (e.g., 130, 135), and determine whether at least two partially covered 2×2 quads (e.g., 115, 120) do not have overlapping coverage. According to embodiments disclosed herein, the at least two partially covered 2×2 quads (e.g., 115, 120) can be at a same location or a different location. Spatial proximity is not required.
- the dynamic branching shader warp packer unit 105 may pack the at least two partially covered 2×2 quads (e.g., 115, 120) into a packed 2×2 quad (e.g., 140).
- the dynamic branching shader warp packer unit 105 may send the packed 2×2 quad (e.g., 140) to the one or more shader cores 110 for processing.
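The receive, test, and pack steps above can be sketched with per-pixel coverage bits. This is a minimal illustration, assuming a 4-bit mask with bit 0 = upper-left, bit 1 = upper-right, bit 2 = lower-left, and bit 3 = lower-right; the function names and this bit order are assumptions, not details given in the disclosure.

```python
def non_overlapping(mask_a, mask_b):
    """True when two 4-bit coverage masks share no covered pixel."""
    return (mask_a & mask_b) == 0


def pack_quads(prim_a, mask_a, prim_b, mask_b):
    """Pack two partially covered quads into one list of per-pixel owners.

    Returns None when the coverage overlaps, in which case the quads would
    be sent separately in this sketch.
    """
    if not non_overlapping(mask_a, mask_b):
        return None
    owners = []
    for pixel in range(4):
        bit = 1 << pixel
        if mask_a & bit:
            owners.append(prim_a)
        elif mask_b & bit:
            owners.append(prim_b)
        else:
            owners.append(None)  # uncovered slot, free for a helper pixel
    return owners


if __name__ == "__main__":
    # Primitive 130 covers the two upper pixels, primitive 135 the lower-right.
    print(pack_quads(130, 0b0011, 135, 0b1000))
```

Because the masks are disjoint, every covered pixel keeps its own primitive's identity, which is what allows the packing to remain lossless.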
- a compiler may be modified to generate a code sequence 170 to support a dynamic branching hybrid mode.
- the dynamic branching shader warp packer unit 105 may compute directional derivatives (e.g., 160 , 165 , 192 ) based on the code sequence 170 using one or more lanes 125 .
- the one or more lanes 125 may be one or more computational threads, for example.
- the one or more lanes 125 may be processed, for example, by a multi-threaded and/or multi-core microprocessor.
- the compiler may generate the code sequence 170 to support cross-lane directional derivative computation (e.g., 160 ) and/or same-lane directional derivative computation (e.g., 165 ).
- H pixel data may be piled on to alive pixels.
- a compiler may generate code to compute the directional derivative within the same lane alongside the code for cross-lane computation.
- a directional derivative (e.g., 160, 165, 192) may be computed in a cross-lane operation or a same-lane operation. A compiler can generate code for both approaches.
- the particular mode of operation (i.e., same-lane or cross-lane) can be chosen at run time using one or more flags 114 , which may be generated and sent by the dynamic branching shader warp packer unit 105 as part of wave packing.
- the one or more flags 114 can be binary flags, for example.
- the dynamic branching shader warp packer unit 105 may pass to the one or more shader cores 110 the packed warp 191 including the coverage map 190 having coverage for each of the 2×2 quads (e.g., 140) in a warp.
- the warp may have a power-of-two width, such as 16, 32, 64, and so forth.
- the dynamic branching shader warp packer unit 105 may provide the one or more flags 114 to the one or more shader cores 110 .
- the one or more flags 114 may be used to control the code (e.g., provide for a dynamic branch) in the one or more shader cores 110 for a given pixel (e.g., 150), and thus collectively over four threads, for a given 2×2 quad (e.g., 140). Depending on the quad's coverage and the H pixels present (or not present), the one or more flags 114 may be used to control branches taken regarding the computation of the directional derivatives (e.g., 160, 165, 192).
- the compiler may generate the code sequence 170 for different scenarios of computing the directional derivatives (e.g., 160, 165, 192) in each pixel, or across pixels, and may select branches among these cases dynamically as directed by the additional attribute information 145, which may be passed along with the 2×2 quad (e.g., 140) from the dynamic branching shader warp packer unit 105.
- the additional attribute information 145 may be passed along with the 2×2 quad (e.g., 140) from the dynamic branching shader warp packer unit 105.
- the one or more shader cores (e.g., 110) and/or each of the lanes 125 can determine whether a horizontal directional derivative or a vertical directional derivative, or both, are computed in the same-lane (e.g., 165) or cross-lane (e.g., 160). This determination can be based on whether horizontal or vertical pixel neighbors in the 2×2 quad 140 are present, and/or based on information available to the dynamic branching pixel shader warp packer unit 105.
- the dynamic branching pixel shader warp packer unit 105 may generate two bitmasks (e.g., 175 , 180 ), one for each direction.
- the dynamic branching pixel shader warp packer unit 105 may generate a bitmask 175 for the vertical direction, and a bitmask 180 for the horizontal direction.
- the dynamic branching pixel shader warp packer unit 105 may use knowledge of the quad coverage from a primitive to make this determination.
- Each of the bitmasks (e.g., 175 , 180 ) may include 1 bit per pixel.
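A minimal sketch of how the two per-direction bitmasks might be derived from a packed quad follows. The bit polarity (a set bit meaning "the neighbor in this direction is not available from the same primitive"), the lane order (0 = upper-left, 1 = upper-right, 2 = lower-left, 3 = lower-right), and the name `neighbor_masks` are assumptions for illustration; the disclosure only states that each mask carries one bit per pixel.

```python
def neighbor_masks(owners):
    """Build (horizontal_mask, vertical_mask) for one packed 2x2 quad.

    `owners` lists, for each of the four lanes, the primitive that the lane
    works for (covered pixels and helper pixels alike). A set bit means the
    lane has no same-primitive neighbor in that direction, so it would need
    the same-lane derivative path in this sketch.
    """
    horizontal = 0
    vertical = 0
    for lane, owner in enumerate(owners):
        if owners[lane ^ 1] != owner:  # lane ^ 1 is the horizontal neighbor
            horizontal |= 1 << lane
        if owners[lane ^ 2] != owner:  # lane ^ 2 is the vertical neighbor
            vertical |= 1 << lane
    return horizontal, vertical


if __name__ == "__main__":
    # Primitive 130 owns both upper lanes plus a lower-left helper lane;
    # primitive 135 owns the lower-right lane.
    h, v = neighbor_masks([130, 130, 130, 135])
    print(bin(h), bin(v))
```

In this example the lower-right lane has both bits set, matching the case of a single covered pixel that needs two helper pixels' worth of extra attribute data.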
- Primitives may be rasterized in multiples of a 2×2 quad.
- One or more 2×2 quads may be packed together into pixel shader warps, which may be sent to the one or more shader cores 110 for execution.
- Calculating a directional derivative (e.g., 160, 165, 192) may use at least two data values in each of two dimensions. Accordingly, pixels in a 2×2 quad may be used to compute directional derivatives.
- This directional derivative may be useful in computing a particular level-of-detail (LOD), which may be used in a variety of ways, including determining a mip-level of a texture.
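To show how directional derivatives feed a mip-level choice, here is a sketch of a commonly used isotropic formulation, similar in spirit to the scale-factor computation described in the OpenGL and Vulkan specifications. It is not taken from the disclosure; the function name `mip_lod`, the clamp at level 0, and the normalized-coordinate convention are assumptions.

```python
import math


def mip_lod(ds_dx, dt_dx, ds_dy, dt_dy, tex_width, tex_height):
    """Estimate a mip level from texture-coordinate derivatives.

    Derivatives are per-pixel changes of normalized (s, t) coordinates;
    multiplying by the texture size converts them to texels per pixel.
    """
    rho_x = math.hypot(ds_dx * tex_width, dt_dx * tex_height)
    rho_y = math.hypot(ds_dy * tex_width, dt_dy * tex_height)
    rho = max(rho_x, rho_y)
    if rho <= 0.0:
        return 0.0  # degenerate footprint: fall back to the base level
    # Clamp at 0 so magnification selects the base level in this sketch.
    return max(0.0, math.log2(rho))


if __name__ == "__main__":
    # Two texels per pixel in each direction on a 512x512 texture.
    print(mip_lod(1 / 256, 0.0, 0.0, 1 / 256, 512, 512))
```

Real hardware may approximate these steps or add bias and anisotropy handling, so this should be read as an illustration of the data dependency only.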
- 2×2 quads may also help to reduce texture traffic by exploiting spatial locality between pixels in a 2×2 quad. Accordingly, the GPU 100 may schedule a 2×2 quad into the one or more pixel shader cores 110 even when not all pixels inside the 2×2 quad are occupied. Pixels which are not visible but are used only for the purposes of computing the directional derivative (e.g., 160, 165, 192) may be referred to as H pixels. When most primitives are large, most 2×2 quads are fully filled. Stated differently, embodiments disclosed herein matter most when most of the primitives are not large. In some embodiments, large may be defined as covering tens or hundreds of 2×2 quads. The larger a primitive, the smaller its fraction of partially-covered 2×2 quads compared to 2×2 quads fully covered by that primitive.
- the dynamic branching pixel shader warp packer unit 105 may independently schedule a pixel. This may be made possible by assigning attributes of neighboring pixels in the X and Y directions to the same pixel.
- the directional derivative (e.g., 160 , 165 , 192 ) may be fully computed in a same-lane operation instead of a cross-lane operation.
- this can be achieved by copying the barycentric factors 185 of neighboring pixels in the horizontal and vertical directions of a 2×2 quad.
- each pixel may now contain three sets of barycentric factors 185 : i) one for itself; ii) one for a neighboring pixel along the horizontal direction; and iii) one for a neighboring pixel along the vertical direction.
- the barycentric factors 185 may be used to compute the attributes for the directional derivative (e.g., 160 , 165 , 192 ). This technique may increase register pressure because four additional values may be stored per lane. However, for each set of barycentric factors 185 , it is sufficient to store just two of them, while the third value can be computed by subtracting the sum of the two barycentric factors 185 from one (1).
- when a pixel (e.g., 150) used for computing a directional derivative (e.g., 160, 165, 192) is part of the primitive's coverage, the dynamic branching pixel shader warp packer unit 105 need not copy barycentric factors (e.g., 185) of that pixel, and the dynamic branching pixel shader warp packer unit 105 may use cross-lane operations to compute the directional derivative (e.g., 160). Otherwise, the dynamic branching pixel shader warp packer unit 105 may use single-lane operations to compute the directional derivative (e.g., 165).
- the barycentric factors 185, one or more flags 114, and/or a packed warp 191 having a coverage map 190 of the 2×2 quad 140 may be sent to the one or more shader cores 110.
- the one or more shader cores 110 may use the coverage map 190 to determine whether the horizontal and/or vertical derivatives are computed in a same-lane operation or a cross-lane operation.
- the one or more shader cores 110 may employ dynamic branching to use either of these paths.
- a separate code entry point into a shader program may be provided, which may be preferable if it saves latency and energy associated with executing a dynamic branch instruction.
- the dynamic branch may be “performed” external to the shader program by having the program start from different entry points responsive to the coverage map 190 and/or other parameters.
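The choice between the two derivative paths can be sketched as a per-lane branch on a flag bit. Everything named here is hypothetical: the functions, the lane order (0 = upper-left, 1 = upper-right, 2 = lower-left, 3 = lower-right), and the convention that a set mask bit selects the same-lane path are assumptions made for illustration.

```python
def ddx_cross_lane(values, lane):
    """Horizontal derivative read from the neighboring lane in the same row."""
    left = values[lane & ~1]
    right = values[lane | 1]
    return right - left


def ddx_same_lane(own_value, neighbor_value, lane):
    """Horizontal derivative using a neighbor value evaluated in this lane.

    `neighbor_value` stands for the attribute evaluated from the copied
    barycentric factors of the horizontal neighbor.
    """
    if lane & 1:  # this lane is the right-hand pixel of its row
        return own_value - neighbor_value
    return neighbor_value - own_value


def horizontal_derivative(values, neighbor_values, horizontal_mask, lane):
    """Pick the same-lane or cross-lane path from one flag bit per lane."""
    if horizontal_mask & (1 << lane):
        return ddx_same_lane(values[lane], neighbor_values[lane], lane)
    return ddx_cross_lane(values, lane)


if __name__ == "__main__":
    values = [1.0, 3.0, 10.0, 20.0]           # per-lane attribute values
    neighbor_values = [None, None, 12.0, 18.0]  # only used when flagged
    mask = 0b1100                             # lower row takes the same-lane path
    print([horizontal_derivative(values, neighbor_values, mask, lane)
           for lane in range(4)])
```

Selecting a separate entry point per coverage pattern would move this `if` outside the shader program, which is the alternative the text describes for avoiding the branch instruction.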
- partial differentials 192 may be sent to the one or more shader cores 110 , as further described below.
- the GPU 100 may include a memory 116 to store the directional derivatives.
- the memory 116 may be a volatile or non-volatile memory or other suitable storage device.
- the GPU 100 may include one or more texture units 195 .
- Calculation of the LOD can be performed either in the one or more shader cores 110 or in the one or more texture units 195 .
- the additional attribute data e.g., 145
- the one or more shader cores 110 may be aware of sample instructions and/or texture operations that may be occurring, and therefore may have access to other texture coordinate information to provide to the one or more texture units 195 .
- the GPU 100 may include one or more interpolation units 198 .
- Interpolation of the attributes e.g., 145
- additional modification to the one or more interpolation units 198 may perform attribute interpolations for H pixels.
- a map 199 indicating which H pixels may be present and which attributes may be associated with each H pixel may be used by the one or more interpolation units 198 .
- FIG. 2 is a block diagram showing a 2×2 quad 200 in accordance with some embodiments.
- the 2×2 quad 200 may include four pixels 205 a, 205 b, 205 c, and 205 d.
- Each of the pixels may have a center point (e.g., 215 ).
- Each of the pixels may have a pixel number (e.g., 210 ).
- FIG. 3 is a block diagram showing the primitive 130 in a partially-covered 2×2 quad 115 in accordance with some embodiments.
- FIG. 4 is a block diagram showing the primitive 135 in another partially-covered 2×2 quad 120 in accordance with some embodiments. Reference is now made to FIGS. 1 A, 3, and 4.
- the 2×2 quad 115 may be filled based on the coverage of arriving primitives.
- the primitives 130 and 135 may each contribute coverage to a same 2×2 quad 140.
- the primitives 130 and 135 may each include three vertices (e.g., 305 , 405 ).
- the 2×2 quad 115 and the 2×2 quad 120 would both need to be individually sent to the one or more shader cores 110.
- the 2×2 quad 115 would be sent having two upper pixels covered by the primitive 130, and two H pixels (e.g., 310) used for directional derivative calculations.
- the 2×2 quad 120 would be separately sent having one lower pixel covered by the primitive 135, and three H pixels (e.g., 410) used for directional derivative calculations. Thus, although only three pixels are covered in the 2×2 quads 115 and 120, a total of eight (8) threads would be allocated in the one or more shader cores 110, one for each pixel of each 2×2 quad.
- FIG. 5 is a block diagram showing a packed 2×2 quad 140 in accordance with some embodiments.
- H pixels need not be used when the coverage belongs to another adjacent primitive (e.g., 130 , 135 ).
- the primitives 130 and 135 are present in the 2×2 quad 140, and thus only the primitive 130 can have sufficient coverage along with the associated H pixel 310 to provide for directional derivative calculations.
- the primitive 135 has only a single pixel (i.e., 205 d) of coverage and no associated H pixels in the 2×2 quad 140.
- the pixel 205 d corresponding to the primitive 135's coverage may be marked as needing two H pixels' worth of additional attribute information (e.g., 145) and/or attribute evaluation by the one or more shader cores 110.
- each of the primitives e.g., 130 , 135
- only three (3) or four (4) pixels may be sent to the one or more shader cores 110 rather than eight (8) pixels, while still achieving a completely lossless result.
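The thread savings in this example reduce to simple arithmetic: three covered pixels occupy eight lanes when the two quads are sent separately and four lanes when they are packed. A small sketch, with the helper name `occupancy` chosen only for illustration:

```python
def occupancy(covered_pixels, quads_sent, quad_size=4):
    """Fraction of allocated shader lanes that hold covered pixels."""
    return covered_pixels / (quads_sent * quad_size)


if __name__ == "__main__":
    # Quads 115 and 120 sent separately: 3 covered pixels in 8 lanes.
    print(occupancy(3, quads_sent=2))
    # Packed into quad 140: the same 3 covered pixels in 4 lanes.
    print(occupancy(3, quads_sent=1))
```

Under these numbers the fraction of useful lanes doubles, from 0.375 to 0.75, before accounting for the extra per-pixel attribute work that the packed case may require.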
- FIG. 6 is a block diagram showing another packed 2×2 quad 600 in accordance with some embodiments.
- there are three primitives (e.g., 130, 135, and 615) that are present with coverage in the same 2×2 quad 600, and the primitives (e.g., 130, 135, and 615) may not have sufficient pixels in the 2×2 quad 600 for directional derivative calculations to be performed across lanes. Consequently, at least some additional attribute calculations based on the attribute information 145 associated with each pixel of the 2×2 quad 600 may be performed for one or more adjacent pixels.
- the bitmasks may be used to indicate for each of the pixels in the 2×2 quad 600 which neighboring pixels are or are not present for a given primitive (e.g., 130, 135, and 615). This enables the one or more shader cores 110 and/or the one or more texture units 195 to compute the directional derivatives based on the bitmasks (e.g., 175, 180) and/or based on the additional attribute information (e.g., 145).
- the horizontal directional derivative calculation may be computed using conventional technology whereas the vertical directional derivative may be computed using an embodiment disclosed herein.
- the result is the denser packing, as described above, while no unnecessary additional work is placed on a given pixel when there is a neighboring pixel of the same primitive available for the directional derivative calculation.
- the overloading of additional work may only be present when a covered pixel or helper pixel is not available, the latter due to the particular pixel position being occupied by another primitive with coverage in the same 2×2 quad.
- each of the primitives e.g., 130 , 135 , and 615
- only four (4) pixels may be sent to the one or more shader cores 110 rather than twelve (12) pixels, while still achieving a completely lossless result.
- Some application programming interfaces (APIs) of GPUs have specific rules about the operations that should and should not be performed for the H pixels that may be present in quads that do not have complete primitive coverage.
- the APIs may specify that the H pixels should not perform any side-effects, such as reads, writes, or atomic accesses to memory.
- the APIs may also specify that the H pixels should not write any data to render targets, sometimes referred to as “attachments” by some APIs.
- such API restrictions are not problematic, and may even be convenient, because they lessen the burden on the compiler regarding the additional code needed for pixels in a quad that do not also have a vertical and/or a horizontal H pixel or actual pixel present.
- Barycentric evaluations for attributes may be provided. Regardless of whether the embodiments disclosed herein are invoked, for each pixel sent to the one or more shader cores 110 for processing, post front-end transformation, application-provided attributes may be interpolated to the pixel's location. This may take the form of interpolating a barycentric coordinate pair and using that as the basis for evaluating each of the primitives' attributes at each of the pixel locations. In some embodiments, this evaluation of attributes may be augmented to include the evaluation of the pixel's attributes and also that of one or two immediately adjacent pixels, to serve as directional derivative texture input.
- Reference is now made to FIGS. 1 A through 7.
- partial differentials 192 may be sent to the one or more shader cores 110 , which may be used to compute the directional derivatives (e.g., 160 , 165 , 192 ).
- the directional derivatives may be determined as follows:
- $(x_0, y_0)$, $(x_1, y_1)$, $(x_2, y_2)$ may be the three vertices (e.g., 305, 405, . . . ) of a primitive in screen space.
- A may be the total area of the primitive (e.g., 130 , 135 ).
- the one or more shader cores 110 and/or the one or more texture units 195 may compute the barycentric factors 185 of the horizontal and vertical pixel neighbors using the following relationships:
- the partial differentials 192 can be supplied to the one or more shader cores 110 , and the one or more shader cores 110 and/or the one or more texture units 195 can compute the barycentric factors 185 of the neighboring pixels to compute the directional derivative (e.g., 160 , 165 , 192 ).
- the partial differentials 192 can be used to directly compute the directional derivative (e.g., 160 , 165 , 192 ).
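Since the relationships themselves are not reproduced in this text, the following sketch uses the standard screen-space barycentric formulation to illustrate the idea: because the factors are linear in (x, y), a neighbor's factors equal the current pixel's factors plus a constant partial differential. The function names and the specific edge-function form are assumptions and may differ from the relationships in the original disclosure.

```python
def barycentric_setup(v0, v1, v2):
    """Return (du_dx, du_dy, dv_dx, dv_dy) for a screen-space triangle.

    Uses the standard formulation in which the denominator is twice the
    signed triangle area. The four results are the per-primitive partial
    differentials of the barycentric factors u and v.
    """
    (x0, y0), (x1, y1), (x2, y2) = v0, v1, v2
    twice_area = (y1 - y2) * (x0 - x2) + (x2 - x1) * (y0 - y2)
    return ((y1 - y2) / twice_area, (x2 - x1) / twice_area,
            (y2 - y0) / twice_area, (x0 - x2) / twice_area)


def barycentric_at(v0, v1, v2, x, y):
    """Evaluate (u, v) at pixel position (x, y); the third factor is 1 - u - v."""
    (x0, y0), (x1, y1), (x2, y2) = v0, v1, v2
    twice_area = (y1 - y2) * (x0 - x2) + (x2 - x1) * (y0 - y2)
    u = ((y1 - y2) * (x - x2) + (x2 - x1) * (y - y2)) / twice_area
    v = ((y2 - y0) * (x - x2) + (x0 - x2) * (y - y2)) / twice_area
    return u, v


if __name__ == "__main__":
    tri = ((0, 0), (4, 0), (0, 4))
    du_dx, du_dy, dv_dx, dv_dy = barycentric_setup(*tri)
    u, v = barycentric_at(*tri, 1, 1)
    # Factors of the horizontal neighbor at (2, 1), obtained without re-evaluation.
    print((u + du_dx, v + dv_dx), barycentric_at(*tri, 2, 1))
```

The four partial differentials are constant per primitive, which is consistent with the four-register cost mentioned for this approach.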
- the computation for one value is shown below.
- the same technique can be applied to all attributes.
- the directional derivative can be computed using two multiplications and one addition, assuming $(t_0 - t_2)$ and $(t_1 - t_2)$ have already been computed.
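The operation count can be checked with a short derivation. This is a reconstruction from the standard barycentric interpolation of an attribute $t$ with vertex values $t_0, t_1, t_2$ and factors $u, v$, not a copy of the computation from the original disclosure, which is missing from this text.

```latex
\begin{aligned}
t_{x,y} &= t_0\,u_{x,y} + t_1\,v_{x,y} + t_2\,(1 - u_{x,y} - v_{x,y}) \\
        &= t_2 + (t_0 - t_2)\,u_{x,y} + (t_1 - t_2)\,v_{x,y} \\[4pt]
\frac{\partial t}{\partial x} &= (t_0 - t_2)\,\frac{\partial u}{\partial x}
                               + (t_1 - t_2)\,\frac{\partial v}{\partial x} \\[4pt]
\frac{\partial t}{\partial y} &= (t_0 - t_2)\,\frac{\partial u}{\partial y}
                               + (t_1 - t_2)\,\frac{\partial v}{\partial y}
\end{aligned}
```

With the two vertex differences and the partial differentials of $u$ and $v$ already available, each derivative costs two multiplications and one addition.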
- Sending the barycentric factors 185 and partial differentials 192 may incur the same cost in terms of registers, e.g., four registers.
- with the partial differentials 192, computation of the directional derivative (e.g., 160, 165, 192) may involve just two multiplications and one addition, since the partial differentials 192 may already be available.
- FIG. 7 is a diagram associated with barycentric factor computation of neighboring pixels expressed relative to a pixel at (x,y). The figures may be expressed in equations as follows:
- the following table 1 shows a barycentric factor computation of neighboring pixels expressed relative to the pixel at (x,y):
- each attribute's value given by (s 0 ,t 0 ), (s 1 ,t 1 ), (s 2 ,t 2 ) at each of the three vertices, may be combined with the barycentric factors for this pixel as follows.
- $s_{x,y} = s_0 \cdot u_{x,y} + s_1 \cdot v_{x,y} + s_2 \cdot (1 - u_{x,y} - v_{x,y})$
- $t_{x,y} = t_0 \cdot u_{x,y} + t_1 \cdot v_{x,y} + t_2 \cdot (1 - u_{x,y} - v_{x,y})$
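As a minimal sketch of the attribute evaluation described here, the following combines per-vertex values with a pixel's barycentric factors. The function names are hypothetical, and the code assumes that `u` weights vertex 0, `v` weights vertex 1, and the remainder `1 - u - v` weights vertex 2.

```python
def interpolate_attribute(a0, a1, a2, u, v):
    """Evaluate one attribute at a pixel from its values at the three vertices."""
    return a0 * u + a1 * v + a2 * (1 - u - v)

def interpolate_texcoord(st0, st1, st2, u, v):
    """Evaluate an (s, t) texture coordinate; each stN is the (s, t) pair at vertex N."""
    s = interpolate_attribute(st0[0], st1[0], st2[0], u, v)
    t = interpolate_attribute(st0[1], st1[1], st2[1], u, v)
    return s, t
```

The same `interpolate_attribute` call serves every attribute of the primitive, since only the per-vertex values change.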
- the following table 2 shows directional derivative computation given the values at the three corners of the triangle (i.e., primitive):
- the BCI may copy the value of
- the embodiments disclosed herein may bring the shader rate more in line with the primitive setup rate.
- the embodiments disclosed herein do not require a massive redesign of the GPU pipeline.
- Some changes may be made in the pixel packing logic, and some additional software-accessible data may be provided to the shader core for pixel shaders.
- the techniques disclosed herein are completely lossless. Unlike conventional technologies, there is no change in the visual or numerical results, and thus no need for a heuristic to decide when applying these techniques would be aesthetically acceptable.
- the embodiments disclosed herein may be completely transparent to applications.
- Total latency for a given warp may grow in cases when multiple primitives' coverage may be packed in the same quad, but this is more than offset by other savings.
- the sum of the additional instructions used for interpolating the attributes is more than the threshold of the total number of instructions executed by the one or more shader cores 110 .
- the cost of interpolating additional attributes may be relatively high such that packing more threads does not benefit the overall performance.
- the number of texture accesses needed for the directional derivatives may be more than a certain threshold. In such cases, it may be better to disable embodiments disclosed herein, and instead perform calculations using conventional techniques. However, the trend is towards more complicated pixel shaders, requiring more calculations per thread.
- the compiler may statically analyze the shader program to determine if it's worth packing multiple primitives into a quad, and if so, set a flag or some other state variable to enable embodiments disclosed herein. In other words, it may be beneficial to switch off the disclosed techniques when the overhead of the disclosed techniques exceeds the benefits.
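One way such a static check might look is sketched below. Everything here is an assumption made for illustration: the `ShaderStats` fields, the function name, and both threshold values are hypothetical, and the disclosure does not specify how a compiler would weigh these costs.

```python
from dataclasses import dataclass

@dataclass
class ShaderStats:
    total_instructions: int                 # instructions in the compiled pixel shader
    extra_interpolation_instructions: int   # added attribute-interpolation instructions
    derivative_texture_accesses: int        # texture accesses that need directional derivatives

def should_enable_packing(stats, max_extra_fraction=0.1, max_derivative_accesses=8):
    """Return True when the estimated packing overhead stays under both (hypothetical) thresholds."""
    if stats.total_instructions <= 0:
        return False
    extra_fraction = stats.extra_interpolation_instructions / stats.total_instructions
    if extra_fraction > max_extra_fraction:
        return False  # interpolation overhead dominates; packing is unlikely to pay off
    if stats.derivative_texture_accesses > max_derivative_accesses:
        return False  # too many derivative-dependent texture accesses
    return True
```

The result would then drive the flag or state variable mentioned above; longer shaders amortize the fixed interpolation overhead and are more likely to pass the check.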
- a pixel may be marked as an H pixel, which can be used to indicate where these lanes need to be switched on and where these lanes need to be switched off to avoid unproductive work.
- FIG. 8 is a flow diagram 800 illustrating a technique for performing shader occupancy for small primitives in accordance with some embodiments.
- a shader warp packer unit may receive a first primitive associated with a first partially covered quad, and a second primitive associated with a second partially covered quad.
- the shader warp packer unit may determine that the first partially covered quad and the second partially covered quad have non-overlapping coverage.
- the shader warp packer unit may pack the first partially covered quad and the second partially covered quad into a packed quad even when the first partially covered quad and the second partially covered quad are spatially disjoint. The term disjoint may imply non-overlapping.
- the shader warp packer unit may send the packed quad to one or more shader cores.
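The flow of FIG. 8 can be illustrated with a small software model. This is a sketch under assumptions, not the hardware design: coverage is modeled as a 4-bit mask over the 2x2 quad, each pixel stays in its original lane, and the names `PartialQuad`, `PackedLane`, and `try_pack` are hypothetical. Each packed lane records its own screen position so that spatially disjoint quads can share one packed quad.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

LANE_OFFSETS = ((0, 0), (1, 0), (0, 1), (1, 1))  # lane index -> (dx, dy) within a 2x2 quad

@dataclass(frozen=True)
class PartialQuad:
    primitive_id: int
    origin: Tuple[int, int]  # screen position of the quad's top-left pixel
    coverage: int            # 4-bit mask; bit i is set when lane i is covered

@dataclass(frozen=True)
class PackedLane:
    primitive_id: int
    pixel: Tuple[int, int]

def try_pack(a: PartialQuad, b: PartialQuad) -> Optional[List[Optional[PackedLane]]]:
    """Pack two partially covered quads into one, or return None if their coverage overlaps."""
    if a.coverage & b.coverage:
        return None  # both quads need the same lane
    lanes: List[Optional[PackedLane]] = [None] * 4
    for quad in (a, b):
        for lane, (dx, dy) in enumerate(LANE_OFFSETS):
            if quad.coverage & (1 << lane):
                lanes[lane] = PackedLane(quad.primitive_id,
                                         (quad.origin[0] + dx, quad.origin[1] + dy))
    return lanes
```

A packed result with no `None` entries corresponds to a fully occupied quad; any remaining `None` lanes could accept coverage from a further primitive.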
- FIG. 9 is a flow diagram 900 illustrating another technique for performing shader occupancy for small primitives in accordance with some embodiments.
- the shader warp packer unit or the one or more shader cores may choose between a first operating mode or a second operating mode based on at least one run-time flag.
- the shader warp packer unit or the one or more shader cores may store attribute information associated with at least one pixel of a packed quad.
- a directional derivative may be computed in a single-lane operation based on the stored attribute information.
- a directional derivative may be computed in a cross-lane operation based on the stored attribute information.
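The difference between the two operating modes of FIG. 9 can be shown with a toy model. The sketch assumes each lane's attribute is available as a function of pixel position and that a single boolean stands in for the run-time flag; these modeling choices and the function names are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Attribute = Callable[[int, int], float]  # one primitive's attribute as a function of (x, y)

def ddx_cross_lane(lane_values: Sequence[float], lane: int) -> float:
    """Difference between the two horizontally paired lanes (lane order: (0,0),(1,0),(0,1),(1,1))."""
    left = lane & ~1
    return lane_values[left | 1] - lane_values[left]

def ddx_single_lane(attr: Attribute, pixel: Tuple[int, int]) -> float:
    """Difference between the lane's own pixel and its horizontal neighbor, evaluated in one lane."""
    x, y = pixel
    return attr(x + 1, y) - attr(x, y)

def ddx(single_lane_mode: bool, attrs: Sequence[Attribute],
        pixels: Sequence[Tuple[int, int]], lane: int) -> float:
    """Select the operating mode from a run-time flag (modeled here as a boolean)."""
    if single_lane_mode:
        return ddx_single_lane(attrs[lane], pixels[lane])
    values = [attrs[i](*pixels[i]) for i in range(4)]
    return ddx_cross_lane(values, lane)
```

For an ordinary quad from one primitive, both modes give the same result. Once lanes hold pixels from different primitives, the cross-lane difference mixes unrelated values, which is why a packed quad calls for the single-lane mode.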
- FIGS. 8 and 9 need not be performed in the order shown, and intervening steps may be present.
- a GPU includes one or more shader cores.
- the GPU may include a shader warp packer unit configured to receive a first primitive associated with a first partially covered quad, and a second primitive associated with a second partially covered quad.
- the shader warp packer unit is configured to determine that the first partially covered quad and the second partially covered quad have non-overlapping coverage.
- the shader warp packer unit is configured to pack the first partially covered quad and the second partially covered quad into a packed quad.
- the shader warp packer unit is configured to send the packed quad to the one or more shader cores.
- the first partially covered quad and the second partially covered quad are spatially disjoint from each other. The term disjoint may imply non-overlapping.
- the one or more shader cores are configured to receive and process the packed quad with no loss of information relative to the one or more shader cores individually processing the first partially covered quad and the second partially covered quad.
- the shader warp packer unit is configured to assign zero or more pixels from the packed quad to a single lane for a single-lane operation. For example, a single “coverage” pixel may be assigned to a lane, and then zero, one, or two H pixels may be assigned to the same lane.
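The lane assignment in this example can be modeled as a short list per lane. The sketch below is an assumption-laden illustration: the function name is hypothetical, and placing the H pixels at (x + 1, y) and (x, y + 1) is one possible choice of "immediately adjacent" pixels, not something the disclosure fixes.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int]

def assign_lane_pixels(coverage_pixel: Optional[Pixel],
                       need_ddx: bool, need_ddy: bool) -> List[Tuple[str, Pixel]]:
    """Return the pixels assigned to one lane: a coverage pixel plus zero, one, or two H pixels."""
    if coverage_pixel is None:
        return []  # unused lane: zero pixels assigned
    x, y = coverage_pixel
    pixels = [("coverage", (x, y))]
    if need_ddx:
        pixels.append(("H", (x + 1, y)))  # horizontal neighbor for the X derivative
    if need_ddy:
        pixels.append(("H", (x, y + 1)))  # vertical neighbor for the Y derivative
    return pixels
```

A shader that samples no textures with derivatives would request neither H pixel, so each lane carries only its coverage pixel.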
- the shader warp packer unit is configured to cause the shader core(s) to compute a directional derivative in the single-lane operation.
- the one or more shader cores are configured to compute a directional derivative in the single-lane operation.
- the GPU may include a first operating mode in which at least one of i) the one or more texture units, or ii) the one or more shader cores are configured to compute the directional derivative in the single-lane operation.
- the GPU may include a second operating mode in which at least one of i) the one or more texture units, or ii) the one or more shader cores are configured to compute a second directional derivative in a cross-lane operation.
- the at least one of i) the shader warp packer unit or ii) the one or more shader cores are configured to choose at least one of the first operating mode or the second operating mode based on at least one run-time flag.
- the shader warp packer unit is configured to store attribute information associated with at least one pixel of the packed quad.
- at least one of i) the one or more texture units, or ii) the one or more shader cores are configured to compute a directional derivative based on the attribute information.
- At least one of i) the one or more texture units, or ii) the one or more shader cores are configured to compute a directional derivative based on one or more barycentric factors. In some embodiments, at least one of i) the one or more texture units, or ii) the one or more shader cores are configured to compute a directional derivative based on one or more partial differentials.
- At least one of i) the one or more texture units or ii) the one or more shader cores are configured to compute a first directional derivative in an X direction according to:
- A is the area of at least one of the first primitive or the second primitive.
- x0, y0, x1, y1, x2, and y2 are coordinates of vertices of the at least one of the first primitive or the second primitive.
- t0, t1, and t2 are the values of the “t” attribute at each of the three vertices, (x0, y0), (x1, y1), and (x2, y2), respectively.
- the values “s” and “t” may represent two such primitive attributes, and may be written as (s,t) to denote a texture coordinate at some particular pixel in a primitive, having been interpolated from the (s,t) values at each of the three vertices.
- the GPU 100 may include a memory configured to store the first directional derivative and the second directional derivative.
- Some embodiments include a method for performing shader occupancy for small primitives using a GPU.
- the method may include receiving, by a shader warp packer unit, a first primitive associated with a first partially covered quad, and a second primitive associated with a second partially covered quad.
- the method may include determining, by the shader warp packer unit, that the first partially covered quad and the second partially covered quad have non-overlapping coverage. The non-overlapping coverage quality may not be required. In other words, the first partially covered quad and the second partially covered quad may have overlapping coverage, although at the cost of some additional buffering of data while pixels are processed in the one or more shader cores.
- the method may include packing, by the shader warp packer unit, the first partially covered quad and the second partially covered quad into a packed quad.
- the method may include sending, by the shader warp packer unit, the packed quad to one or more shader cores.
- the first partially covered quad and the second partially covered quad are spatially disjoint from each other.
- the method may include receiving and processing, by the one or more shader cores, the packed quad with no loss of information relative to the one or more shader cores individually processing the first partially covered quad and the second partially covered quad.
- the method may include assigning zero or more pixels from the packed quad to a single lane for a single-lane operation. For example, a single “coverage” pixel may be assigned to a lane, and then zero, one, or two H pixels may be assigned to the same lane.
- the method may include computing a directional derivative in the single-lane operation.
- the method may include computing, in a first operating mode, by at least one of i) the one or more texture units or ii) the one or more shader cores, the directional derivative in the single-lane operation.
- the method may include computing, in a second operating mode, by the at least one of i) the one or more texture units or ii) the one or more shader cores, a second directional derivative in a cross-lane operation.
- the method may include choosing, by at least one of i) the one or more texture units or ii) the one or more shader cores, at least one of the first operating mode or the second operating mode based on at least one run-time flag.
- the method may include storing, by the shader warp packer unit, attribute information associated with at least one pixel of the packed quad.
- the method may include computing, by at least one of i) the one or more texture units, or ii) the one or more shader cores, a directional derivative based on the attribute information.
- the method may include computing, by at least one of i) the one or more texture units or ii) the one or more shader cores, a first directional derivative in an X direction according to:
- x 0 , y 0 , x 1 , y 1 , x 2 , and y 2 are coordinates of vertices of at least one of the first primitive or the second primitive.
- a software module may reside in Random Access Memory (RAM), flash memory, Read Only Memory (ROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), registers, hard disk, a removable disk, a CD ROM, or any other form of storage medium known in the art.
- the machine or machines include a system bus to which are attached processors, memory (e.g., RAM, ROM, or other state-preserving medium), storage devices, a video interface, and input/output interface ports.
- the machine or machines can be controlled, at least in part, by input from conventional input devices, such as keyboards, mice, etc., as well as by directives received from another machine, interaction with a virtual reality (VR) environment, biometric feedback, or other input signal.
- machine is intended to broadly encompass a single machine, a virtual machine, or a system of communicatively coupled machines, virtual machines, or devices operating together.
- exemplary machines include computing devices such as personal computers, workstations, servers, portable computers, handheld devices, telephones, tablets, etc., as well as transportation devices, such as private or public transportation, e.g., automobiles, trains, cabs, etc.
- the machine or machines can include embedded controllers, such as programmable or non-programmable logic devices or arrays, ASICs, embedded computers, cards, and the like.
- the machine or machines can utilize one or more connections to one or more remote machines, such as through a network interface, modem, or other communicative coupling.
- Machines can be interconnected by way of a physical and/or logical network, such as an intranet, the Internet, local area networks, wide area networks, etc.
- network communication can utilize various wired and/or wireless short range or long range carriers and protocols, including radio frequency (RF), satellite, microwave, Institute of Electrical and Electronics Engineers (IEEE) 802.11, Bluetooth®, optical, infrared, cable, laser, etc.
- Embodiments of the present disclosure can be described by reference to or in conjunction with associated data including functions, procedures, data structures, application programs, etc., which, when accessed by a machine, result in the machine performing tasks or defining abstract data types or low-level hardware contexts.
- Associated data can be stored in, for example, the volatile and/or non-volatile memory, e.g., RAM, ROM, etc., or in other storage devices and their associated storage media, including hard-drives, floppy-disks, optical storage, tapes, flash memory, memory sticks, digital video disks, biological storage, etc.
- Associated data can be delivered over transmission environments, including the physical and/or logical network, in the form of packets, serial data, parallel data, propagated signals, etc., and can be used in a compressed or encrypted format. Associated data can be used in a distributed environment, and stored locally and/or remotely for machine access.
- Embodiments of the present disclosure may include a non-transitory machine-readable medium comprising instructions executable by one or more processors, the instructions comprising instructions to perform the elements of the inventive concepts as described herein.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computer Graphics (AREA)
- Image Generation (AREA)
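TABLE 1
| Position | U | V |
|---|---|---|
| (x − 1, y) | $u_{x,y} - \frac{\partial u}{\partial x}$ | $v_{x,y} - \frac{\partial v}{\partial x}$ |
| (x, y − 1) | $u_{x,y} - \frac{\partial u}{\partial y}$ | $v_{x,y} - \frac{\partial v}{\partial y}$ |
| (x + 1, y) | $u_{x,y} + \frac{\partial u}{\partial x}$ | $v_{x,y} + \frac{\partial v}{\partial x}$ |
| (x, y + 1) | $u_{x,y} + \frac{\partial u}{\partial y}$ | $v_{x,y} + \frac{\partial v}{\partial y}$ |
| (x + 1, y + 1) | $u_{x,y} + \frac{\partial u}{\partial x} + \frac{\partial u}{\partial y}$ | $v_{x,y} + \frac{\partial v}{\partial x} + \frac{\partial v}{\partial y}$ |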
TABLE 2
| Direction | Directional Derivative Calculation |
|---|---|
| X | |
| Y | |
may already be available with a barycentric interpolation (BCI) unit of a GPU. Such pre-existing data can be copied to a vector register file, and then the calculations above may be reduced to four (4) multiplications and four (4) additions (or subtractions).
to a register, and then the number of operations may be reduced to four (4) multiplications and eight (8) additions (or subtractions). Both alternatives may involve allocating an additional four registers to store additional values.
and at least one of i) the one or more texture units or ii) the one or more shader cores are configured to compute a second directional derivative in a Y direction according to:
In some embodiments, A is the area of at least one of the first primitive or the second primitive. In some embodiments, x0, y0, x1, y1, x2, and y2 are coordinates of vertices of the at least one of the first primitive or the second primitive. In some embodiments, t0, t1, and t2 are the values of the “t” attribute at each of the three vertices, (x0, y0), (x1, y1), and (x2,y2), respectively. For each primitive arriving at the rasterizer and then the packer, there may be zero or more attributes to be interpolated across the primitive. The values “s” and “t” may represent two such primitive attributes, and may be written as (s,t) to denote a texture coordinate at some particular pixel in a primitive, having been interpolated from the (s,t) values at each of the three vertices. The
and computing, by the at least one of i) the one or more texture units or ii) the one or more shader cores, a second directional derivative in a Y direction according to:
In some embodiments, A is a total area of at least one of the first primitive or the second primitive. In some embodiments, x0, y0, x1, y1, x2, and y2 are coordinates of vertices of at least one of the first primitive or the second primitive.
Claims (16)
Priority Applications (6)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US17/168,168 US11748933B2 (en) | 2020-08-03 | 2021-02-04 | Method for performing shader occupancy for small primitives |
| KR1020210090072A KR102766380B1 (en) | 2020-08-03 | 2021-07-09 | Method for performing shader occupancy for small primitives |
| CN202110812734.XA CN114092627A (en) | 2020-08-03 | 2021-07-19 | Method for executing shader occupation on small primitive |
| US17/503,259 US11798218B2 (en) | 2020-08-03 | 2021-10-15 | Methods and apparatus for pixel packing |
| CN202210092245.6A CN114862652A (en) | 2021-02-04 | 2022-01-26 | Method and apparatus for pixel packing |
| KR1020220015131A KR20220112710A (en) | 2021-02-04 | 2022-02-04 | Methods and apparatus for pixel packing related application data |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202063060653P | 2020-08-03 | 2020-08-03 | |
| US17/168,168 US11748933B2 (en) | 2020-08-03 | 2021-02-04 | Method for performing shader occupancy for small primitives |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US17/503,259 Continuation-In-Part US11798218B2 (en) | 2020-08-03 | 2021-10-15 | Methods and apparatus for pixel packing |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US20220036631A1 US20220036631A1 (en) | 2022-02-03 |
| US11748933B2 true US11748933B2 (en) | 2023-09-05 |
Family
ID=80004462
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US17/168,168 Active 2041-05-05 US11748933B2 (en) | 2020-08-03 | 2021-02-04 | Method for performing shader occupancy for small primitives |
Country Status (3)
| Country | Link |
|---|---|
| US (1) | US11748933B2 (en) |
| KR (1) | KR102766380B1 (en) |
| CN (1) | CN114092627A (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10460513B2 (en) * | 2016-09-22 | 2019-10-29 | Advanced Micro Devices, Inc. | Combined world-space pipeline shader stages |
Citations (14)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7502035B1 (en) | 2005-12-19 | 2009-03-10 | Nvidia Corporation | Apparatus, system, and method for multi-sample pixel coalescing |
| US20170032488A1 (en) * | 2015-07-30 | 2017-02-02 | Arm Limited | Graphics processing systems |
| US20170116698A1 (en) | 2015-10-27 | 2017-04-27 | Nvidia Corporation | Techniques for maintaining atomicity and ordering for pixel shader operations |
| US20170161940A1 (en) | 2015-12-04 | 2017-06-08 | Gabor Liktor | Merging Fragments for Coarse Pixel Shading Using a Weighted Average of the Attributes of Triangles |
| US9721376B2 (en) | 2014-06-27 | 2017-08-01 | Samsung Electronics Co., Ltd. | Elimination of minimal use threads via quad merging |
| US9779542B2 (en) | 2015-09-25 | 2017-10-03 | Intel Corporation | Apparatus and method for implementing flexible finite differences in a graphics processor |
| US20170309065A1 (en) | 2014-06-27 | 2017-10-26 | Samsung Electronics Co., Ltd. | Elimination of minimal use threads via quad merging |
| US9865074B2 (en) | 2014-04-05 | 2018-01-09 | Sony Interactive Entertainment America Llc | Method for efficient construction of high resolution display buffers |
| US10275851B1 (en) | 2017-04-25 | 2019-04-30 | EMC IP Holding Company LLC | Checkpointing for GPU-as-a-service in cloud computing environment |
| US20200018965A1 (en) | 2017-03-01 | 2020-01-16 | Sony Interactive Entertainment Inc. | Image processing |
| US20200184715A1 (en) | 2018-12-11 | 2020-06-11 | Samsung Electronics Co., Ltd. | Efficient redundant coverage discard mechanism to reduce pixel shader work in a tile-based graphics rendering pipeline |
| US20200184703A1 (en) | 2018-12-08 | 2020-06-11 | Arm Limited | Performing texturing operations for sets of plural execution threads in graphics processing systems |
| US10769752B2 (en) | 2017-03-17 | 2020-09-08 | Magic Leap, Inc. | Mixed reality system with virtual content warping and method of generating virtual content using same |
| WO2020191920A1 (en) | 2019-03-25 | 2020-10-01 | Huawei Technologies Co., Ltd. | Storing complex data in warp gprs |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9905046B2 (en) * | 2014-04-03 | 2018-02-27 | Intel Corporation | Mapping multi-rate shading to monolithic programs |
| US9824412B2 (en) * | 2014-09-24 | 2017-11-21 | Intel Corporation | Position-only shading pipeline |
-
2021
- 2021-02-04 US US17/168,168 patent/US11748933B2/en active Active
- 2021-07-09 KR KR1020210090072A patent/KR102766380B1/en active Active
- 2021-07-19 CN CN202110812734.XA patent/CN114092627A/en active Pending
Patent Citations (15)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7502035B1 (en) | 2005-12-19 | 2009-03-10 | Nvidia Corporation | Apparatus, system, and method for multi-sample pixel coalescing |
| US9865074B2 (en) | 2014-04-05 | 2018-01-09 | Sony Interactive Entertainment America Llc | Method for efficient construction of high resolution display buffers |
| US9972124B2 (en) | 2014-06-27 | 2018-05-15 | Samsung Electronics Co., Ltd. | Elimination of minimal use threads via quad merging |
| US9721376B2 (en) | 2014-06-27 | 2017-08-01 | Samsung Electronics Co., Ltd. | Elimination of minimal use threads via quad merging |
| US20170309065A1 (en) | 2014-06-27 | 2017-10-26 | Samsung Electronics Co., Ltd. | Elimination of minimal use threads via quad merging |
| US20170032488A1 (en) * | 2015-07-30 | 2017-02-02 | Arm Limited | Graphics processing systems |
| US9779542B2 (en) | 2015-09-25 | 2017-10-03 | Intel Corporation | Apparatus and method for implementing flexible finite differences in a graphics processor |
| US20170116698A1 (en) | 2015-10-27 | 2017-04-27 | Nvidia Corporation | Techniques for maintaining atomicity and ordering for pixel shader operations |
| US20170161940A1 (en) | 2015-12-04 | 2017-06-08 | Gabor Liktor | Merging Fragments for Coarse Pixel Shading Using a Weighted Average of the Attributes of Triangles |
| US20200018965A1 (en) | 2017-03-01 | 2020-01-16 | Sony Interactive Entertainment Inc. | Image processing |
| US10769752B2 (en) | 2017-03-17 | 2020-09-08 | Magic Leap, Inc. | Mixed reality system with virtual content warping and method of generating virtual content using same |
| US10275851B1 (en) | 2017-04-25 | 2019-04-30 | EMC IP Holding Company LLC | Checkpointing for GPU-as-a-service in cloud computing environment |
| US20200184703A1 (en) | 2018-12-08 | 2020-06-11 | Arm Limited | Performing texturing operations for sets of plural execution threads in graphics processing systems |
| US20200184715A1 (en) | 2018-12-11 | 2020-06-11 | Samsung Electronics Co., Ltd. | Efficient redundant coverage discard mechanism to reduce pixel shader work in a tile-based graphics rendering pipeline |
| WO2020191920A1 (en) | 2019-03-25 | 2020-10-01 | Huawei Technologies Co., Ltd. | Storing complex data in warp gprs |
Non-Patent Citations (3)
| Title |
|---|
| Corrected Notice of Allowability for U.S. Appl. No. 17/503,259, dated Jul. 3, 2023. |
| Notice of Allowance for U.S. Appl. No. 17/503,259, dated Jun. 15, 2023. |
| Office Action for U.S. Appl. No. 17/503,259, dated Mar. 30, 2023. |
Also Published As
| Publication number | Publication date |
|---|---|
| KR20220016774A (en) | 2022-02-10 |
| US20220036631A1 (en) | 2022-02-03 |
| KR102766380B1 (en) | 2025-02-13 |
| CN114092627A (en) | 2022-02-25 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11748840B2 (en) | Method for efficient re-rendering objects to vary viewports and under varying rendering and rasterization parameters | |
| US8040351B1 (en) | Using a geometry shader to perform a hough transform | |
| US9922457B2 (en) | Computing tessellation coordinates using dedicated hardware | |
| US9082204B2 (en) | Storage structures for stitching primitives in graphics processing | |
| US20110216068A1 (en) | Edge processing techniques | |
| TWI889203B (en) | Block-based rasterization method and device, image rendering method and device | |
| US20140063012A1 (en) | Computation reduced tessellation | |
| US12505604B2 (en) | Hybrid binning | |
| KR20210002753A (en) | Compiler support technology to reduce the memory usage of the graphics pipeline | |
| CN115330986B (en) | Graphics processing method and system in sub-block rendering mode | |
| US11030791B2 (en) | Centroid selection for variable rate shading | |
| JP4977712B2 (en) | Computer graphics processor and method for rendering stereoscopic images on a display screen | |
| US7804499B1 (en) | Variable performance rasterization with constant effort | |
| US11798218B2 (en) | Methods and apparatus for pixel packing | |
| US10192348B2 (en) | Method and apparatus for processing texture | |
| US11748933B2 (en) | Method for performing shader occupancy for small primitives | |
| US7405735B2 (en) | Texture unit, image rendering apparatus and texel transfer method for transferring texels in a batch | |
| US7616202B1 (en) | Compaction of z-only samples | |
| KR20220112710A (en) | Methods and apparatus for pixel packing related application data | |
| US7385604B1 (en) | Fragment scattering | |
| US7256796B1 (en) | Per-fragment control for writing an output buffer | |
| US12266139B2 (en) | Method and system for integrating compression | |
| US10755468B2 (en) | Image processing apparatus, image processing method, and program to improve speed for calculating a color of pixels in image data | |
| CN116957899B (en) | Graphics processor, system, device, equipment and method | |
| EP4315258A1 (en) | Post-depth visibility collection with two level binning |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | FEPP | Fee payment procedure | Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
| | STCF | Information on status: patent grant | Free format text: PATENTED CASE |
| | AS | Assignment | Owner name: SAMSUNG ELECTRONICS CO., LTD., KOREA, REPUBLIC OF. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: VARADARAJAN, KESHAVAN; TANNENBAUM, DAVID C.; GURUPAD, FNU; REEL/FRAME: 064853/0129. Effective date: 20210203 |