CN116894902A

CN116894902A - Reducing redundant rendering in a graphics system

Info

Publication number: CN116894902A
Application number: CN202310320862.1A
Authority: CN
Inventors: J·W·霍森; X·杨; M·祖切利
Original assignee: Imagination Technologies Ltd
Current assignee: Imagination Technologies Ltd
Priority date: 2022-03-31
Filing date: 2023-03-29
Publication date: 2023-10-17
Also published as: GB202204715D0; GB2617182A

Abstract

Redundant rendering in a graphics system is reduced. A method and system are disclosed for performing a rendering using a graphics processing unit that implements a tile-based graphics pipeline in which a rendering space is subdivided into tiles. Geometry data for the rendering is received, the geometry data including primitives associated with one or more vertex shader programs. The geometry data is processed using the vertex shader programs to generate processed primitives, and it is determined in which tile each of the processed primitives is located. For at least one selected tile, the following are stored: i) a representation of per-tile vertex shader data that identifies the one or more vertex shader programs used to generate the processed primitives in the tile, and ii) a representation of per-tile rendering data that may be used in rendering the processed primitives in the tile in subsequent stages of the graphics pipeline.

Description

Reducing redundant rendering in a graphics system

Cross Reference to Related Applications

The present application claims priority from UK patent applications 2204714.6 and 2204715.3, filed on 31 March 2022, which are incorporated herein by reference in their entirety.

Technical Field

The present disclosure relates to reducing redundant rendering in a graphics system.

Background

Graphics processing systems are typically configured to receive graphics data, for example, from an application running on a computer system, and render the graphics data to provide a rendering output. For example, graphics data provided to a graphics processing system may describe geometry within a three-dimensional (3D) scene to be rendered, and the rendering output may be a rendered image of the scene. Alternatively, the rendered image of the scene may be formed from multiple rendering outputs (e.g., formed from a composite rendering output).

Some graphics processing systems (which may be referred to as "tile-based" graphics processing systems) use a rendering space that is subdivided into a plurality of tiles. A "tile" is an area of the rendering space and may have any suitable shape, but is generally rectangular (where the term "rectangular" includes square). As some examples, a tile may cover a 16 x 16 pixel block or a 32 x 32 pixel block of an image to be rendered. Subdividing the rendering space into tiles allows an image to be rendered in a tile-by-tile manner, wherein the graphics data for a tile may be temporarily stored "on-chip" during the rendering of that tile, thereby reducing the amount of on-chip memory that needs to be implemented on a Graphics Processing Unit (GPU) of a graphics processing system.

Tile-based graphics processing systems typically operate in two phases. During the first phase, graphics data (e.g., as received from an application) is processed to generate a set of processed graphics data items, referred to as primitives. The primitives may represent geometric shapes describing the surfaces of structures within the scene. For example, the primitives may take the form of 2D geometric shapes, lines, or points. A primitive has one or more vertices, e.g., a triangle primitive has one vertex at each corner, i.e., three vertices. Objects or structures within a scene may be composed of one or more primitives. In some cases, a structure may be made up of many (e.g., hundreds, thousands, or millions of) primitives. The processed primitives are then analyzed to determine, for each tile, which primitives are at least partially located within the tile.

This first phase may be referred to herein as the geometry processing stage. During this stage, the operations performed on the graphics data are typically per-vertex or per-primitive operations.

During the second phase, a tile may be rendered by processing the primitives determined to be at least partially within the tile. In some cases, as part of the transition from the first phase to the second phase, primitives determined to be located within a tile may be sampled at sampling locations to determine which elementary areas (e.g., pixels) of the screen the primitives are present in. A fragment may then be generated for each of those elementary areas. The generated fragments may then be processed during the second phase to render the tile. Thus, the operations performed as part of the second phase to render tiles are typically per-pixel or per-fragment operations.

The output of the second phase (for the particular tile being rendered) may take the form of a set of values (e.g., color values) for each pixel within the tile. That is, the output of the second phase may be a set of values per pixel. After the first phase is completed, the tiles may be processed sequentially (or at least partially in parallel) according to the second phase. The second phase may be referred to herein as the rendering stage.

FIG. 1 illustrates an example of a tile-based graphics processing system that may be used to render images of 3D scenes. A schematic diagram of a 3D scene is shown at 200 in FIG. 2.

Graphics processing system 100 includes a Graphics Processing Unit (GPU) 102 and two portions of memory 104₁ and 104₂. These two portions may or may not form part of the same physical memory.

GPU 102 includes geometry processing logic 106, tiling unit 108, and rendering logic 110, where rendering logic 110 includes fetch unit 112 and fragment processing logic 114. Rendering logic 110 may be configured to implement Hidden Surface Removal (HSR) and texturing and/or shading on graphics data (e.g., primitive fragments) of tiles of a rendering space.

Geometry processing logic 106 is configured to receive graphics data (e.g., in the form of primitives) from an application describing a scene to be rendered (e.g., scene 200 in FIG. 2). In the geometry processing stage, geometry processing logic 106 performs geometry processing functions such as clipping and culling to remove primitives that do not fall within the visible view. Geometry processing logic 106 may also project primitives into screen space (shown schematically at 202 in FIG. 2). Geometry processing logic 106 may also execute vertex shader programs on the primitives, for example to manipulate or change primitive or vertex data. Geometry processing logic 106 may further perform operations such as hull shading, tessellation, and domain shading. The processed primitives output from geometry processing logic 106 are passed to tiling unit 108, which determines which primitives are present within (i.e., at least partially intersect) each tile (e.g., tiles 204A-D) of the rendering space of graphics processing system 100. The tiling unit 108 may assign primitives to tiles of the rendering space by creating a control stream (or "display list" or "tile list") for each tile, wherein the control stream of a tile includes indications of the primitives present within that tile. The processed primitive data is sorted and stored in memory 104₁ in data structures called primitive blocks, and the control streams, which indicate which primitives are located in which tiles, are output from tiling unit 108 and stored in memory 104₁.
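The assignment of primitives to tiles performed by the tiling unit can be illustrated with a minimal Python sketch. It assumes screen-space primitives given as lists of (x, y) vertices and uses a conservative bounding-box test, which may list a primitive in a tile it does not actually touch. The function name `build_control_streams`, the 32-pixel tile size, and the dictionary used as a "control stream" are illustrative assumptions, not details taken from the system described here.

```python
from collections import defaultdict

TILE_SIZE = 32  # assumed tile edge length in pixels


def build_control_streams(primitives, width, height, tile_size=TILE_SIZE):
    """Map each tile (tx, ty) to the indices of the primitives whose
    screen-space bounding box overlaps that tile (conservative test)."""
    tiles_x = (width + tile_size - 1) // tile_size
    tiles_y = (height + tile_size - 1) // tile_size
    streams = defaultdict(list)
    for index, vertices in enumerate(primitives):
        xs = [x for x, _ in vertices]
        ys = [y for _, y in vertices]
        # Cull primitives whose bounding box lies entirely off screen.
        if max(xs) < 0 or max(ys) < 0 or min(xs) >= width or min(ys) >= height:
            continue
        # Clamp the bounding box to the range of valid tile indices.
        tx0 = max(int(min(xs)) // tile_size, 0)
        tx1 = min(int(max(xs)) // tile_size, tiles_x - 1)
        ty0 = max(int(min(ys)) // tile_size, 0)
        ty1 = min(int(max(ys)) // tile_size, tiles_y - 1)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                streams[(tx, ty)].append(index)
    return dict(streams)


# Two triangles on a 64 x 64 render target split into four 32 x 32 tiles.
triangles = [
    [(1, 1), (10, 1), (1, 10)],
    [(20, 20), (40, 20), (20, 40)],
]
control_streams = build_control_streams(triangles, 64, 64)
```

In this sketch the second triangle is listed in all four tiles because its bounding box spans them; a hardware tiling unit may apply a more precise intersection test.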

In the rendering phase, the rendering logic 110 renders graphics data of tiles of the rendering space to generate rendering values, such as rendered image values. Rendering logic 110 may be configured to implement any suitable rendering technique, such as rasterization or ray tracing, to perform rendering. To render a tile, the fetch unit 112 fetches the control stream of the tile, and the primitives associated with the tile from the primitive blocks, e.g., from memory 104₁ or from a cache. The fragment processing logic 114 may perform operations including hidden surface removal and shading and/or texturing on primitive fragments (i.e., fragments formed by sampling primitives) to form rendered image values of tiles. Texturing and/or shading may be performed by executing a suitable shader program. The rendered image values (e.g., pixel color values) may then be transferred to memory 104₂ for storage. The rendered image may be output from graphics processing system 100 and used in any suitable manner, such as for display on a display or stored in memory or transmitted to another device, etc.

When running certain applications (e.g., user interfaces, 2D games, applications with a static background, etc.), it may be the case that the graphics processing system outputs the same rendering values (across the entire image or portions of the image) for multiple renderings. That is, the entire image or one or more tiles of the image may have the same content (and thus the same rendering values) over a series of multiple renderings. This means that the graphics processing unit may perform the same operations over multiple renderings only to output the same rendering values for one or more tiles of the image, resulting in unnecessary processing.

Summary

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

A method and system are disclosed for performing a rendering using a graphics processing unit that implements a tile-based graphics pipeline in which a rendering space is subdivided into tiles. Geometry data for the rendering is received, the geometry data including primitives associated with one or more vertex shader programs. The geometry data is processed using the vertex shader programs to generate processed primitives, and it is determined in which tile each of the processed primitives is located. For at least one selected tile, the following are stored: i) a representation of per-tile vertex shader data that identifies the one or more vertex shader programs used to generate the processed primitives in the tile, and ii) a representation of per-tile rendering data that may be used in rendering the processed primitives in the tile in a subsequent stage of the graphics pipeline. For the selected tile, it is determined whether the output of a previous rendering for the tile can be used as the output of the rendering by comparing the per-tile vertex shader data of the tile with the vertex shader data of the previous rendering, before comparing the per-tile rendering data of the tile with the per-tile rendering data of the previous rendering.

A first aspect provides a method of performing a rendering using a graphics processing unit configured to implement a tile-based graphics pipeline in which a rendering space is subdivided into a plurality of tiles, the method comprising: receiving geometry data for the rendering, the geometry data comprising a plurality of primitives, each primitive associated with one or more vertex shader programs; processing the geometry data using the one or more vertex shader programs to generate one or more processed primitives; determining which of the processed primitives are located within each tile of the plurality of tiles; for at least one selected tile of the plurality of tiles, storing i) a representation of per-tile vertex shader data that identifies the one or more vertex shader programs used to generate the processed primitives located in the tile, and ii) a representation of per-tile rendering data that is usable in rendering the processed primitives within the tile in a subsequent stage of the graphics pipeline; and for the or each selected tile, determining whether the output of a previous rendering for the tile can be used as the output of the rendering by comparing the per-tile vertex shader data of the tile with the vertex shader data of the previous rendering before comparing the per-tile rendering data of the tile with the per-tile rendering data of the previous rendering.

Determining whether a previously rendered output of the tile is available as the rendered output may include: determining whether the per-tile vertex shader data matches corresponding per-tile vertex shader data previously rendered; in response to determining that the per-tile vertex shader data matches, determining whether per-tile rendering data for the tile matches corresponding per-tile rendering data previously rendered; and in response to determining that the per-tile rendering data matches, using a previously rendered output of the tile as the rendered output.

Determining whether a previously rendered output of the tile is available as the rendered output may further include: in response to determining that the per-tile vertex shader data does not match, the graphics pipeline is caused to render the tile. Determining whether a previously rendered output of the tile is available as the rendered output may further include: in response to determining that the per-tile rendering data does not match, the graphics pipeline is caused to render the tile.
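The ordering of the two comparisons described above can be sketched as follows. This is a minimal Python illustration, assuming the per-tile vertex shader data is held as a set of shader identifiers and the per-tile rendering data as a single hash value; the names `TileSignature` and `can_reuse_previous_output` are hypothetical and do not come from the described system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TileSignature:
    """Assumed per-tile record kept for one rendering."""
    vertex_shader_data: frozenset  # identifiers of vertex shaders used in the tile
    rendering_data_hash: int       # representation of the per-tile rendering data


def can_reuse_previous_output(current: TileSignature,
                              previous: Optional[TileSignature]) -> bool:
    """Return True if the previous output of the tile may be reused,
    False if the graphics pipeline should render the tile."""
    if previous is None:
        return False  # nothing to compare against
    # First comparison: per-tile vertex shader data.
    if current.vertex_shader_data != previous.vertex_shader_data:
        return False  # mismatch, so the rendering data is never compared
    # Second comparison: per-tile rendering data.
    return current.rendering_data_hash == previous.rendering_data_hash


previous = TileSignature(frozenset({"vs_ui"}), 0x1234)
unchanged = TileSignature(frozenset({"vs_ui"}), 0x1234)
new_shader = TileSignature(frozenset({"vs_ui", "vs_sprite"}), 0x1234)
reuse = can_reuse_previous_output(unchanged, previous)
```

Performing the vertex shader comparison first means that a tile whose shader set has changed is sent to rendering without the second comparison being carried out.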

The method may also include storing rendering range data indicative of one or more characteristics of the rendering and, before determining whether the output of a previous rendering for the tile can be used as the output of the rendering, using the rendering range data to check whether to skip the comparisons of per-tile vertex shader data and per-tile rendering data and cause the graphics pipeline to render the tile. The rendering range data may include a clear color, and using the rendering range data to check whether to skip the comparisons may include determining whether the clear color matches the clear color of the previous rendering. The rendering range data may include a valid flag, and using the rendering range data to check whether to skip the comparisons may include determining whether the valid flag has a predetermined value. The method may set the valid flag to the predetermined value based on at least one of: data indicating that the rendering is part of a scene using a plurality of render targets; and the rendering including more draw calls than a threshold number.
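A minimal Python sketch of these render-wide checks is given below. The text does not state which flag value means "skip", so the sketch assumes that a False flag disables the per-tile comparisons; the record `RenderWideData`, the field `clear_color` (the color to which the render output is initialized), and the threshold of 100 draw calls are likewise illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

DRAW_CALL_THRESHOLD = 100  # assumed threshold, not taken from the text


@dataclass(frozen=True)
class RenderWideData:
    """Assumed record of characteristics that apply to a whole rendering."""
    clear_color: Tuple[float, float, float, float]  # initial color of the output
    uses_multiple_render_targets: bool
    draw_call_count: int


def valid_flag(data: RenderWideData, threshold: int = DRAW_CALL_THRESHOLD) -> bool:
    """Assumption: the flag is cleared when reuse is unlikely to pay off."""
    return not (data.uses_multiple_render_targets
                or data.draw_call_count > threshold)


def should_skip_comparisons(current: RenderWideData,
                            previous: Optional[RenderWideData],
                            threshold: int = DRAW_CALL_THRESHOLD) -> bool:
    """Return True if the per-tile comparisons should be skipped and the
    tiles rendered normally."""
    if previous is None:
        return True
    if not valid_flag(current, threshold):
        return True
    # A different initial color changes the output even if the geometry matches.
    return current.clear_color != previous.clear_color


previous = RenderWideData((0.0, 0.0, 0.0, 1.0), False, 12)
current = RenderWideData((0.0, 0.0, 0.0, 1.0), False, 12)
skip = should_skip_comparisons(current, previous)
```

These checks touch only a few values per rendering, so they can rule out the per-tile comparisons cheaply before any per-tile data is read.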

The method may further include, for at least one selected tile of the plurality of tiles, storing iii) per-tile validity data indicating whether to skip a comparison of per-tile vertex shader data and per-tile rendering data, wherein the per-tile validity data may be set based on a number of processed primitives located within the tile.

The per-tile rendering data may include vertex coordinates and vertex state data for each of the processed primitives located within the tile. Storing the representation of per-tile rendering data may include generating a hash of the vertex coordinates and vertex state data for each of the processed primitives located within the tile, and storing the hash value. The vertex state data may include data associated with each vertex and used to render the processed primitives in the tile, including one or more of: pixel shader identifiers, varyings, color data, surface normal data, and texture data.
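Generating a single hash over the vertex coordinates and vertex state data of a tile might look like the following Python sketch. The choice of SHA-256, the packing of coordinates as 32-bit floats, and the representation of vertex state as a dictionary are assumptions made for illustration only; a hardware implementation would likely use a much cheaper hash function.

```python
import hashlib
import struct


def hash_tile_rendering_data(primitives):
    """Hash the vertex coordinates and vertex state of every processed
    primitive in a tile, in control-stream order.

    Each primitive is a list of ((x, y, z), state) pairs, where `state`
    is a dict of per-vertex state (an assumed layout)."""
    digest = hashlib.sha256()
    for vertices in primitives:
        for (x, y, z), state in vertices:
            # Fixed-width binary encoding of the coordinates.
            digest.update(struct.pack("<3f", x, y, z))
            # Sort the state keys so the encoding does not depend on dict order.
            digest.update(repr(sorted(state.items())).encode("utf-8"))
    return digest.hexdigest()


state = {"pixel_shader": 7, "color": (255, 255, 255, 255)}
triangle = [((0.0, 0.0, 0.5), state),
            ((16.0, 0.0, 0.5), state),
            ((0.0, 16.0, 0.5), state)]
tile_hash = hash_tile_rendering_data([triangle])
```

Storing one hash value per tile rather than the underlying data keeps both the stored representation and the later comparison small.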

A second aspect provides a graphics processing system configured to implement a tile-based graphics pipeline in which a rendering space is subdivided into a plurality of tiles, the graphics processing system comprising: geometry processing logic configured to receive rendered geometry data, the geometry data comprising a plurality of primitives, each primitive associated with one or more vertex shader programs, and process the geometry data using the one or more vertex shader programs to generate one or more processed primitives; a tiling unit configured to determine which of the processed primitives are located within each tile; a data characterization unit configured to store, in memory, for at least one selected tile of a plurality of tiles, i) a representation of per-tile vertex shader data that identifies one or more vertex shader programs used to generate processed primitives located in the tile, and ii) a representation of per-tile rendering data that characterizes data that is usable to render the processed primitives within the tile in a subsequent stage of the graphics pipeline; and a test unit configured to determine, for the or each selected tile, whether a previously rendered output of the tile can be used as the rendered output by comparing per-tile vertex shader data of the tile with previously rendered vertex shader data of the tile before comparing per-tile rendering data of the tile with previously rendered per-tile rendering data.

To determine whether a previously rendered output of the tile is available as the rendered output, the test unit may be further configured to: determining whether the per-tile vertex shader data matches corresponding per-tile vertex shader data previously rendered; in response to determining that the per-tile vertex shader data matches, determining whether per-tile rendering data for the tile matches corresponding per-tile rendering data previously rendered; and in response to determining that the per-tile rendering data matches, using a previously rendered output of the tile as the rendered output.

To determine whether a previously rendered output of the tile is available as the rendered output, the test unit may be further configured to: in response to determining that the per-tile vertex shader data does not match, the graphics pipeline is caused to render the tile. To determine whether a previously rendered output of the tile is available as the rendered output, the test unit may be further configured to: in response to determining that the per-tile rendering data does not match, the graphics pipeline is caused to render the tile.

The data characterization unit may be further configured to store rendering range data indicative of one or more characteristics of the rendering, and the test unit may, before determining whether a previously rendered output of the tile is available as an output of the rendering, use the rendering range data to check whether to skip the comparisons of per-tile vertex shader data and per-tile rendering data and cause the graphics pipeline to render the tile. The data characterization unit may be further configured to store, for at least one selected tile of the plurality of tiles, iii) per-tile validity data indicating whether to skip the comparisons of per-tile vertex shader data and per-tile rendering data and instead render the primitives located within the tile, wherein the per-tile validity data is set based on a number of processed primitives located within the tile.

A third aspect provides a graphics processing system configured to perform the above method.

The graphics processing system may be embodied in hardware on an integrated circuit. A method of manufacturing a graphics processing system at an integrated circuit manufacturing system may be provided. An integrated circuit definition data set may be provided that, when processed in an integrated circuit manufacturing system, configures the system to manufacture a graphics processing system. A non-transitory computer readable storage medium having stored thereon a computer readable description of a graphics processing system that, when processed in an integrated circuit manufacturing system, causes the integrated circuit manufacturing system to manufacture an integrated circuit embodying the graphics processing system may be provided.

An integrated circuit manufacturing system may be provided, the integrated circuit manufacturing system comprising: a non-transitory computer readable storage medium having stored thereon a computer readable description of a graphics processing system; a layout processing system configured to process the computer readable description to generate a circuit layout description of an integrated circuit embodying the graphics processing system; and an integrated circuit generation system configured to fabricate the graphics processing system in accordance with the circuit layout description.

A computer program code for performing any of the methods described herein may be provided. A non-transitory computer readable storage medium may be provided having stored thereon computer readable instructions that, when executed at a computer system, cause the computer system to perform any of the methods described herein.

As will be apparent to those skilled in the art, the above features may be suitably combined and combined with any of the aspects of the examples described herein.

Drawings

Examples will now be described in detail with reference to the accompanying drawings, in which:

FIG. 1 illustrates an example of a graphics processing system;

FIG. 2 shows a schematic diagram of a rendering of a scene;

FIG. 3 illustrates an example of a graphics processing system capable of performing redundancy testing to detect and avoid redundant rendering;

FIG. 4 illustrates data provided to and generated by a data characterization unit in accordance with a first technique;

FIG. 5 illustrates a flow chart of the operation of the system of FIG. 4;

FIG. 6 illustrates data provided to and analyzed by a test unit in accordance with a first technique;

FIG. 7 shows a flow chart of the operation of the system of FIG. 6;

FIG. 8 illustrates data provided to and generated by a data characterization unit in accordance with a second technique;

FIG. 9 shows a flow chart of the operation of the system of FIG. 8;

FIG. 10 illustrates data provided to and analyzed by a test unit in accordance with a second technique;

FIG. 11 shows a flow chart of the operation of the system of FIG. 10;

FIG. 12 illustrates a plurality of bitmasks generated for each tile when primitive block data is partitioned into a plurality of primitive block segments;

FIGS. 13A and 13B illustrate how current and previously rendered information may be stored in memory;

FIG. 14 illustrates a computer system in which a graphics processing system described herein may be implemented; and

FIG. 15 illustrates an integrated circuit manufacturing system for generating an integrated circuit embodying a graphics processing system.

The figures illustrate various examples. Skilled artisans will appreciate that element boundaries (e.g., boxes, groups of boxes, or other shapes) illustrated in the figures represent one example of the boundaries. In some examples, it may be the case that one element may be designed as multiple elements, or that multiple elements may be designed as one element. Where appropriate, common reference numerals have been used throughout the various figures to indicate like features.

Detailed Description

The following description is presented by way of example to enable a person skilled in the art to make and use the invention. The invention is not limited to the embodiments described herein and various modifications to the disclosed embodiments will be apparent to those skilled in the art.

Embodiments will now be described by way of example only.

The present disclosure relates to techniques for reducing redundant processing when performing a sequence of renderings using a graphics processing system. The graphics processing system includes a graphics processing unit that identifies when two rendering operations (belonging to two different renderings) will result in the same rendering output for at least a portion of an image (e.g., one or more tiles of the image). A rendering (e.g., of a tile) that is the same as a corresponding previous rendering (e.g., both renderings would produce the same set of rendering values for the tile/image) is referred to herein as a redundant rendering. The corresponding image or portion thereof may be referred to as a redundant image or redundant tile, as appropriate.

FIG. 3 illustrates an example of a graphics processing system 300 configured to implement the techniques for detecting and avoiding redundant rendering described below. Graphics processing system 300 includes graphics processing unit 302 and memory blocks 304₁, 304₂ and 304₃. Each of the memory blocks is external to the graphics processing unit 302. Each memory block may or may not form part of the same physical memory. Graphics processing unit 302 includes geometry processing logic 306, tiling unit 308, data characterization unit 310, test unit 312, and rendering logic 314. The rendering logic includes a fetch unit 316 and fragment processing logic 318.

Note that geometry processing logic 306 and rendering logic 314 are shown in FIG. 3 as separate components because they perform different functions and, in some examples, are implemented in physically separate hardware; however, in some other examples, geometry processing logic 306 and rendering logic 314 may share processing resources, e.g., such that they are implemented by the same physical processing hardware, where the processing hardware may switch between performing the functions of geometry processing logic 306 and performing the functions of rendering logic 314.

Graphics processing unit 302 receives graphics data submitted by an application 324 running on a host computer system 322 (e.g., a CPU). The host computer system 322 also includes a graphics driver 326. The computer system may execute the application 324 to invoke application instructions. These application instructions may take the form of rendering requests submitted by the application. A rendering request may include one or more draw calls. A draw call is a command that specifies certain components of a scene (e.g., a portion of a scene) to be rendered. The draw call may, for example, specify one or more geometric items or structures of the scene to be rendered. One or more such draw calls may need to be executed to perform one rendering. That is, a single rendering request submitted by an application may be made up of one or more draw calls.

The driver 326 receives the rendering request and causes the graphics data associated with the request (and thus with the one or more draw calls that make up the rendering request) to be submitted to the graphics processing unit 302. The graphics data may be stored in an external memory (not shown in FIG. 3), or it may be stored within computer system 322 and submitted directly from driver 326.

Graphics processing unit 302 operates to perform a rendering as part of rendering an image of a scene. In order to render a scene, the graphics processing unit may need to perform multiple renderings. A rendered image may then be formed from the multiple rendering outputs. Thus, a single rendering may not directly correspond to rendering an image (but may do so in some cases). To perform a rendering, the graphics processing unit may execute one or more draw calls submitted by application 324 to render the geometry associated with those draw calls, thereby generating rendered image data. Graphics processing unit 302 performs rendering according to a graphics pipeline. That is, the graphics processing unit 302 implements a graphics pipeline to render image data. In this example, the graphics pipeline is a tile-based rendering pipeline, such as a tile-based deferred rendering pipeline.

In the geometry processing stage, geometry processing logic 306 performs geometry processing functions including clipping and culling to remove primitives that do not fall within the visible view. Geometry processing logic 306 may also project primitives into screen space (shown schematically at 202 in FIG. 2). Geometry processing logic 306 may also execute one or more vertex shader programs on the primitives, which may programmatically manipulate the primitives (e.g., transform the primitives, light the primitives, move the primitives, rotate the primitives, deform the primitives, replicate the primitives, or change the primitives or their associated attributes in any other manner). The processed primitives output from geometry processing logic 306 are passed to tiling unit 308, which determines which primitives at least partially intersect/overlap each tile (e.g., tiles 204A-D) of the rendering space of graphics processing system 300. The tiling unit 308 may assign primitives to tiles of the rendering space by creating a control stream (or "display list"/"tile list") for each tile, wherein the control stream of a tile includes indications of the primitives that are at least partially present within that tile. The processed primitive data is sorted and stored in memory 304₂ in data structures called primitive blocks (by geometry processing logic 306 or tiling unit 308), and the control streams are output from tiling unit 308 and stored in memory 304₂.

In the rendering phase, the rendering logic 314 renders graphics data of tiles of the rendering space to generate rendering values, such as rendered image values. Rendering logic 314 may be configured to implement any suitable rendering technique, such as rasterization or ray tracing, to perform rendering. To render a tile, the fetch unit 316 fetches, from memory 304₂, the control stream of the tile and the primitives associated with the tile from the primitive blocks. The fragment processing logic 318 may perform operations including hidden surface removal and shading and/or texturing on primitive fragments (i.e., fragments formed by sampling primitives) to form rendered image values for tiles. Texturing and/or shading may be performed by executing a suitable fragment shader program. The rendered image values (e.g., pixel color values) may then be transferred to memory 304₃ for storage. The rendered image may be output from graphics processing system 300 and used in any suitable manner, such as for display on a display or stored in memory or transmitted to another device, etc.

Graphics processing unit 302 identifies redundant renderings by generating and storing information associated with the current rendering (i.e., the rendering being performed by the graphics processing unit) and comparing that information to corresponding information of a previous rendering (i.e., a rendering that was processed prior to the current rendering). If the information of the two renderings matches, the current rendering is identified as redundant. If the information does not match, the current rendering is identified as non-redundant. The information of the current rendering is stored and compared to the information of the previous rendering before the graphics processing unit completes the current rendering. In this way, if the current rendering is identified as redundant, at least some of the processing required to complete the rendering may be avoided.

Note that the previous rendering may be a rendering immediately before the current rendering, but this is not essential. For example, the current rendering and the previous rendering that is compared to the information of the current rendering may be separated by one or more intermediate renderings. In some examples, the image is created from multiple renderings of different rendering types. Examples of rendering types include rendering to a frame buffer, rendering to a texture, rendering a shadow map, and so forth. In these examples, the previous rendering may be a previously processed rendering having the same rendering type as the current rendering.
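Selecting the previous rendering by rendering type, rather than simply taking the immediately preceding rendering, can be sketched as below. This Python illustration assumes that only the most recent information per rendering type is retained; the class `PreviousRenderStore` and the type labels are hypothetical names introduced here.

```python
class PreviousRenderStore:
    """Keeps the stored information of the most recent rendering of each
    rendering type (an assumed retention policy)."""

    def __init__(self):
        self._latest = {}

    def previous(self, render_type):
        """Return the stored information of the last rendering of this
        type, or None if no such rendering has been processed yet."""
        return self._latest.get(render_type)

    def record(self, render_type, info):
        """Store the information of the rendering that has just been processed."""
        self._latest[render_type] = info


store = PreviousRenderStore()
store.record("frame_buffer", {"tile_hashes": [1, 2, 3]})
store.record("shadow_map", {"tile_hashes": [9, 9, 9]})
# The intermediate shadow-map rendering does not displace the frame-buffer entry.
reference = store.previous("frame_buffer")
```

With this arrangement, a frame-buffer rendering is compared against the last frame-buffer rendering even when renderings of other types were processed in between.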

While avoiding redundant rendering is generally desirable, it is also important that the process of generating, storing, and comparing information to identify redundant renderings does not itself cause so much bandwidth usage, power consumption, and processing delay that the benefits of avoiding redundant rendering are outweighed. According to examples described herein, a graphics processing unit may implement several different techniques in order to detect redundant rendering. These techniques differ from one another in the type of information that is stored and used to detect that a rendering is the same as a previous rendering corresponding to the same region of the image. These techniques may also differ in the stage of the graphics pipeline implemented by the graphics processing unit at which the information is collected. These techniques aim to optimize redundant rendering avoidance by ensuring that the processing for detecting redundant rendering is only performed when there is a reasonable likelihood of redundant rendering, and/or by reducing the amount of computation performed and data stored.

In order to detect redundant renderings, information/data about a given rendering must be analyzed. The information available about a rendering varies depending on where in the graphics pipeline the information is read. For example, information about a rendering may be read before the geometry processing stage (denoted as pre-geometry stage data) or after the geometry processing stage (denoted as post-geometry stage data).

The pre-geometry stage data includes information about the rendering that is available before the geometry processing stage is completed (e.g., before the geometry processing stage begins) and is primarily related to the geometry of the entire scene to be rendered (but may also include information about how fragments are to be processed later in the pipeline). The information may include geometry data and state data associated with one or more draw calls that make up the rendering. If the rendering consists of multiple draw calls, each of these draw calls may be associated with its own state data. The submission of a draw call (e.g., from a running application) results in the geometry data of the draw call being submitted to a graphics processing unit for processing. Thus, the geometry data received at the graphics processing unit for a particular draw call is associated with the state data for that draw call. The pre-geometry stage state data may include information such as the vertex shader programs to be applied to primitives at the geometry processing stage. Additional pre-geometry stage state data may include information such as the number and/or identity of the draw calls of the rendering, information about whether any advanced rendering techniques are used, such as Multiple Render Targets (MRTs), and the transparent color of the rendering (this is the initialized color of the rendering output, i.e., the color that the output would have if no primitives were rendered).

Given that the pre-geometry stage data contains all the information required by the graphics processor to render the scene, it is possible to detect redundant rendering using only the pre-geometry stage data. For example, geometry data and its associated state data (from the one or more draw calls making up the rendering) may be compared to the equivalent data of a previous rendering. This comparison may be performed before any geometry processing is performed, and thus all redundant processing may be avoided. Alternatively, the geometry processing may be started and the comparison made in parallel, so that the result is known before the geometry processing stage of the current rendering is completed. For example, the comparison may be completed before the vertex post-processing part of the geometry processing stage. If the current rendering matches the previous rendering, then the current rendering may be determined to be redundant before performing the vertex post-processing (and all subsequent) stages in the graphics pipeline. However, the pre-geometry stage data relates to the entire scene, i.e., not just to a part of the final rendering, such as a tile. This greatly reduces the likelihood of detecting redundant rendering, as even small differences in individual primitives anywhere in the scene cause the data (and the final scene) to differ. Thus, using the pre-geometry stage data in isolation may be inefficient when the gain from avoiding redundant rendering is weighed against the processing and memory required to detect it.

Alternatively, the pre-geometry stage data may be compared to the data of a previous rendering at a later stage of the graphics pipeline. One convenient stage of the pipeline at which to perform such a comparison is the transition between the geometry processing and rendering stages of the tile-based pipeline (i.e., after the end of the tiling stage of the current rendering). This allows the data of the current and previous renderings to be compared on a per-tile basis. This provides significantly finer granularity and thus higher redundant rendering detection rates, as some parts of the scene may have changed while other parts have not. Tiles associated with the unchanged portions of the scene can then be detected as redundant. If the pre-geometry stage data of the current and previous renderings matches for a given tile (along with the additional information outlined below), that tile may be determined to be redundant and thus the rendering stage for that tile may be avoided (but not the geometry processing and tiling).
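The gain in granularity can be illustrated with a small sketch. The function `redundant_tiles` and the representation of per-tile data as `bytes` keyed by an integer tile ID are assumptions for illustration only; the text does not prescribe a format.

```python
from typing import Dict, List


def redundant_tiles(current: Dict[int, bytes],
                    previous: Dict[int, bytes]) -> List[int]:
    """Return the IDs of tiles whose data in the current rendering equals
    the data recorded for the same tile in the previous rendering."""
    return sorted(
        tile_id
        for tile_id, data in current.items()
        if tile_id in previous and previous[tile_id] == data
    )


# Only tile 2 changed, so a whole-scene comparison fails while three of the
# four tiles are still detected as redundant.
previous = {0: b"sky", 1: b"sky", 2: b"car@10", 3: b"road"}
current = {0: b"sky", 1: b"sky", 2: b"car@12", 3: b"road"}
scene_matches = current == previous
unchanged = redundant_tiles(current, previous)
```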

In other examples, renderings may be compared using post-geometry stage data, which becomes available at or near the end of the geometry processing stage. In these examples, this information may characterize the screen-space primitive content of tiles within the rendering space. This information may include, for example, an indication of which primitives and/or vertices are located within the tile, as well as information regarding the rendering stage for each of the primitives located within the tile (e.g., which pixel/fragment shaders are needed to render each primitive, the resources of the graphics processing unit needed to render the primitive, etc.). This information may again be compared to the information of a previous rendering on a per-tile basis, allowing redundant tiles to be identified prior to the rendering stage. However, the amount of this information (details of all primitives/vertices of each tile, plus associated rendering stage state data) can become very large, and thus the processing required to analyze and compare this information and store the results is significant.

The graphics processing unit 302 of fig. 3 aims to balance the potential benefits of avoiding redundant rendering against the computational, memory and power costs of detecting redundancy. Graphics processing unit 302 includes a data characterization unit 310 that receives pre-geometry stage data (from geometry processing logic 306) and post-geometry stage data (from tiling unit 308), determines whether the rendering is suitable for redundancy testing and, if so, generates data characterizing the rendering and stores it in memory 304₁. The graphics processing unit 302 also includes a test unit 312 that operates at the per-tile rendering stage of the pipeline. Test unit 312 reads from memory 304₁ the data regarding the current tile rendering and the previous tile rendering and, if this indicates that the rendering is suitable for redundancy testing, compares the data between the current rendering and the previous rendering. If the data matches, this indicates that the rendering of the tile is redundant; the rendering may be skipped and the output from the previous tile rendering reused as the current output. The previously rendered output may be retrieved, for example, from memory, such as from a frame buffer, a background buffer, or an intermediate storage buffer such as a rendering target. If the rendering is deemed redundant, the test unit 312 may send a signal to the fetch unit 316 so that it does not read the control stream and primitive block data from memory (or stops reading if it has already started), thereby saving memory bandwidth. Likewise, the fragment processing logic 318 does not need to process the fragments/pixels of the tile, thereby saving processing and power consumption.
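The decision made for each tile at the rendering stage can be sketched as follows. This is a software analogy under stated assumptions, not the hardware of test unit 312: `render_tile`, its parameters, and the dictionaries standing in for stored characterization data and previous outputs are all hypothetical.

```python
from typing import Callable, Dict, Tuple


def render_tile(tile_id: int,
                suitable: bool,
                current: Dict[int, bytes],
                previous: Dict[int, bytes],
                previous_output: Dict[int, str],
                render: Callable[[int], str]) -> Tuple[str, bool]:
    """Return (output, skipped). The tile is skipped only if the rendering
    is suitable for testing, the stored data matches, and a previous
    output exists to be reused."""
    if (suitable
            and tile_id in previous
            and tile_id in previous_output
            and previous[tile_id] == current[tile_id]):
        return previous_output[tile_id], True   # reuse, no fetch or shading
    return render(tile_id), False               # fetch and shade as normal
```

Note that an unsuitable rendering bypasses the comparison entirely, so no characterization data needs to be read for it.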

Two techniques are disclosed herein to achieve these goals. The first technique may be referred to as "pre-geometry stage data comparison followed by post-geometry stage data comparison" and is described with reference to fig. 4 to 7. The second technique may be referred to as "primitive block comparison" and is described with reference to fig. 8 to 12. Aspects of the two techniques may also be combined, as will be apparent to those skilled in the art.

Pre-geometry stage data comparison followed by post-geometry stage data comparison

The "pre-geometry stage data comparison followed by post-geometry stage data comparison" technique is described first with reference to fig. 4 and 5, which illustrate the process of generating and storing characterization data for the current rendering. The redundancy test process is then described with reference to fig. 6 and 7.

Fig. 4 shows geometry processing logic 306, tiling unit 308, and data characterization unit 310 of fig. 3 in more detail. In particular, fig. 4 shows data provided to and generated by the data characterization unit 310 and stored in memory. Geometry processing logic 306 is shown to include a vertex shader unit 402 and a vertex post-processing unit 404. In some examples, geometry processing logic 306 may also include a primitive block generator (not shown in fig. 4, but described in more detail with reference to fig. 8 below). Vertex shader unit 402 receives primitive data to be processed. The vertex shader unit may execute one or more vertex shaders on the primitive data. The vertex shader unit may, for example, operate to perform one or more geometric transformations to transform primitive data from model space to screen space. It may also perform lighting and/or shading operations, or programmatically alter the primitive data in any suitable manner. The transformed vertex data is then output from vertex shader unit 402 to vertex post-processing unit 404. The vertex post-processing unit performs a number of operations on the transformed primitive data to generate processed primitives, including clipping, projection, and culling in this example.

The operation of the system of fig. 4 will now be described with reference to the flowchart of fig. 5. In step 502, geometry data and associated state data for a rendering are received at geometry processing logic 306 within GPU 302. The geometry data of the rendering includes a plurality of primitives that describe the surfaces of the geometry items to be rendered. The primitive data may include vertex data for one or more input primitives. Each primitive/vertex is associated with state data that describes how the primitive should be rendered through the graphics pipeline. For example, the state data may include information such as the vertex shader programs to be applied to primitives/vertices at the geometry processing stage, and rendering range data such as the number and/or identity of the draw calls of the rendering, information regarding whether any advanced rendering techniques are used, such as Multiple Render Targets (MRTs), and the transparent color of the rendering. The state data may also include data related to how the primitives are processed later in the pipeline (e.g., at the rendering stage), such as fragment shaders, vertex varyings, texture information, etc. (where "varyings" are attributes associated with each vertex, including, for example, color data, normal data, texture coordinates, or any other data that may be used as part of the rasterization process). As described above, the geometry data may be submitted by a driver running on the host CPU. In some examples, the data may be submitted directly to the GPU; in other examples, some data may be written to memory and references to that memory submitted to the GPU (optionally along with other data).

Note that shaders of the geometry processing stage (e.g., vertex and geometry shaders) typically operate (e.g., are executed) on primitives or vertices of primitives, while shaders of the rendering stage (e.g., pixel/fragment shaders) typically operate (e.g., are executed) on fragments. It is further noted that the state data mentioned above is only an example of the state data that may be submitted, and that further data items may also be included. Examples of additional state data that may also be received include one or more of the following: an indication of the type of draw call (e.g., whether the draw call is indexed, instanced, etc.); the arguments of the draw call (e.g., the number of vertices of one or more primitives to render); resources of the graphics processing unit to be used for processing the primitive data of the draw call (e.g., an indication of a vertex buffer or index buffer to be used); and an indication of a render target state (e.g., render target blend state or depth stencil state).

In step 504, geometry processing logic 306 processes the geometry data of the rendering to generate one or more processed primitives. In particular, geometry processing logic 306 processes primitives using the one or more vertex shader programs associated with the plurality of primitives. Generating one or more processed primitives using one or more vertex shader programs may include executing a vertex shader program on the data of the associated primitives and/or their associated vertices, where the program changes or manipulates the primitives (e.g., transforms, lights, moves, rotates, deforms, or replicates the primitives, or changes the primitives or their associated attributes in any other manner). The processed primitives may then be further processed by vertex post-processing unit 404 (or by any other additional geometry stage processing blocks not shown in FIG. 4, such as hull shaders, tessellation, and domain shaders). The processed primitives are then provided to tiling unit 308. Thus, a processed primitive may refer to a primitive that has been subjected to one or more of the following operations: vertex/geometry shading, clipping, projection, and culling.

In step 506, vertex shader data and rendering range state data are provided to data characterization unit 310. The term "rendering range" is intended to refer to data that applies to the rendering as a whole, e.g., to all primitives of the rendering. This is to be distinguished from "per-tile" data, which applies only to a particular tile or to the primitives within that tile. An example of vertex shader data and rendering range state data is shown in the pre-geometry stage data block 406 in fig. 4. The pre-geometry stage data block 406 includes rendering range state data 408 and vertex shader data 410. In the example of fig. 4, rendering range state data 408 includes the transparent color of the rendering, a count of the number of draw calls in the rendering, and a flag indicating whether advanced rendering techniques (such as MRTs) are used in the rendering. In the example of FIG. 4, vertex shader data 410 includes a data structure that maps the identifier of each primitive (denoted as a "primitive ID") to the identifier of the associated vertex shader (denoted as a "shader ID"). It is noted that although FIG. 4 only shows one vertex shader mapped to each primitive, in some examples a primitive may have zero or more associated shader programs.
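One possible in-memory layout for data block 406 is sketched below. The class and field names, the RGBA tuple used for the transparent color, and the choice of a tuple of shader IDs per primitive (to allow zero or more shaders) are assumptions; fig. 4 itself only shows the logical content.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class RenderingRangeState:
    """Stand-in for rendering range state data 408."""
    transparent_color: Tuple[float, float, float, float]  # assumed RGBA
    draw_call_count: int
    uses_advanced_techniques: bool  # e.g. multiple render targets


@dataclass
class PreGeometryStageData:
    """Stand-in for pre-geometry stage data block 406."""
    rendering_range_state: RenderingRangeState
    # Vertex shader data 410: primitive ID -> IDs of its vertex shaders.
    vertex_shader_data: Dict[int, Tuple[int, ...]] = field(default_factory=dict)


block_406 = PreGeometryStageData(
    RenderingRangeState((0.0, 0.0, 0.0, 1.0), draw_call_count=2,
                        uses_advanced_techniques=False),
    vertex_shader_data={0: (7,), 1: (7,), 2: (9,)},
)
```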

Note that in some examples, the pre-geometry stage data block 406 is not provided by the geometry processing logic 306 to the data characterization unit 310, but may be provided by an earlier unit (not shown) in the graphics processing unit or directly by a driver. This is illustrated by the dashed line in fig. 5. Alternatively, a portion of the pre-geometry stage data block 406 may be provided by geometry processing logic 306, while another portion is provided by a portion of the GPU that precedes geometry processing logic 306 in the graphics pipeline (e.g., rendering range state data 408 is submitted directly to data characterization unit 310 by driver 326, and vertex shader data 410 is provided by geometry processing logic 306).

In step 508, the tiling unit 308 determines which of the processed primitives from the geometry processing logic 306 are located within each tile of the plurality of tiles. As used herein, the term "located", as it relates to primitives and tiles, means "at least partially located", i.e., the primitive intersects or overlaps the tile. Thus, a primitive located within a tile may be located partially within the tile, or entirely within the tile.
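A simple way to approximate the "at least partially located" test is a bounding-box overlap check, sketched below. This is an assumption for illustration: the text does not say how tiling unit 308 performs the test, and a bounding box is conservative, so it may list a tile that the box touches but the primitive itself does not. The 32-pixel tile size and the function name are also hypothetical.

```python
import math
from typing import List, Sequence, Tuple

TILE_SIZE = 32  # assumed tile edge length in pixels


def tiles_overlapping(vertices: Sequence[Tuple[float, float]],
                      tiles_x: int, tiles_y: int) -> List[Tuple[int, int]]:
    """Return (tile_x, tile_y) for every tile overlapped by the
    screen-space bounding box of the primitive's vertices."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    # Clamp the covered tile range to the rendering space.
    tx0 = max(math.floor(min(xs) / TILE_SIZE), 0)
    tx1 = min(math.floor(max(xs) / TILE_SIZE), tiles_x - 1)
    ty0 = max(math.floor(min(ys) / TILE_SIZE), 0)
    ty1 = min(math.floor(max(ys) / TILE_SIZE), tiles_y - 1)
    return [(tx, ty)
            for ty in range(ty0, ty1 + 1)
            for tx in range(tx0, tx1 + 1)]
```

A primitive whose bounding box lies entirely outside the rendering space yields an empty list, which corresponds to the primitive being located in no tile.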

The tiling unit 308 may generate a display list for each tile indicating which primitives are located within the tile. These display lists may alternatively be referred to as control streams or tile lists. Each display list created by tiling unit 308 may not actually include the data for the primitives indicated in the list (e.g., vertex data for the primitives). Instead, each display list may contain an indication of each primitive located within the tile (e.g., primitive IDs in the associated primitive block). This reduces storage requirements by avoiding the need to store duplicate copies of primitive data for primitives located within more than one tile. The primitive ID stored in the display list for each tile may then be used to index the data for that primitive stored within the primitive block. The primitives located within a tile need not all belong to a single primitive block; in some cases they may belong to multiple primitive blocks. Thus, the display list for each tile may index one or more primitive blocks. The display list for each tile is output by tiling unit 308 and stored in memory.
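The indirection between display lists and primitive blocks can be sketched as follows. For simplicity this assumes a single primitive block modelled as a dictionary; the function names and data shapes are illustrative and are not taken from the text.

```python
from collections import defaultdict
from typing import Dict, List


def build_display_lists(primitive_tiles: Dict[int, List[int]]) -> Dict[int, List[int]]:
    """Invert 'primitive ID -> tile IDs' into 'tile ID -> primitive IDs'.
    Each display list holds IDs only, never the vertex data itself."""
    display_lists: Dict[int, List[int]] = defaultdict(list)
    for prim_id in sorted(primitive_tiles):
        for tile_id in primitive_tiles[prim_id]:
            display_lists[tile_id].append(prim_id)
    return dict(display_lists)


def fetch_primitives(display_list: List[int],
                     primitive_block: Dict[int, str]) -> List[str]:
    """Use the IDs in a display list to index the shared primitive block."""
    return [primitive_block[prim_id] for prim_id in display_list]


# Primitive 0 spans tiles 0 and 1 but its data is stored only once.
primitive_block = {0: "verts_P0", 1: "verts_P1", 2: "verts_P2"}
lists = build_display_lists({0: [0, 1], 1: [1], 2: [1, 3]})
```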

In step 510, the tiling unit 308 provides per-tile rendering data to the data characterization unit 310. An example of per-tile rendering data is shown in the post-geometry stage data block 412 in fig. 4. The post-geometry stage data block 412 of fig. 4 includes per-tile rendering data 414 organized as a list in which each tile identifier (denoted as "tile ID") is followed by a data block associated with that tile. In this example, the data associated with each tile includes: a set of primitive identifiers (denoted as "primitive IDs"); a set of vertex coordinates (denoted as "vertex coordinates"); and a set of vertex state data (denoted as "vertex state"). The primitive IDs of a tile indicate which primitives are located in the tile. The vertex coordinates list the coordinates (e.g., x, y, and z coordinates) of each vertex of each primitive in the tile. If the primitive is a triangle, there are three vertices, each with x, y, z coordinates; these are represented in FIG. 4 as xyz_P0V0 for the x, y, z coordinates of vertex 0 in primitive 0, xyz_P0V1 for the x, y, z coordinates of vertex 1 in primitive 0, xyz_P0V2 for the x, y, z coordinates of vertex 2 in primitive 0, and so on. The vertex state lists the state data for each vertex of each primitive in the tile. If the primitive is a triangle, there are three vertices, each with associated state data; these are represented in FIG. 4 as sd_P0V0 for the state data of vertex 0 in primitive 0, sd_P0V1 for the state data of vertex 1 in primitive 0, sd_P0V2 for the state data of vertex 2 in primitive 0, and so on. Vertex state data may be associated with stages of the pipeline that have not yet been executed for the current rendering. In an example of a tile-based graphics pipeline that includes geometry processing and rendering stages, the state data may be data associated with the rendering stage of the pipeline (because the geometry processing stage has been completed for the primitives).
The state data may, for example, include an indication of which shaders to execute to render the primitives in the tile at the rendering stage. The state data may include a shader ID and/or an indication of the shader resources of a shader to be executed to render the primitives of the tile. Where the state data relates to a primitive as a whole (e.g., for rendering) rather than to an individual vertex, the state data of the primitive may be associated with one vertex (e.g., the first vertex) of the primitive. The state data may also include vertex varyings and texture information.

In step 512, the data characterization unit 310 generates and stores in memory a representation of the per-tile vertex shader data and per-tile rendering data. In some examples, the data characterization unit 310 also generates and stores in memory rendering range redundancy data that indicates one or more characteristics of the rendering that are useful in the redundancy detection described later. The data characterization unit may cause this information to be stored in the external memory block 304₁. Alternatively (or additionally), some or all of this information may be stored locally to the graphics processor, for example in registers or in a cache memory in the data characterization unit. This information will be used to compare the current rendering with the previous rendering to determine whether part or all of the current rendering is redundant (as will be described in more detail below).

Fig. 4 illustrates an exemplary data characterization output block 416 generated by the data characterization unit 310. The data characterization output block 416 includes a header 418 and per-tile characterization data 420. The header 418 includes one or more valid flags generated by the data characterization unit 310 that may be used to indicate whether the current rendering is suitable for testing redundant rendering. This may save processing at the test stage, as described below. In one example, a valid flag may be set to indicate that the entire current rendering is unsuitable for testing redundant rendering, and thus all associated redundancy tests may be skipped. The valid flag may be set to a predetermined value by the data characterization unit 310 based on the rendering range state data 408. For example, if the render range state data 408 indicates that the rendering is part of a scene using multiple render targets (or other advanced rendering techniques), or that the rendering includes more draw calls than a threshold number, the valid flag may be set to a predetermined value. These are indications of complex renderings that are unlikely to benefit from redundant rendering testing. In some examples, if the data characterization unit 310 determines that the valid flag should be set to indicate that the entire current rendering is not suitable for testing redundant rendering, then the remainder of the data in the data characterization output block 416 need not be generated and stored.
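The rendering range valid flag can be sketched as a simple predicate. The threshold of 4 draw calls is an arbitrary assumption (the text gives no value), and the function name and boolean encoding of the flag are hypothetical.

```python
ASSUMED_MAX_DRAW_CALLS = 4  # illustrative threshold; not specified in the text


def rendering_suitable_for_test(draw_call_count: int,
                                uses_advanced_techniques: bool,
                                max_draw_calls: int = ASSUMED_MAX_DRAW_CALLS) -> bool:
    """Return True if the whole rendering should be tested for redundancy.
    Renderings using advanced techniques (e.g. MRTs), or with more draw
    calls than the threshold, are treated as too complex to benefit."""
    return (not uses_advanced_techniques) and draw_call_count <= max_draw_calls
```

When this returns False, the remainder of the characterization data need not be generated or stored, which is where the saving comes from.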

In further examples, the additional valid flag may be set on a per-tile basis. That is, for each tile in the rendering space, there is a valid flag, and the valid flag is set to a predetermined value to indicate whether the tile is suitable for the redundant rendering test. The data characterization unit 310 may set a per-tile valid flag for a given tile based on the number of processed primitives located within the tile such that if more than a predefined maximum number of processed primitives are located in the tile, the valid flag is set to indicate that the tile is not suitable for the redundant rendering test. This is because a larger number of primitives within a tile are indicators of complex scenes that are unlikely to benefit from redundant rendering testing, and may also require significant processing and storage of per-tile characterization data. In an example, it has been found that the predefined maximum number of processed primitives for a tile should be in the range of 16 to 64. In some examples, if the data characterization unit 310 determines that the valid flag for a particular tile should be set to indicate that the tile is not suitable for testing redundant rendering, then the remainder of the data in the data characterization output block 416 for the particular tile need not be generated and stored.
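A per-tile version of the flag can be sketched in the same way. The limit of 32 primitives is one assumed value inside the 16 to 64 range mentioned above, and the dictionary of primitive IDs per tile is an illustrative data shape.

```python
from typing import Dict, List

ASSUMED_MAX_PRIMITIVES_PER_TILE = 32  # one value within the stated 16 to 64 range


def per_tile_valid_flags(tile_primitives: Dict[int, List[int]],
                         max_primitives: int = ASSUMED_MAX_PRIMITIVES_PER_TILE
                         ) -> Dict[int, bool]:
    """Map each tile ID to True if the tile is suitable for the redundant
    rendering test, i.e. it holds no more than `max_primitives` primitives."""
    return {tile_id: len(prims) <= max_primitives
            for tile_id, prims in tile_primitives.items()}
```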

Although in the example of fig. 4 the per-tile valid flags are shown as part of the header, in other examples the valid flag for each tile may be stored in the per-tile characterization data 420 section of the data characterization output block 416, alongside the other data stored for that tile.

The header 418 may also include the transparent color of the rendering. As described above, the transparent color is a rendering range characteristic, and can be used in the redundancy detection described later.

The per-tile characterization data 420 includes a set of data specific to each tile in the rendering space. For example, FIG. 4 illustrates this as a sequence of data items for tile T0, followed by a sequence of data items for tile T1, and so on. However, in other examples, the data may be interleaved such that the value of a particular data item is listed for all tiles, then the value of another data item is listed for all tiles, and so on. As shown in the example of fig. 4, the data characterization unit 310 generates two sets of characterization data for each tile: a representation of per-tile vertex shader data and a representation of per-tile rendering data.

The per-tile vertex shader data identifies the one or more vertex shader programs that are used to generate the processed primitives located within the tile. The data characterization unit 310 generates this information from the vertex shader data 410, which maps primitive IDs to the identifiers of the associated vertex shaders, and from the primitive IDs in each tile given by the per-tile rendering data 414. Thus, the data characterization unit 310 can map the primitives located in each tile to the vertex shaders used by those primitives, to generate a list of the vertex shaders used in each tile. Note that in many cases multiple primitives may use the same vertex shader program, so the list of vertex shaders used to generate the processed primitives in a tile is expected to contain fewer entries than there are primitives within the tile. In one example, the representation of per-tile vertex shader data stored by the data characterization unit 310 may be a simple list of the vertex shader IDs used to generate the processed primitives in each tile. In another example, the representation of per-tile vertex shader data stored by the data characterization unit 310 may be a hash of the list of vertex shader IDs used to generate the processed primitives in each tile, as described in more detail below.
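The join between vertex shader data 410 and the primitive IDs in per-tile rendering data 414 can be sketched as below. The function name and the dictionary shapes are assumptions, and one shader per primitive is assumed as in fig. 4; sorting the result is an added choice so that the list compares equal regardless of primitive order.

```python
from typing import Dict, List


def per_tile_vertex_shaders(vertex_shader_data: Dict[int, int],
                            tile_primitives: Dict[int, List[int]]
                            ) -> Dict[int, List[int]]:
    """For each tile, collect the distinct IDs of the vertex shaders used
    by the primitives located in that tile."""
    return {
        tile_id: sorted({vertex_shader_data[prim_id] for prim_id in prims})
        for tile_id, prims in tile_primitives.items()
    }


# Three primitives in tile 0 share two shaders, so the list is shorter
# than the primitive list.
shaders = per_tile_vertex_shaders({0: 7, 1: 7, 2: 9, 3: 5},
                                  {0: [0, 1, 2], 1: [3]})
```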

The per-tile rendering data is data that may be used when rendering processed primitives within that tile in subsequent stages of the graphics pipeline. This is based on the per-tile rendering data 414 provided by the tiling unit 308. As shown in fig. 4, the per-tile rendering data may include vertex coordinates and vertex state data for each primitive located within the tile. Thus, this data accurately describes where the geometry within the tile is located (in terms of coordinates) and how it will be rendered to output (in terms of state data) in a subsequent pipeline stage. This enables an accurate comparison between the current tile and the previous tile during the redundancy test described later. The representation of per-tile rendering data stored by the data characterization unit 310 may be a direct copy of vertex coordinates and vertex state data from the per-tile rendering data 414. However, in other examples, the representation of per-tile rendering data stored by the data characterization unit 310 may be a hash of vertex coordinates and vertex state data from the per-tile rendering data 414.

It is noted that the exemplary arrangement of data shown in fig. 4 is merely an illustrative example, and that data may be structured in any suitable manner.

As described above, the representations of the per-tile vertex shader data and the per-tile rendering data may be hashed versions of the raw data. The benefit of using a hash is that it reduces storage requirements, because the hash is smaller than the original data and may in some examples be of a fixed size regardless of the size of the original data. The data characterization unit 310 may be configured to store either or both of the per-tile vertex shader data and the per-tile rendering data in the form of one or more hashed representations. The data characterization unit 310 may generate the hash value by implementing a hash function. There are many well-known hash functions, such as XOR-based functions, Cyclic Redundancy Check (CRC)-based functions, and more complex schemes such as MD5, SHA-1, and SHA-2.
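As an illustration of a CRC-based hash, the sketch below packs a list of vertex coordinates into bytes and hashes them with CRC-32 from Python's standard `zlib` module. The choice of CRC-32, the 32-bit little-endian float packing, and the function name are assumptions; the text does not say which function or encoding the data characterization unit uses.

```python
import struct
import zlib
from typing import Sequence


def hash_vertex_coords(coords: Sequence[float]) -> int:
    """Hash a flat list of coordinates to a fixed-size 32-bit value.
    The output size is the same however many coordinates are supplied."""
    payload = struct.pack(f"<{len(coords)}f", *coords)  # assumed encoding
    return zlib.crc32(payload)


triangle = [0.0, 0.0, 0.5, 16.0, 0.0, 0.5, 0.0, 16.0, 0.5]
h = hash_vertex_coords(triangle)
```

Storing `h` instead of the nine coordinates trades 36 bytes for 4, at the cost of a small chance that two different inputs share a hash.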

In one example, per-tile rendering data may be hashed on a per-primitive basis; that is, the vertex coordinates and/or state data for each primitive located within a tile may be used to generate a corresponding hash value for each primitive. In this case, the hash function implemented by the data characterization unit 310 may generate a hash value as a function of the vertex data of an individual primitive. Thus, in this case, for each tile, the data characterization unit stores in memory block 304₁ a number of hash values equal to the number of primitives determined to be located in the tile. Thus, the data characterization unit 310 may cause the hashed per-tile rendering data to be stored in the memory block 304₁, wherein the hashed per-tile rendering data includes a set of one or more hash values for each tile, where, for example, each hash value corresponds to a respective primitive located in the tile.

In another example, per-tile rendering data may be hashed on a per-tile basis. That is, a single hash value may be generated for the tile based on per-tile rendering data for all primitives located within the tile (i.e., based on all vertex coordinates and/or state data for all primitives located within the tile). Thus, in this case, the hash function implemented by the data characterization unit 310 may generate a single hash value as a function of the per-tile rendering data for all primitives located within the tile. Generating a single vertex hash value per tile has the advantage of further reducing the storage requirements of the vertex data for each tile.

In further examples, per-tile vertex shader data and per-tile rendering data may be hashed together on a per-tile basis. That is, a single hash value may be generated for the tile based on the per-tile vertex shader data and per-tile rendering data for all primitives located within the tile (i.e., based on all vertex shader data, all vertex coordinates, and/or state data for all primitives located within the tile). In yet further examples, per-tile vertex shader data and per-tile rendering data may be hashed separately on a per-tile basis, and hash values stored separately.
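The hashing granularities described above (per primitive, per tile, and combined with the vertex shader data) can be compared in a short sketch. CRC-32 via `zlib.crc32`, the byte-string stand-ins for each primitive's vertex data, and all function names are assumptions made for illustration.

```python
import struct
import zlib
from typing import List, Sequence


def per_primitive_hashes(prim_data: Sequence[bytes]) -> List[int]:
    """One hash per primitive: storage grows with the primitive count."""
    return [zlib.crc32(data) for data in prim_data]


def per_tile_hash(prim_data: Sequence[bytes]) -> int:
    """One hash per tile, computed as a running CRC over all primitives."""
    value = 0
    for data in prim_data:
        value = zlib.crc32(data, value)
    return value


def combined_tile_hash(shader_ids: Sequence[int],
                       prim_data: Sequence[bytes]) -> int:
    """One hash per tile covering vertex shader IDs and rendering data."""
    value = zlib.crc32(struct.pack(f"<{len(shader_ids)}I", *shader_ids))
    for data in prim_data:
        value = zlib.crc32(data, value)
    return value
```

The per-primitive form keeps more information (for example, which primitive changed) while the per-tile forms minimise storage, which mirrors the trade-off set out in the text.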

Although storing per-tile vertex shader data and/or per-tile rendering data in the form of one or more hash values does reduce the storage required in memory block 304₁, it still requires the data characterization unit 310 to perform multiple hash computations, consuming the processing resources of the graphics processing unit.

Reference is now made to fig. 6 and 7, which illustrate the data provided to and analyzed by test unit 312 in accordance with the "pre-geometry stage data comparison followed by post-geometry stage data comparison" technique, and a flowchart showing how this operates.

Fig. 6 illustrates the rendering stage functions in more detail, including test unit 312, rendering logic 314, fetch unit 316, and fragment processing logic 318 of fig. 3. In particular, FIG. 6 illustrates the data retrieved from memory and analyzed by the test unit 312 to determine whether the rendering of a particular tile is redundant. The test unit 312 is configured to retrieve from memory, in a multi-stage process, various data items related to the current and previous renderings of the tile, analyze them, and use the result to determine whether the current rendering is redundant. In other words, the test unit 312 determines whether the previously rendered output of the tile can be used as the output of the current rendering.

In one example, the fetch unit 316 is configured to wait for analysis to complete at the test unit 312 before fetching the rendering data (i.e., display list and associated primitive data) of the tile being rendered. If the test unit 312 determines that the rendering of the current tile is not redundant (i.e., needs to be rendered), the test unit 312 provides a signal to the fetch unit 316, which may begin fetching the rendering data of the tile and provide this to the fragment processing logic 318, which will perform fragment shading, texturing, etc. to render the tile and generate an output. Conversely, if the test unit 312 determines that the rendering of the current tile is redundant (i.e., does not need to be rendered), the test unit 312 provides a signal to the fetch unit 316 indicating that tile rendering data does not need to be fetched from memory. This information (which may be provided by the test unit 312 or the fetch unit 316) is used by the rendering logic 314 and causes the rendering logic to retain the previously rendered output of the tile in memory and use that data as the currently rendered output data of the tile.

In another example, to avoid the test unit 312 stalling the graphics pipeline, the fetch unit 316 is configured to begin fetching rendering data for the tile being rendered without waiting for the test unit 312 to complete the analysis. This may use more memory bandwidth, but improves performance. If the test unit 312 determines that the rendering of the current tile is not redundant (i.e., needs to be rendered), the fetch unit 316 continues to fetch the rendering data of the tile and provides this to the fragment processing logic 318, which will perform fragment shading, texturing, etc. to render the tile and generate an output. If the test unit 312 determines that the rendering of the current tile is redundant (i.e., does not need to be rendered), the test unit 312 provides a signal to the fetch unit 316 indicating that the fetching of tile rendering data should be stopped/interrupted because the data is no longer needed. As described above, this information (which may be provided by the test unit 312 or the fetch unit 316) is used by the rendering logic 314 and causes the rendering logic to retain the previously rendered output of the tile in memory and use this data as the currently rendered output data of the tile.

The operation of the test unit 312, and the sequence of redundancy data that the test unit uses to efficiently determine whether a tile rendering is redundant, will now be described with reference to FIG. 6 in conjunction with FIG. 7.

For the selected tiles that are being tested for redundancy prior to rendering, the test unit 312 begins retrieving data stored in memory by the data characterization unit 310 from the data characterization output block 416. However, because the data characterization output block 416 may be a large data block, the test unit 312 may retrieve the data in a multi-stage process in order to reduce memory bandwidth and power consumption. In particular, the multi-stage process aims to quickly and efficiently identify many non-redundant tile renderings without incurring significant memory bandwidth or computational costs.

In step 702, in a first stage, the test unit 312 retrieves data for the current rendering and the previous rendering from the header 418 of the data characterization output block 416. The retrieved header data may be in the form of rendering range data and per-tile validity data. For example, as shown in fig. 6, in this first stage, the test unit 312 retrieves a first data block 602 that includes one or more valid flags and the clear colors of the current and previous renderings. As described above, the valid flags may include either or both of a rendering range valid flag and a per-tile valid flag for the tile being tested.

In step 704, the test unit 312 determines whether the data retrieved in the first data block 602 indicates that it is valid to proceed with a more detailed per-tile comparison of the rendering data. In the example of fig. 6, the test unit 312 determines this by checking whether the relevant valid flags indicate that both the previous rendering data and the current rendering data are suitable for redundancy testing. For example, the test unit 312 may check that a rendering range valid flag is not set for either the current rendering or the previous rendering (e.g., indicating that neither rendering uses complex rendering features or includes too many draw calls). The test unit 312 may also check whether a per-tile valid flag for the tile is set for either the current rendering or the previous rendering (e.g., indicating that too many primitives are located in the tile). In the example of fig. 6, the test unit 312 further determines whether the clear colors of the current rendering and the previous rendering match by comparing the two values.

If any valid flag indicates that the tile is not valid for redundancy testing, or the clear colors do not match, this indicates that the test unit should not make further comparisons because the tile is not redundant or not suitable for redundancy testing. In this case, in step 706, the tile is rendered as normal by the rendering phase of the GPU. Importantly, this decision can eliminate many unsuitable tiles from the redundancy test process, where only a very small amount of data is retrieved, and with very little processing overhead. Thus, this does not significantly degrade the performance of the GPU.

If the rendering range data and per-tile validity data indicate that it is valid to proceed with a more detailed per-tile comparison to detect redundancy, the process moves to a second stage in step 708, where the test unit retrieves a first portion of the per-tile characterization data 420 from the data characterization output block 416. In step 708, the test unit 312 retrieves the vertex shader data of the tile for both the current rendering and the previous rendering. For example, as shown in FIG. 6, if the tile being tested is T0, then a second data block 604 is retrieved that includes the T0 vertex shader data for both the current rendering and the previous rendering. As described above, per-tile vertex shader data includes information about the identity of the vertex shaders used in the geometry phase to process the primitives located in the tile being tested. This is typically a small amount of data (at least for tiles that have passed the first stage, which do not include too many primitives), and may be in the form of hashed data in some examples.

In step 710, the test unit 312 compares the retrieved data to determine if the per-tile vertex shader data from the current rendering and the previous rendering match. If the per-tile vertex shader data from the current and previous renderings do not match, this indicates that the geometry in the current and previous rendered tiles is generated in a different manner (with different shaders), and thus that the tiles are highly unlikely to be redundant. In this case, the process moves to step 706 and the tile is rendered as normal in the rendering phase. Notably, this is a decision that can use a small amount of data and a simple logical comparison to identify tiles that are highly likely not redundant, and thus without significant bandwidth, power, or processing overhead.

If the per-tile vertex shader data from the current rendering and the previous rendering match, the process moves to the third stage. The third stage is a comprehensive comparison of the redundant data to confirm whether the tiles are redundant using the second portion of the per-tile characterization data 420 from the data characterization output block 416. In step 712, the test unit 312 retrieves vertex coordinates and state data for the currently and previously rendered tiles. For example, as shown in FIG. 6, if the tile being tested is T0, then a third data block 606 is retrieved that includes T0 vertex coordinates and state data for both the current rendering and the previous rendering. As described above, the vertex coordinates and state data of a tile provide information about where the vertices of primitives in the tile are located and how the primitives are to be rendered at the rendering stage. Thus, this enables an accurate comparison of whether the content of a tile is indeed the same as previously rendered content.

In step 714, the test unit 312 compares the retrieved data to determine if the per-tile vertex coordinates and state data from the current and previous renderings match. If the per-tile vertex coordinates and state data do not match, then the tile content in the current rendering and the previous rendering is not the same, and thus the tile is not redundant. In this case, in step 706, the tile is rendered as normal in the rendering phase. If the per-tile vertex coordinates and state data do match, then in step 716 the tile is determined to be redundant and the rendering of the tile may be skipped (or at least a portion of the processing avoided) in the rendering phase. In this case, the previously rendered output may be used as the currently rendered output, as described above.
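
The three-stage test of steps 702 to 716 can be sketched as follows. This is a minimal illustration in Python, not the hardware implementation: the `fetch` callback, the stage names, and the record fields are hypothetical, and the valid flags are modelled as a single boolean that is true when the rendering and tile are suitable for testing. The point shown is that each later, larger block of data is fetched only if the earlier comparison passes.

```python
def tile_is_redundant(fetch):
    """Return True if the tile may reuse its previously rendered output.

    `fetch(stage)` is a hypothetical callback returning a (current, previous)
    pair of records for the named stage; it stands in for a memory read.
    """
    # Stage 1 (steps 702-704): header data, i.e. valid flags and clear colors.
    cur, prev = fetch("header")
    if not (cur["valid"] and prev["valid"]):
        return False  # step 706: render as normal
    if cur["clear_color"] != prev["clear_color"]:
        return False  # step 706

    # Stage 2 (steps 708-710): per-tile vertex shader data (possibly a hash).
    cur, prev = fetch("vertex_shader")
    if cur != prev:
        return False  # step 706

    # Stage 3 (steps 712-714): per-tile vertex coordinates and state data.
    cur, prev = fetch("coords_and_state")
    return cur == prev  # step 716 when True, step 706 otherwise


def make_fetch(current, previous, log):
    """Build a fetch callback over two in-memory records, logging each read."""
    def fetch(stage):
        log.append(stage)
        return current[stage], previous[stage]
    return fetch


previous = {
    "header": {"valid": True, "clear_color": (0, 0, 0, 255)},
    "vertex_shader": 0x1234,      # illustrative hash values
    "coords_and_state": 0xABCD,
}
same = dict(previous)
different_shader = dict(previous, vertex_shader=0x9999)

log_same, log_diff = [], []
redundant = tile_is_redundant(make_fetch(same, previous, log_same))
not_redundant = tile_is_redundant(make_fetch(different_shader, previous, log_diff))
```

In this sketch the tile with a different vertex shader is rejected after two reads, so the third and largest block is never fetched for it.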

As described above, in some examples, the retrieved per-tile vertex coordinates and state data may be in the form of hash data. The test unit 312 may require that one or more hash values of the tiles match exactly to determine that the currently rendered tile is redundant. If the information characterizing the primitive content of a tile is in the form of a plurality of hash values, each of these hash values may have to be matched with a corresponding hash value of the tile stored for the previous rendering in order for the test unit 312 to determine that the primitive content matches.

The amount of data retrieved for the vertex coordinates and state data of the tile is greater than in the other stages of the redundancy test. However, an accurate decision on redundancy cannot be made without comparing this data. The impact on GPU performance and power consumption is mitigated by the multi-stage test procedure described above. By eliminating unsuitable or non-redundant tiles from the test process early, using small amounts of redundancy data and simple comparisons, retrieval of the larger amounts of data required for an accurate comparison is minimized.

Primitive block comparison

A second technique "primitive block comparison" is now described with reference to fig. 8 and 9, which illustrate the process of generating and storing characterization data for the current rendering. The redundancy test procedure of this technique is then described with reference to fig. 10 and 11.

Fig. 8 shows geometry processing logic 306, tiling unit 308, and data characterization unit 310 of fig. 3 in more detail. In particular, fig. 8 shows data provided to and generated by the data characterization unit 310 and stored in memory. Geometry processing logic 306 is shown to include vertex shader unit 402 and vertex post-processing unit 404 (as described above with reference to fig. 4). Geometry processing logic 306 also includes primitive block generator 802. Vertex shader unit 402 receives primitive data to be processed. The vertex shader unit may execute one or more vertex shaders on the primitive data. The vertex shader unit may, for example, operate to perform one or more geometric transformations to transform primitive data from model space to screen space. It may also perform lighting and/or shading operations, or programmatically alter the primitive data in any suitable manner. The transformed vertex data is then output from vertex shader unit 402 to vertex post-processing unit 404. The vertex post-processing unit performs a number of operations on the transformed primitive data to generate processed primitives, including clipping, projection, and culling in this example.

The processed primitives output from the vertex post-processing unit 404 are input into primitive block generator 802. Primitive block generator 802 operates to group the processed primitives into one or more sets and to generate a primitive block from each set, thereby forming one or more primitive blocks. A primitive block is a data structure generated for storage in memory that contains the data of a set of primitives and can be accessed by a later stage of the pipeline when the primitive data is needed. The data may be, for example, vertex data (e.g., the screen space coordinates of the vertices and the vertex varyings for each primitive in the set for the primitive block). The primitive block may also contain an index for each primitive within the block (e.g., the primitive ID for each primitive in the block).

In one example, grouping primitives into sets to form primitive blocks may be performed based on common state data. That is, grouping primitives may include identifying primitives that have common state data and storing the vertex coordinates and vertex varyings of the identified primitives in a primitive block in association with the common state data. By grouping primitives by common state data, such common state data may be stored once within a primitive block rather than individually for each primitive. This reduces the amount of data that needs to be stored, and thus saves memory read/write bandwidth. This saving may be significant, particularly because many objects in a scene may be formed from hundreds of primitives, all of which may share a common state.
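
Grouping by common state can be illustrated with a short Python sketch. The function name, the tuple layout of the input, and the dictionary fields are hypothetical; the sketch also ignores any capacity limit on a primitive block and simply places all primitives that share a state in the same block, so that the state is stored once per block.

```python
def build_primitive_blocks(primitives):
    """Group (prim_id, common_state, coords, attrs) tuples by common_state.

    `attrs` stands for the per-vertex attribute data stored with each primitive.
    """
    blocks = {}  # common_state -> block; dicts preserve insertion order
    for prim_id, common_state, coords, attrs in primitives:
        block = blocks.setdefault(common_state, {
            "common_state": common_state,  # stored once per block
            "primitive_ids": [],
            "vertex_coords": [],
            "vertex_attrs": [],
        })
        block["primitive_ids"].append(prim_id)
        block["vertex_coords"].append(coords)
        block["vertex_attrs"].append(attrs)
    # Assign a primitive block ID in order of first appearance.
    return [dict(block, primitive_block_id=i)
            for i, block in enumerate(blocks.values())]


tri = ((0.0, 0.0, 0.5), (1.0, 0.0, 0.5), (0.0, 1.0, 0.5))
primitives = [
    (0, "shader_A", tri, ("red",) * 3),
    (1, "shader_B", tri, ("blue",) * 3),
    (2, "shader_A", tri, ("green",) * 3),
]
blocks = build_primitive_blocks(primitives)
```

Here primitives 0 and 2 share a state and land in the same block, so three primitives yield two primitive blocks.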

Each primitive block itself may be associated with a primitive block ID (e.g., so that different primitive blocks can be distinguished from one another). Primitive data for all processed primitives generated by vertex post-processing unit 404 may be stored within one or more primitive blocks. That is, the one or more primitive blocks generated by primitive block generator 802 may contain all of the processed primitives generated by vertex post-processing unit 404. Primitive blocks generated by primitive block generator 802 are input into tiling unit 308 and written to external memory block 304₂.

The operation of the system of fig. 8 will now be described with reference to the flowchart of fig. 9. In step 902, geometry data and associated state data for a rendering are received at geometry processing logic 306 within GPU 302. The geometry data for the rendering includes a plurality of primitives that describe the surfaces of the geometry items to be rendered. The primitive data may include vertex data for one or more input primitives. Each primitive is associated with state data that describes how the primitive should be rendered through the graphics pipeline. For example, the state data may include information such as the vertex shader programs to be applied to primitives at the geometry processing stage, and rendering range data such as the number and/or identity of the draw calls of the rendering, information regarding whether any advanced rendering techniques are used, such as Multiple Render Targets (MRTs), and the clear color of the rendering. The state data may also include data related to how the primitives are processed later in the pipeline (e.g., at the rendering stage), such as fragment shaders, vertex varyings, texture information, etc. (where "varyings" are attributes associated with each vertex, including, for example, color data, normal data, texture coordinates, or any other data that may be used as part of the rasterization process). As described above, the geometry data may be submitted by a driver running on the host CPU. In some examples, the data may be submitted directly to the GPU; in other examples, some data may be written to memory and a reference to that memory submitted to the GPU (optionally along with other data).

In step 904, geometry processing logic 306 processes the rendered geometry data to generate one or more processed primitives. In particular, geometry processing logic 306 processes primitives using one or more vertex shader programs associated with the plurality of primitives. Generating one or more processed primitives using one or more vertex shader programs may include executing a vertex shader program on data of the associated primitives and/or their associated vertices that programmatically changes or manipulates the primitives (e.g., transforms the primitives, illuminates the primitives, moves the primitives, rotates the primitives, deforms the primitives, replicates the primitives, or changes the primitives or their associated attributes in any other manner). The processed primitives may then be further processed by vertex post-processing unit 404 (or any other additional geometric stage processing blocks not shown in FIG. 8, such as hull shaders, tessellation, and domain shaders). The processed primitives are then provided to primitive block generator 802.

In step 906, the rendering range state data is provided to the data characterization unit 310. As detailed above, the term "rendering range" is intended to refer to data that applies to the rendering as a whole, e.g., to all primitives of the rendering. This is to be distinguished from "per-tile" data, which applies only to a particular tile or to the primitives within that tile. An example of rendering range state data is shown in the pre-geometry stage data block 804 in fig. 8. The pre-geometry stage data block 804 includes data regarding the clear color of the rendering, a count of the number of draw calls in the rendering, and a flag indicating whether advanced rendering techniques (such as MRTs) are used in the rendering.

Note that in some examples, the pre-geometry stage data block 804 is not provided by the geometry processing logic 306 to the data characterization unit 310, but may be provided by an earlier unit (not shown) in the graphics processing unit or directly by a driver. This is illustrated by the dashed line in fig. 9. Alternatively, a portion of the pre-geometry stage data block 804 may be provided by the geometry processing logic 306, while another portion is provided by a portion of the GPU that precedes the geometry processing logic 306 in the graphics pipeline.

In step 908, primitive block generator 802 generates one or more primitive blocks that contain the primitive data of the rendering. As described above, primitive block generator 802 may do so by identifying primitives that have common state data and grouping the primitives such that their vertex data is stored in the primitive block in association with the common state data. In step 910, primitive block generator 802 provides the generated primitive blocks to data characterization unit 310. Fig. 8 shows an exemplary primitive block 806 provided from the primitive block generator 802 to the data characterization unit 310. Primitive block 806 includes a primitive block identifier (denoted "primitive block ID") followed by a data block associated with the primitive block. The data block associated with the primitive block may include: a list identifying the primitives in the primitive block (denoted "primitive ID"), state data common to all of these primitives (denoted "common state"), a set of vertex coordinates (denoted "vertex coordinates"), and a set of vertex varying data (denoted "vertex varyings").

In an example of a tile-based graphics pipeline that includes geometry processing and rendering stages, the common state data may be data associated with the rendering stage of the pipeline (because the geometry processing stage has already been completed for the primitives). The common state data may, for example, include an indication of which shaders to execute to render the primitives in the primitive block at the rendering stage. The common state data may include a shader ID and/or an indication of the shader resources of a shader to be executed to render the primitives in the primitive block. The vertex coordinates list the coordinates (e.g., x, y, and z coordinates) of each vertex of each primitive in the primitive block. If the primitive is a triangle, there are three vertices, each with x, y, z coordinates, e.g., as represented in FIG. 8 as xyz_P0V0 for the x, y, z coordinates of vertex 0 in primitive 0, xyz_P0V1 for the x, y, z coordinates of vertex 1 in primitive 0, xyz_P0V2 for the x, y, z coordinates of vertex 2 in primitive 0, and so on. The vertex varyings list the varying data for each vertex of each primitive in the primitive block. Vertex varyings may be considered state data that varies per vertex and thus cannot be separated out into the common state data. If the primitive is a triangle, there are three vertices, each with associated varying data, e.g., as represented in FIG. 8 as sd_P0V0 for the varying data of vertex 0 in primitive 0, sd_P0V1 for the varying data of vertex 1 in primitive 0, sd_P0V2 for the varying data of vertex 2 in primitive 0, and so on.

In step 912, the tiling unit 308 determines which of the processed primitives from the geometry processing logic 306 are located within each tile of the plurality of tiles. As used herein, the term "located", as it relates to primitives and tiles, means "at least partially located", i.e., the primitive intersects or overlaps the tile. Thus, a primitive located within a tile may be located partially or entirely within the tile.

In step 914, the tiling unit 308 provides an indication of which primitive blocks are associated with each tile to the data characterization unit 310. In other words, the tiling unit 308 indicates, for each tile, which primitive blocks contain at least one primitive that is located within that tile. The indication may be in the form of a per-tile primitive block list indicating which of the one or more primitive blocks contain at least one primitive located within the tile. Fig. 8 shows an exemplary primitive block list 808 generated by tiling unit 308 and provided to data characterization unit 310. The primitive block list 808 is in the form of a bitmask (i.e., a string of bits) for each tile, where the position of a bit within the bitmask indicates the identity of the primitive block and the value of the bit (one or zero) indicates whether the primitive block contains a primitive located within the tile. For example, using the illustration of FIG. 8, the primitive block list 808 may include a first bitmask associated with tile 0 (T0 in FIG. 8). The first bitmask has a zeroth bit (counting from zero) set to 0, indicating that primitive block 0 (PB 0 in FIG. 8) does not contain primitives located within tile 0. The first bitmask then has a first bit set to 1, indicating that primitive block 1 (PB 1 in FIG. 8) contains at least one primitive located within tile 0. The first bitmask then has a second bit set to 0, indicating that primitive block 2 (PB 2 in FIG. 8) does not contain primitives located within tile 0. The bitmask may contain additional bits corresponding to additional primitive blocks, and the length of the bitmask may thus equal the number of primitive blocks generated by primitive block generator 802. In some examples (as described below), there may be a predefined limit on the number of primitive blocks in a rendering beyond which redundancy testing will not be performed, in which case the length of the bitmask may be correspondingly limited. The primitive block list 808 may then further contain a second bitmask associated with tile 1 (T1 in FIG. 8). The second bitmask has a zeroth bit set to 1, indicating that primitive block 0 (PB 0 in FIG. 8) contains primitives located within tile 1. Additional bits may be present in each bitmask, and additional bitmasks may be present for additional tiles.

The benefit of using a bitmask such as that described above is that it is a very compact and efficient data structure. It requires very little memory space, and its data can be accessed and interpreted quickly and efficiently. Since tiling unit 308 has already determined which primitives are located in which tiles, and knows where the primitives are stored, building the bitmask simply involves setting a bit for each tile in which a primitive is found to be located, and thus requires minimal overhead at tiling unit 308.
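
The per-tile bitmask can be sketched in Python as follows, reproducing the T0 and T1 example of FIG. 8. The function names and the input format (for each primitive block, the set of tiles it touches) are assumptions made for illustration; bit position `b` of a tile's mask corresponds to primitive block `b`, counting from zero.

```python
def build_tile_bitmasks(num_tiles, tiles_per_block):
    """Return one integer bitmask per tile.

    tiles_per_block[b] is the set of tile indices in which primitive block b
    has at least one primitive (hypothetical input from the tiling step).
    """
    masks = [0] * num_tiles
    for block_id, tiles in enumerate(tiles_per_block):
        for tile in tiles:
            masks[tile] |= 1 << block_id  # set the bit for this primitive block
    return masks


def blocks_in_tile(mask):
    """Decode a tile's bitmask back into a list of primitive block IDs."""
    return [b for b in range(mask.bit_length()) if (mask >> b) & 1]


# PB0 has a primitive in T1, PB1 has one in T0, PB2 touches neither tile.
masks = build_tile_bitmasks(2, [{1}, {0}, set()])
t0_blocks = blocks_in_tile(masks[0])
t1_blocks = blocks_in_tile(masks[1])
```

With these inputs the mask for T0 has only bit 1 set and the mask for T1 has only bit 0 set, matching the example above.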

In step 916, the data characterization unit 310 generates and stores primitive block characterization data and per-tile primitive block usage data in memory. In some examples, the data characterization unit 310 also generates and stores in memory rendering range redundancy data that indicates one or more characteristics of the rendering that are useful in the redundancy detection described later. The data characterization unit may cause this information to be stored in the external memory block 304₁. Alternatively (or additionally), some or all of this information may be stored locally to the graphics processor, for example in registers or in a cache memory in the data characterization unit. This information will be used to compare the current rendering with the previous rendering to determine whether part or all of the current rendering is redundant (as will be described in more detail below).

Fig. 8 shows an exemplary data characterization output block 810 generated by the data characterization unit 310. The data characterization output block 810 includes a header 812, primitive block data 814, and a per-block primitive block list 816. The header 812 includes one or more valid flags generated by the data characterization unit 310 that may be used to indicate whether the current rendering is suitable for testing redundant rendering. This may save processing at the test stage, as described below. In one example, a valid flag may be set to indicate that the entire current rendering is unsuitable for testing redundant rendering, and thus all associated redundancy tests may be skipped. The valid flag may be set to a predetermined value by the data characterization unit 310 based on the rendering range state data 804. For example, if the render range state data 804 indicates that the rendering is part of a scene using multiple render targets (or other advanced rendering techniques), or that the rendering includes more draw calls than a threshold number, the valid flag may be set to a predetermined value. These are indications of complex renderings that are unlikely to benefit from redundant rendering testing. In some examples, if the data characterization unit 310 determines that the valid flag should be set to indicate that the entire current rendering is not suitable for testing redundant rendering, then the remainder of the data in the data characterization output block 810 need not be generated and stored.
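
A minimal Python sketch of how the rendering range valid flag might be derived is given below. The threshold value, the function names, and the dictionary layout are all hypothetical (the text gives no threshold), and the flag is modelled as a boolean that is true when the rendering is unsuitable for redundancy testing. The sketch also shows the remaining characterization data being generated only for suitable renderings.

```python
MAX_DRAW_CALLS = 16  # hypothetical threshold; the text does not give a value


def rendering_unsuitable(uses_mrt, draw_call_count, max_draw_calls=MAX_DRAW_CALLS):
    """True when the whole rendering should be excluded from redundancy testing."""
    return uses_mrt or draw_call_count > max_draw_calls


def characterize(uses_mrt, draw_call_count, clear_color, generate_rest):
    """Build an output block; `generate_rest` is a hypothetical callable that
    produces the primitive block data and per-tile lists."""
    unsuitable = rendering_unsuitable(uses_mrt, draw_call_count)
    block = {"header": {"unsuitable": unsuitable, "clear_color": clear_color}}
    if not unsuitable:
        # Only suitable renderings need the rest of the block.
        block["rest"] = generate_rest()
    return block


calls = []
simple = characterize(False, 4, (0, 0, 0, 255), lambda: calls.append("gen") or "data")
complex_render = characterize(True, 4, (0, 0, 0, 255), lambda: calls.append("gen") or "data")
```

In this sketch the generator runs once, for the simple rendering only; the rendering that uses MRTs stores just its header.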

Header 812 may also include the number of primitive blocks generated by primitive block generator 802. This data is useful for several reasons. It may be used as part of the testing process to eliminate renderings that are too complex to benefit from redundancy testing (as outlined below in connection with fig. 10 and 11). It is also useful for interpreting the rest of the data characterization output block 810, as the number of primitive blocks generated depends on the scene being rendered and is not known in advance. Thus, providing the number of primitive blocks in the header ensures that it can be determined how large the primitive block data 814 is, and how many bits are in each per-tile primitive block list 816. The header 812 may also include the clear color of the rendering. As described above, the clear color is a rendering range characteristic, and can be used in the redundancy detection described later.
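
As a rough illustration of why the primitive block count helps in interpreting the output block, the sketch below derives the size of each section from the count. It assumes, purely for illustration, that each primitive block is characterized by a fixed-size entry (for example a hash) and that each per-tile bitmask is padded to a whole number of bytes; neither assumption comes from the text, and the function name is hypothetical.

```python
def layout_sizes(num_primitive_blocks, num_tiles, bytes_per_block_entry):
    """Derive section sizes of a hypothetical packed output block."""
    bits_per_tile = num_primitive_blocks           # one bit per primitive block
    bytes_per_tile = (bits_per_tile + 7) // 8      # assumed byte padding
    return {
        "block_data_bytes": num_primitive_blocks * bytes_per_block_entry,
        "bits_per_tile": bits_per_tile,
        "tile_list_bytes": num_tiles * bytes_per_tile,
    }


sizes = layout_sizes(num_primitive_blocks=10, num_tiles=4, bytes_per_block_entry=4)
```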

Primitive block data 814 comprises data that characterizes the contents of one or more primitive blocks. For example, fig. 8 shows this with the data of the primitive block PB0, followed by the data of PB1, and the like. In one example, primitive block data 814 may be a direct copy of primitive block 806 generated by primitive block generator 802. In another example, primitive block data 814 may be a hash of primitive block 806 generated by primitive block generator 802, as described in more detail below.

The per-tile primitive block list 816 is data that indicates which of the one or more primitive blocks contain primitives that are located within the tile. This is based on the primitive block list 808 provided by tiling unit 308. The per-tile primitive block list 816 may include, for example, a bitmask for each tile indicating which of the one or more primitive blocks contain at least one primitive located within the tile, wherein the position of a bit within the bitmask indicates the identity of the primitive block and the value of the bit (one or zero) indicates whether the primitive block contains a primitive located within the tile. The per-tile primitive block list 816 stored by the data characterization unit 310 may be a direct copy of the primitive block list 808 from the tiling unit 308.

It is noted that the exemplary arrangement of data shown in fig. 8 is merely an illustrative example, and that data may be structured in any suitable manner.

As described above, primitive block data 814 may be a hashed version of the original data. The benefit of using a hash is that it reduces storage requirements because the hash is smaller than the original data and may be of a fixed size in some examples, regardless of the size of the original data. The data characterization unit 310 may be configured to store the primitive block data 814 in the form of one or more hashed representations. The data characterization unit 310 may generate the hash value by implementing a hash function. There are many well-known hash functions, such as XOR-based functions, Cyclic Redundancy Check (CRC)-based functions, and more complex schemes, such as MD5, SHA-1, and SHA-2.

Primitive block data 814 may be hashed on a per primitive block basis; i.e., each primitive block is hashed separately. Although storing primitive block data 814 in the form of one or more hash values does reduce the storage requirements in memory block 304₁, it still requires the data characterization unit 310 to perform a hash calculation for each primitive block, consuming the processing resources of the graphics processing unit.
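
Per-block hashing can be sketched in Python using the standard library's CRC-32, one of the CRC-based options mentioned above. The function name, the byte serialization (little-endian 32-bit floats for coordinates), and the choice of which fields are hashed are assumptions for illustration only; a hardware implementation would hash whatever representation the primitive block actually uses.

```python
import struct
import zlib


def hash_primitive_block(common_state, vertex_coords, vertex_attrs):
    """Return a 32-bit CRC over one primitive block's contents.

    common_state and vertex_attrs are bytes; vertex_coords is a list of
    (x, y, z) float tuples. The serialization here is hypothetical.
    """
    payload = bytearray(common_state)
    for x, y, z in vertex_coords:
        payload += struct.pack("<3f", x, y, z)  # 3 little-endian floats
    payload += vertex_attrs
    return zlib.crc32(payload)


coords = [(0.0, 0.0, 0.5), (1.0, 0.0, 0.5), (0.0, 1.0, 0.5)]
moved = [(0.25, 0.0, 0.5), (1.0, 0.0, 0.5), (0.0, 1.0, 0.5)]

h_prev = hash_primitive_block(b"shader_A", coords, b"\x10\x20\x30")
h_same = hash_primitive_block(b"shader_A", coords, b"\x10\x20\x30")
h_moved = hash_primitive_block(b"shader_A", moved, b"\x10\x20\x30")
```

Identical block contents give identical hashes, while moving a single vertex changes the hash, so comparing the fixed-size values stands in for comparing the full primitive blocks (subject to the usual possibility of hash collisions for larger differences).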

Referring now to fig. 10 and 11, these figures illustrate data provided to and analyzed by test unit 312 in accordance with the "primitive block comparison" technique, and a flow chart showing how this operates.

Fig. 10 illustrates in more detail the rendering stage functions, including the test unit 312, rendering logic 314, fetch unit 316, and fragment processing logic 318 of fig. 3. In particular, FIG. 10 illustrates the data retrieved from memory and analyzed by the test unit 312 to determine whether rendering of a particular tile is redundant. The test unit 312 is configured to retrieve from memory, in a multi-stage process, various data items relating to the current and previous renderings of the tile, and to analyze them to determine whether the current rendering is redundant. In other words, the test unit 312 determines whether the previously rendered output of the tile can be used as the output of the current rendering.

The operation of the test unit 312, and the sequence of redundancy data that the test unit uses to efficiently determine whether a tile rendering is redundant, will now be described with reference to FIG. 10 in conjunction with FIG. 11.

For the selected tiles that are being tested for redundancy prior to rendering, the test unit 312 begins retrieving data stored in memory by the data characterization unit 310 from the data characterization output block 810. However, because the data characterization output block 810 may be a large data block, the test unit 312 may retrieve data in a multi-stage process in order to reduce memory bandwidth and power consumption. In particular, the multi-stage process aims to quickly and efficiently eliminate many non-redundant tile renderings without incurring significant memory bandwidth or computational costs.

In step 1102, in a first stage, test unit 312 retrieves data for the current rendering and the previous rendering from the header 812 of the data characterization output block 810. The retrieved header data may be in the form of rendering range data. For example, as shown in fig. 10, in this first stage, test unit 312 retrieves a first data block 1002 that includes one or more valid flags, the number of primitive blocks, and the clear colors of the current and previous renderings. As described above, the valid flag may include a rendering range valid flag.

In step 1104, the test unit 312 determines whether the data retrieved in the first data block 1002 indicates that it is valid to proceed with a more detailed per-tile comparison of the rendering data. In the example of fig. 10, the test unit 312 determines this by checking whether the valid flag indicates that both the previous rendering data and the current rendering data are suitable for redundancy testing. For example, test unit 312 may check that a rendering range valid flag is not set for either the current rendering or the previous rendering (e.g., indicating that neither rendering uses complex rendering features or includes too many draw calls).

The test unit 312 may also use the number of primitive blocks in the current rendering and in the previous rendering to determine whether it is appropriate to continue redundancy testing the tile. For example, test unit 312 may compare the number of primitive blocks in the current and previous renderings to a predefined limit. If the number of primitive blocks in the current and previous renderings exceeds this limit, this may indicate that the rendering contains a large number of primitives, and hence a complex scene that is unlikely to be suitable for redundancy testing. In some examples, the predefined limit for the number of primitive blocks is between 64 and 128. Note that in an alternative example, the data characterization unit 310, rather than the test unit 312, may compare the number of primitive blocks to the predefined limit and set a valid flag accordingly. In the example of fig. 10, the test unit 312 further determines whether the clear colors of the current rendering and the previous rendering match by comparing the two values.

If any valid flag indicates that the tile is not valid for redundancy testing, the number of primitive blocks exceeds the limit, or the transparent colors do not match, the test unit should not make further comparisons, because the tile is either not redundant or not suitable for redundancy testing. In this case, in step 1106, the tile is rendered as normal by the rendering phase of the GPU. Importantly, this decision can eliminate many unsuitable tiles from the redundancy test process while retrieving only a very small amount of data and incurring very little processing overhead. Thus, it does not significantly degrade the performance of the GPU.
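By way of illustration only, the first-stage check described above might be sketched as follows. The sketch is written in Python; the `RenderHeader` type, its field names, the polarity of the `valid` field (True meaning "suitable for redundancy testing"), the choice to reject when either rendering exceeds the limit, and the limit value of 64 are all assumptions made for this example and are not taken from the described system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RenderHeader:
    """Hypothetical stand-in for the header data in the first data block."""
    valid: bool                # assumption: True = suitable for redundancy testing
    num_primitive_blocks: int  # number of primitive blocks in the rendering
    clear_color: tuple         # the "transparent color", e.g. an RGBA tuple


# Assumed value; the text only says the limit may lie between 64 and 128.
PRIMITIVE_BLOCK_LIMIT = 64


def first_stage_passes(current, previous, limit=PRIMITIVE_BLOCK_LIMIT):
    """Return True if the more detailed per-tile comparison may proceed."""
    # Both renderings must be flagged as suitable for redundancy testing.
    if not (current.valid and previous.valid):
        return False
    # Assumption: reject if either rendering has too many primitive blocks.
    if current.num_primitive_blocks > limit or previous.num_primitive_blocks > limit:
        return False
    # The transparent colors of the two renderings must match.
    return current.clear_color == previous.clear_color
```

When the function returns False, the tile would simply be rendered as normal (step 1106); only a handful of header fields are read to reach that decision.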

If the validity data indicates that it is valid to proceed with a more detailed per-tile comparison to detect redundancy, then in one example the process moves to a second stage in step 1108, where the test unit retrieves the per-tile primitive block list 816 of the tile being tested from the data representation output block 810. For example, in step 1108, the test unit 312 retrieves the primitive block bitmasks of the tile for the current and previous renderings. As shown in fig. 10, if the tile being tested is T0, then a second data block 1004 is retrieved that includes the T0 primitive block bitmask for both the current rendering and the previous rendering. As described above, the per-tile primitive block bitmask includes information identifying which primitive blocks contain primitives located in the tile being tested. As mentioned above, this is typically a small amount of data (especially for renderings that have passed the first stage and therefore do not include too many primitive blocks).

In step 1110, the test unit 312 compares the retrieved data to determine whether the per-tile primitive block lists from the current rendering and the previous rendering match. If the per-tile primitive block lists from the current and previous renderings do not match, this indicates that the primitives in the tile are in different primitive blocks in the current and previous renderings, and thus that the tile is unlikely to be redundant. In this case, the process moves to step 1106 and the tile is rendered as normal in the rendering phase. Notably, this decision uses only a small amount of data and a simple logical comparison to identify tiles that are unlikely to be redundant, and therefore incurs no significant bandwidth, power, or processing overhead.
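As a minimal sketch of the second-stage comparison, a per-tile primitive block bitmask can be modelled as an integer in which bit j is set when primitive block j contains a primitive located in the tile. The function names and the use of Python integers as bitmasks are assumptions for illustration; the hardware representation is not specified here.

```python
def make_bitmask(block_indices):
    """Build a per-tile bitmask with bit j set for each primitive block j."""
    mask = 0
    for j in block_indices:
        mask |= 1 << j
    return mask


def bitmasks_match(current_mask, previous_mask):
    """Second-stage test: a single equality comparison of the two bitmasks."""
    return current_mask == previous_mask


# Hypothetical tile T0 referencing primitive blocks 1 and 3 in both renderings.
current_t0 = make_bitmask([1, 3])
previous_t0 = make_bitmask([3, 1])
```

Because the whole test is one equality comparison on a small value, a mismatch can send the tile to normal rendering without any primitive block data being fetched.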

Note that in some alternative examples, the comparison of bitmasks may be omitted, and the process may instead move directly from the first stage to the third stage (outlined below). While this does not save any memory bandwidth (because the bitmask is still used in the third stage), it may reduce the time taken to complete the analysis, and it does not require the current and previous renderings to have the same bitmask structure (e.g., the bitmasks may have different lengths). It may also accommodate the situation in which only a portion of the scene changes between renderings, such that primitive blocks are assembled differently, but some tiles may still be redundant.

If the per-tile primitive block lists from the current rendering and the previous rendering match, the process moves to the third stage. The third stage is a comprehensive comparison that uses the primitive block data 814 from the data representation output block 810 to confirm whether the tile is redundant. In step 1112, the test unit 312 uses the current primitive block list and the previous primitive block list of the tile to retrieve the indicated primitive block data 814 for the tile in the current and previous renderings. For example, as shown in fig. 10, if the tile being tested is T0, and both the current primitive block list and the previous primitive block list of T0 indicate that primitive block 1 contains a primitive located in that tile, then a third data block 1006 is retrieved, which includes the primitive block data of PB 1 for both the current and previous renderings. As described above, a primitive block contains information about where the vertices of the primitives in the primitive block are located and how they are to be rendered in the rendering phase. Thus, this enables an accurate comparison of whether the content of a tile is indeed the same as the previously rendered content.

In step 1114, the test unit 312 compares the retrieved data to determine whether the indicated primitive block data for the tile from the current rendering and the previous rendering match. If the indicated primitive block data of the tile does not match, then there is no guarantee that the tile contents in the current rendering and the previous rendering are the same, and therefore the tile is not considered redundant. In this case, in step 1106, the tile is rendered as normal in the rendering phase. If the indicated primitive block data of the tile matches, then in step 1116, the tile is determined to be redundant and the rendering of the tile (or at least a portion of the processing) may be skipped in the rendering phase. In this case, the output of the previous rendering may be used as the output of the current rendering, as described above.
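The third-stage comparison might be sketched as follows, assuming the second-stage bitmask comparison has already succeeded so that a single bitmask describes the tile in both renderings. The dictionary layout mapping a primitive block index to its stored data, and the function names, are hypothetical.

```python
def indicated_blocks(bitmask):
    """Indices of the primitive blocks whose bits are set in a per-tile bitmask."""
    return [j for j in range(bitmask.bit_length()) if (bitmask >> j) & 1]


def tile_is_redundant(bitmask, current_blocks, previous_blocks):
    """Third-stage test: every primitive block indicated for the tile must match.

    current_blocks / previous_blocks map a primitive block index to its stored
    data (raw bytes or a hash); this layout is an assumption for illustration.
    """
    return all(
        j in current_blocks
        and j in previous_blocks
        and current_blocks[j] == previous_blocks[j]
        for j in indicated_blocks(bitmask)
    )
```

Only the primitive blocks indicated by the tile's bitmask are consulted, so a change in a primitive block that the tile does not reference leaves the tile redundant.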

As described above, in some examples, the primitive block data 814 may be in the form of hash data. The test unit 312 may require an exact match of the one or more hash values of the primitive blocks associated with the tile to determine that the tile in the current rendering is redundant. If the information characterizing the associated primitive blocks is in the form of multiple hash values (e.g., for multiple primitive blocks), each of these hash values may have to match the corresponding hash value of the previous rendering individually in order for the test unit 312 to determine that the primitive contents match.
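A minimal sketch of the hash-based comparison is given below. The hash function is not specified in the text above; SHA-256 from the Python standard library is used purely as a stand-in, and the function names are hypothetical.

```python
import hashlib


def hash_primitive_block(block_bytes):
    """Hash the serialized data of one primitive block (SHA-256 is an assumption)."""
    return hashlib.sha256(block_bytes).digest()


def hashes_match(current_hashes, previous_hashes):
    """Require every hash to equal its counterpart from the previous rendering."""
    if len(current_hashes) != len(previous_hashes):
        return False
    return all(c == p for c, p in zip(current_hashes, previous_hashes))
```

A practical implementation would likely choose a cheaper hash suited to hardware; the point of the sketch is only that each hash is matched individually and exactly.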

The amount of data retrieved for comparing the primitive blocks associated with a tile is greater than in the other stages of the redundancy test. However, without comparing this data, an accurate decision on redundancy cannot be made. The impact on GPU performance and power consumption is mitigated by the multi-stage test procedure described above. By ensuring that unsuitable or non-redundant tiles are eliminated early from the test process, using small amounts of data and simple comparisons, retrieval of the larger amounts of data required for the accurate comparison is minimized.

Compared with the "pre-geometry stage data comparison followed by post-geometry stage data comparison" technique described above, the "primitive block comparison" technique loses some granularity in detecting redundant tile renderings. In particular, if one primitive within a primitive block changes between renderings, none of the tiles that identify that primitive block will be considered redundant, even though the changed primitive is not actually present in all of those tiles. However, the "primitive block comparison" technique has the advantage of being more efficient in terms of storage and computation. First, primitive blocks are created in the GPU anyway for use in the tiling process, so the technique uses a data structure that already exists. Second, the use of primitive blocks gives more efficient data storage. This is because each primitive is contained in only a single primitive block and is therefore stored only once, regardless of how many tiles the primitive is in. In contrast, with the "pre-geometry stage data comparison followed by post-geometry stage data comparison" technique, primitive data is stored in association with tiles, and is therefore stored multiple times when the primitive is located in multiple tiles. Third, when hashing is used, the use of primitive blocks is more computationally efficient. This is because only one hash is computed per primitive block and each primitive is contained in only a single primitive block, so each primitive is hashed only once. In contrast, with the "pre-geometry stage data comparison followed by post-geometry stage data comparison" technique, primitives are hashed multiple times when they are located in multiple tiles.
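The storage and hashing argument can be illustrated with a small count. In the sketch below, the scene, the primitive names, and the tile names are hypothetical; each primitive is mapped to the set of tiles in which it is located.

```python
def per_tile_copies(primitive_tiles):
    """Primitive records stored (or hashed) when data is kept per tile."""
    return sum(len(tiles) for tiles in primitive_tiles.values())


def per_block_copies(primitive_tiles):
    """Primitive records stored (or hashed) when each primitive sits in one primitive block."""
    return len(primitive_tiles)


# Hypothetical scene: primitive name -> set of tiles the primitive is located in.
primitive_tiles = {
    "P0": {"T0"},
    "P1": {"T0", "T1"},
    "P2": {"T0", "T1", "T2", "T3"},
}
```

For this scene the per-tile scheme handles seven primitive records (1 + 2 + 4), whereas the primitive block scheme handles three, one per primitive.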

To increase the granularity of detecting redundant tiles in the "primitive block comparison" technique, in some examples, the primitive block generator 802 may divide each primitive block formed for the current rendering into multiple portions. The primitive block generator 802 may, for example, divide each primitive block into a fixed number of portions, or into fixed-size portions. Each primitive block portion may then contain a subset of the primitives within the primitive block. If each primitive block is partitioned into "n" portions, the tiling unit 308 may then generate "n" bitmasks for each tile, each bitmask being associated with a portion number and indicating which primitive blocks have, in that portion, one or more primitives located within the tile. This is schematically illustrated in fig. 12.

Fig. 12 shows an example in which two primitive blocks 1202 and 1204 are formed by the primitive block generator 802 for the current rendering. The primitive block generator 802 then splits each primitive block into multiple portions, in this example two portions. The primitive block portions of each primitive block are passed from the primitive block generator 802 to the data characterization unit 310.

The tiling unit 308 then generates, for each tile, a number of bitmasks equal to the number of portions into which each primitive block is divided. Thus, in this example, the tiling unit 308 generates two bitmasks per tile, shown at 1206 and 1208 in fig. 12. Each bitmask corresponds to a respective primitive block portion number (shown by the dashed line in fig. 12). Thus, each bitmask of a tile indicates which primitive block portions of a given portion number contain primitives that lie within the tile. In the example shown in fig. 12, the bitmask 1206 indicates which primitive block portions of portion number 1 contain primitives that lie within the tile, and the bitmask 1208 indicates which primitive block portions of portion number 2 contain primitives that lie within the tile. In other words, in the example of fig. 12, where each primitive block is split into two halves, the bitmask 1206 indicates when a primitive from the first half of any primitive block is located within the tile, and the bitmask 1208 indicates when a primitive from the second half of any primitive block is located within the tile.

The relationship between the bitmasks and the primitive block portions of a tile may be represented mathematically as follows: each bitmask $b_i$ indicates which primitive block portions $pb_{ij}$, $j = 1, \ldots, N$, contain primitives located within the tile, where $pb_{ij}$ is portion number $i$ of primitive block $j$, and $N$ is the number of primitive blocks formed for the rendering.
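The generation of one bitmask per portion number might be sketched as follows. In this illustrative model, which is an assumption rather than the described hardware, a primitive block is a list of primitives, each primitive is represented simply by the set of tiles in which it is located, and portions and primitive blocks are indexed from 0 rather than from 1.

```python
def split_into_portions(block, n):
    """Split a primitive block (a list of primitives) into n contiguous portions."""
    size = -(-len(block) // n)  # ceiling division
    return [block[k * size:(k + 1) * size] for k in range(n)]


def portion_bitmasks(blocks, tile, n):
    """Return n bitmasks for `tile`; bit j of bitmask i is set when portion i
    of primitive block j contains a primitive located in the tile."""
    masks = [0] * n
    for j, block in enumerate(blocks):
        for i, portion in enumerate(split_into_portions(block, n)):
            if any(tile in primitive_tiles for primitive_tiles in portion):
                masks[i] |= 1 << j
    return masks


# Two hypothetical primitive blocks; each primitive is the set of tiles it covers.
blocks = [
    [{"T0"}, {"T0", "T1"}, {"T1"}, {"T2"}],  # primitive block 0
    [{"T2"}, {"T2"}, {"T0"}, {"T3"}],        # primitive block 1
]
```

With two portions per primitive block, `portion_bitmasks(blocks, "T0", 2)` yields one bitmask for the first halves and one for the second halves of the primitive blocks, mirroring the two bitmasks per tile of fig. 12.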

The data characterization unit 310 then stores the primitive block data 814 in the data characterization output block 810 as described above, except that the primitive block portions are stored separately (and optionally hashed separately). The data characterization unit 310 also stores a plurality of bitmasks per tile in the data characterization output block 810, each bitmask being associated with a different primitive block portion number. Note that while the primitive block generator 802 is described as partitioning the primitive blocks, in other examples this function may be performed by the data characterization unit 310. Likewise, the tiling unit 308 is described above as generating multiple bitmasks per tile, but in other examples these may be derived by the data characterization unit 310.

This approach increases the memory requirements of the memory block 304₁ compared to an example in which the primitive blocks are not partitioned into portions. This is because the size of the primitive block portion list is greater than the size of the primitive block list; and by dividing a primitive block into multiple portions, multiple bitmasks need to be stored per tile rather than a single bitmask per tile. However, this approach has the advantage of increasing the granularity. This is because under this approach, if the primitives within a tile are from a portion of the primitive block that did not change between renderings, the tile will be identified as redundant even if a primitive from another portion of the same primitive block has changed.
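The gain in granularity can be shown with a small sketch. Here `cur_data[i][j]` is assumed to hold the stored data (or hash) of portion i of primitive block j, and each tile has one bitmask per portion number; these names and layouts are hypothetical.

```python
def tile_redundant_with_portions(cur_masks, prev_masks, cur_data, prev_data):
    """Compare only the primitive block portions indicated by the tile's bitmasks."""
    if cur_masks != prev_masks:
        return False
    for i, mask in enumerate(cur_masks):
        for j in range(mask.bit_length()):
            if (mask >> j) & 1 and cur_data[i][j] != prev_data[i][j]:
                return False
    return True


# One primitive block (j = 0) split into two portions; only portion 1 changed.
prev_data = [["portion0"], ["portion1"]]
cur_data = [["portion0"], ["portion1-changed"]]

tile_a_masks = [0b1, 0b0]  # tile A only contains primitives from portion 0
tile_b_masks = [0b0, 0b1]  # tile B only contains primitives from portion 1
```

Tile A is identified as redundant and tile B is not; without portions, both tiles would reference the same changed primitive block and neither would be identified as redundant.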

Reference is now made to fig. 13A and 13B, which illustrate how the information of the current rendering and the previous rendering may be stored in memory for the first technique or the second technique described above. For example, the data characterization unit has been described with reference to steps 512 and 916 as storing the data characterization output block 416/810 in the memory block 304₁. The memory block 304₁ may also store the data characterization output block of the previous rendering. To minimize memory requirements, the memory block 304₁ may store the data characterization output blocks of only two renderings: the current rendering, and the previous rendering with which the current rendering is compared.

Thus, when the data characterization unit 310 writes the information of the current rendering to the memory block 304₁, it is important that the data characterization unit does not overwrite the information of the previous rendering with which the current rendering is compared. Fig. 13A and fig. 13B depict two approaches to avoiding this situation.

Fig. 13A schematically illustrates the storage locations in the memory block 304₁ at which the data characterization unit stores information when multiple renderings are performed. Each rectangular box represents the data characterization output block of a rendering. The value in each rectangular box represents a rendering number. In this example, rendering n is performed first, then rendering n+1, n+2, and so on. In this example, the data characterization unit 310 stores the information of the current rendering at the same location in the memory 304₁ each time. That is, when rendering n+1 is being performed, the information of rendering n+1 is stored in location 2; when rendering n+2 is being performed, the information of rendering n+2 is also stored in location 2; and so on. Thus, to avoid the information of the current rendering overwriting the information of the previous rendering (which would prevent the information from being compared), the information stored at location 2 is copied to location 1 at the end of each rendering. For example, after rendering n+1 ends, the information of that rendering is moved to location 1. The information of the next current rendering (n+2) may then be written to location 2, such that the information of both renderings n+1 and n+2 is stored in the memory 304₁ and can be compared.
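The fig. 13A scheme might be modelled as follows. The class and method names are hypothetical, and plain Python attributes stand in for the two storage locations in memory.

```python
class CopyBackStore:
    """Two storage locations; the current rendering is always written to location 2."""

    def __init__(self):
        self.location_1 = None  # holds the previous rendering's information
        self.location_2 = None  # holds the current rendering's information

    def write_current(self, info):
        """Store the current rendering's information at location 2."""
        self.location_2 = info

    def end_of_render(self):
        """Copy location 2 to location 1 so that location 2 can be overwritten next."""
        self.location_1 = self.location_2
```

The cost of this scheme is the copy performed in `end_of_render` after every rendering.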

An alternative method of storing the information is shown in fig. 13B. In this method, the data characterization unit 310 is arranged to store the data characterization output block of the current rendering at a location in the memory 304₁ that depends on the storage location of the data characterization output block of the previous rendering. For example, when the current rendering is rendering n+1, the information of the previous rendering n is stored in location 1, and thus the data characterization unit 310 causes the information of rendering n+1 to be stored in the other storage location (location 2). When the current rendering is rendering n+2, the information of the previous rendering is stored in location 2, and thus the data characterization unit causes the information of rendering n+2 to be stored in the other storage location (location 1).

In order for the data characterization unit 310 to know where to write the information of each rendering, the data characterization unit may store an indication of the storage location of the current rendering relative to the storage location of the previous rendering. For example, when the current rendering is rendering n+1, the data characterization unit stores, in the memory 304₁, an indication of the storage location of the information of rendering n+1 relative to the storage location of the information of the previous rendering n. When the current rendering is rendering n+2, the data characterization unit uses the indication of the storage location of the information of the previous rendering (now rendering n+1) relative to the storage location of the information of the rendering before it (rendering n) to determine where to store the information of the current rendering n+2. The advantage of this approach is that the stored information need not be transferred between storage locations at the end of each rendering.
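The fig. 13B scheme might be sketched as an alternating ("ping-pong") pair of locations, with a single stored index serving as the indication of where the current rendering is held. The class name, the method names, and the use of a one-bit index are assumptions for illustration; index 0 corresponds to location 1 and index 1 to location 2.

```python
class PingPongStore:
    """Two storage locations written alternately; no copy at the end of a rendering."""

    def __init__(self):
        self.locations = [None, None]
        # Stored indication of which location holds the current rendering.
        # Starts at 1 so that the first rendering lands in index 0 (location 1).
        self.current_index = 1

    def begin_render(self):
        """Select the location that does not hold the previous rendering."""
        self.current_index ^= 1

    def write_current(self, info):
        """Store the current rendering's information at the selected location."""
        self.locations[self.current_index] = info

    @property
    def previous(self):
        """Information of the previous rendering, held at the other location."""
        return self.locations[self.current_index ^ 1]
```

Compared with the copy-based scheme of fig. 13A, only the index changes between renderings, so no data is moved.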

FIG. 14 illustrates a computer system in which the graphics processing system described herein may be implemented. The computer system includes a CPU 1402, a GPU 1404, a memory 1406, a Neural Network Accelerator (NNA) 1411, and other devices 1414, such as a display 1416, speakers 1418, and a camera 1422. The components of the computer system may communicate with each other via a communication bus 1420.

The graphics processing systems of fig. 1-14 are shown as including a plurality of functional blocks. This is merely illustrative and is not intended to define a strict division between the different logic elements of such entities. Each of the functional blocks may be provided in any suitable manner. It should be appreciated that intermediate values described herein as being formed by a graphics processing system need not be physically generated by the graphics processing system at any point in time, and may merely represent logical values that conveniently describe the processing performed by the graphics processing system between its inputs and outputs.

The graphics processing units described herein may be embodied as hardware on an integrated circuit. The graphics processing system described herein may be configured to perform any of the methods described herein. Generally, any of the functions, methods, techniques, or components described above may be implemented in software, firmware, hardware (e.g., fixed logic circuitry) or any combination thereof. The terms "module," "functionality," "component," "element," "unit," "block," and "logic" may be used herein to generally represent software, firmware, hardware, or any combination thereof. In the case of a software implementation, the module, functionality, component, element, unit, block or logic represents program code that performs specified tasks when executed on a processor. The algorithms and methods described herein may be executed by one or more processors executing code that causes the processors to perform the algorithms/methods. Examples of a computer-readable storage medium include Random Access Memory (RAM), read-only memory (ROM), optical disks, flash memory, hard disk memory, and other memory devices that can store instructions or other data using magnetic, optical, and other techniques and that can be accessed by a machine.

The terms computer program code and computer readable instructions as used herein refer to any kind of executable code for a processor, including code expressed in a machine language, an interpreted language, or a scripting language. Executable code includes binary code, machine code, byte code, code defining an integrated circuit (such as a hardware description language or netlist), and code expressed in a programming language such as C, Java or OpenCL. The executable code may be, for example, any kind of software, firmware, script, module, or library that, when suitably executed, processed, interpreted, compiled, or run in a virtual machine or other software environment, causes the processor of the computer system supporting the executable code to perform the tasks specified by the code.

The processor, computer, or computer system may be any kind of device, machine, or special purpose circuit, or a collection or portion thereof, that has processing capabilities such that it can execute instructions. The processor may be or include any kind of general purpose or special purpose processor, such as a CPU, a GPU, an NNA, a system on a chip, a state machine, a media processor, an Application Specific Integrated Circuit (ASIC), a programmable logic array, a Field Programmable Gate Array (FPGA), or the like. The computer or computer system may include one or more processors.

The invention is also intended to cover software defining the configuration of hardware as described herein, such as HDL (hardware description language) software, as used for designing integrated circuits, or for configuring programmable chips to perform desired functions. That is, a computer readable storage medium having encoded thereon computer readable program code in the form of an integrated circuit definition data set may be provided, which when processed (i.e., run) in an integrated circuit manufacturing system, configures the system to manufacture a graphics processing system configured to perform any of the methods described herein, or to manufacture a graphics processing system comprising any of the devices described herein. The integrated circuit definition data set may be, for example, an integrated circuit description.

Accordingly, a method of manufacturing a graphics processing system as described herein at an integrated circuit manufacturing system may be provided. Furthermore, an integrated circuit definition data set may be provided that, when processed in an integrated circuit manufacturing system, causes a method of manufacturing a graphics processing system to be performed.

The integrated circuit definition data set may be in the form of computer code, for example, as a netlist, as code for configuring a programmable chip, or as a hardware description language defining hardware suitable for fabrication in an integrated circuit at any level, including as Register Transfer Level (RTL) code, as a high-level circuit representation (such as Verilog or VHDL), and as a low-level circuit representation (such as OASIS (RTM) and GDSII). A higher-level representation, such as RTL, that logically defines hardware suitable for fabrication in an integrated circuit may be processed at a computer system configured to generate a manufacturing definition of an integrated circuit in the context of a software environment that includes definitions of circuit elements and rules for combining those elements, so as to generate the manufacturing definition of the integrated circuit defined by the representation. As is typically the case when software is executed at a computer system to define a machine, one or more intermediate user steps (e.g., providing commands, variables, etc.) may be required in order for a computer system configured to generate a manufacturing definition of an integrated circuit to execute the code defining the integrated circuit so as to generate that manufacturing definition.

An example of processing an integrated circuit definition data set at an integrated circuit manufacturing system to configure the system to manufacture a graphics processing system will now be described with respect to fig. 15.

Fig. 15 illustrates an example of an Integrated Circuit (IC) fabrication system 1502 configured to fabricate a graphics processing system as described in any of the examples herein. In particular, IC fabrication system 1502 includes layout processing system 1504 and integrated circuit generation system 1506. IC fabrication system 1502 is configured to receive an IC definition data set (e.g., defining a graphics processing system as described in any of the examples herein), process the IC definition data set, and generate an IC from the IC definition data set (e.g., embodying a graphics processing system as described in any of the examples herein). Through processing of the IC definition data set, IC fabrication system 1502 is configured to fabricate an integrated circuit embodying a graphics processing system as described in any of the examples herein.

Layout processing system 1504 is configured to receive and process the IC definition data set to determine a circuit layout. Methods of determining a circuit layout from an IC definition data set are known in the art and may involve, for example, synthesizing RTL code to determine a gate level representation of the circuit to be generated, for example in terms of logic components (e.g., NAND, NOR, AND, OR, MUX and FLIP-FLOP components). The circuit layout may be determined from the gate level representation of the circuit by determining location information for the logic components. This may be done automatically or with user involvement in order to optimize the circuit layout. When the layout processing system 1504 has determined a circuit layout, it may output the circuit layout definition to the IC generation system 1506. The circuit layout definition may be, for example, a circuit layout description.

As is known in the art, the IC generation system 1506 generates ICs from circuit layout definitions. For example, the IC generation system 1506 may implement a semiconductor device fabrication process that generates ICs, which may involve a multi-step sequence of photolithography and chemical processing steps during which electronic circuits are developed on a wafer made of semiconductor material. The circuit layout definition may be in the form of a mask that may be used in a lithographic process to generate an IC from the circuit definition. Alternatively, the circuit layout definitions provided to the IC generation system 1506 may be in the form of computer readable code that the IC generation system 1506 may use to form an appropriate mask for generating the IC.

The different processes performed by IC fabrication system 1502 may all be implemented at one location, e.g., by one party. Alternatively, IC fabrication system 1502 may be a distributed system such that some processes may be performed at different locations and by different parties. For example, some of the following stages may be performed at different locations and/or by different parties: (i) synthesizing RTL code representing the IC definition data set to form a gate level representation of the circuit to be generated; (ii) generating a circuit layout based on the gate level representation; (iii) forming a mask according to the circuit layout; and (iv) using the mask to fabricate the integrated circuit.

In other examples, processing of the integrated circuit definition data set in the integrated circuit manufacturing system may configure the system to manufacture the graphics processing system without processing the integrated circuit definition data set to determine the circuit layout. For example, an integrated circuit definition dataset may define a configuration of a reconfigurable processor such as an FPGA, and processing of the dataset may configure the IC manufacturing system to generate (e.g., by loading configuration data into the FPGA) the reconfigurable processor having the defined configuration.

In some embodiments, the integrated circuit manufacturing definition data set, when processed in the integrated circuit manufacturing system, may cause the integrated circuit manufacturing system to generate an apparatus as described herein. For example, an apparatus as described herein may be manufactured by configuring an integrated circuit manufacturing system in the manner described above with respect to fig. 15 through an integrated circuit manufacturing definition dataset.

In some examples, the integrated circuit definition dataset may contain software running on or in combination with hardware defined at the dataset. In the example shown in fig. 15, the IC generation system may be further configured by the integrated circuit definition data set to load firmware onto the integrated circuit in accordance with the program code defined at the integrated circuit definition data set at the time of manufacturing the integrated circuit or to otherwise provide the integrated circuit with the program code for use with the integrated circuit.

Embodiments of the concepts set forth in the present application in apparatuses, devices, modules, and/or systems (and in methods implemented herein) may result in performance improvements over known embodiments. Performance improvements may include one or more of increased computational performance, reduced latency, increased throughput, and/or reduced power consumption. During the manufacture of such devices, apparatuses, modules and systems (e.g., in integrated circuits), a tradeoff may be made between performance improvement and physical implementation, thereby improving the manufacturing method. For example, a tradeoff may be made between performance improvement and layout area, matching the performance of known embodiments but using less silicon. This may be accomplished, for example, by reusing functional blocks in a serial fashion or by sharing functional blocks among elements of a device, apparatus, module, and/or system. Conversely, the concepts described herein that lead to improvements in the physical implementation of devices, apparatuses, modules, and systems (e.g., reduced silicon area) may be traded for improved performance. This may be accomplished, for example, by fabricating multiple instances of a module within a predefined area budget.

The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the application.

Claims

1. A method of performing rendering using a graphics processing unit configured to implement a tile-based graphics pipeline in which a rendering space is subdivided into a plurality of tiles, the method comprising:

receiving geometry data for the rendering, the geometry data comprising a plurality of primitives, each primitive associated with one or more vertex shader programs;

processing the geometric data using the one or more vertex shader programs to generate one or more processed primitives;

determining which of the processed primitives are located within each tile of the plurality of tiles;

for at least one selected tile of the plurality of tiles, storing i) a representation of per-tile vertex shader data that identifies the one or more vertex shader programs used to generate the processed primitives located in the tile, and ii) a representation of per-tile rendering data that is usable in rendering the processed primitives within the tile in a subsequent stage of the graphics pipeline; and

for the or each selected tile, determining whether an output of a previous rendering of the tile is capable of being used as the output of the rendering by comparing the per-tile vertex shader data of the tile with the vertex shader data of the previous rendering before comparing the per-tile rendering data of the tile with the per-tile rendering data of the previous rendering.

2. The method of claim 1, wherein determining whether the output of the previous rendering of the tile is capable of being used as the output of the rendering comprises:

determining whether the per-tile vertex shader data matches corresponding per-tile vertex shader data of the previous rendering;

in response to determining that the per-tile vertex shader data matches, determining whether the per-tile rendering data of the tile matches corresponding per-tile rendering data of the previous rendering; and

in response to determining that the per-tile rendering data matches, using the output of the previous rendering of the tile as the output of the rendering.

3. The method of claim 2, wherein determining whether the output of the previous rendering of the tile is capable of being used as the output of the rendering further comprises: in response to determining that the per-tile vertex shader data does not match, causing the graphics pipeline to render the tile.

4. The method of claim 2 or 3, wherein determining whether the output of the previous rendering of the tile is capable of being used as the output of the rendering further comprises: in response to determining that the per-tile rendering data does not match, the graphics pipeline is caused to render the tile.

5. The method of any preceding claim, further comprising storing rendering range data indicative of one or more characteristics of the rendering, and, before determining whether the output of a previous rendering of the tile can be used as the output of the rendering, using the rendering range data to check whether to skip the per-tile vertex shader data and per-tile rendering data comparison and cause the graphics pipeline to render the tile.

6. The method of claim 5, wherein the rendering range data comprises a clear color, and using the rendering range data to check whether to skip the per-tile vertex shader data and per-tile rendering data comparison comprises determining whether the clear color matches the clear color of the previous rendering.

7. The method of claim 5 or 6, wherein the rendering range data includes a valid flag, and using the rendering range data to check whether to skip the per-tile vertex shader data and per-tile rendering data comparison includes determining whether the valid flag has a predetermined value.

8. The method of claim 7, further comprising setting the valid flag to the predetermined value based on at least one of: data indicating that the rendering is part of a scene using a plurality of render targets; and the rendering including more than a threshold number of draw calls.

9. The method of any preceding claim, further comprising storing, for the at least one selected tile of the plurality of tiles, iii) per-tile validity data indicating whether to skip the per-tile vertex shader data and per-tile rendering data comparison, wherein the per-tile validity data is set based on a number of processed primitives located within the tile.

10. The method of any preceding claim, wherein the per-tile rendering data comprises vertex coordinates and vertex state data for each of the processed primitives located within the tile.

11. The method of any preceding claim, wherein storing the representation of the per-tile rendering data comprises generating a hash of the vertex coordinates and the vertex state data for each of the processed primitives located within the tile, and storing the resulting hash value.

12. The method of any preceding claim, wherein the vertex state data comprises data associated with each vertex used to render the processed primitives in the tile, including one or more of: pixel shader identifiers, varyings, color data, surface normal data, and texture data.

13. A graphics processing system configured to implement a tile-based graphics pipeline in which a rendering space is subdivided into a plurality of tiles, the graphics processing system comprising:

geometry processing logic configured to: receive geometry data for a rendering, the geometry data comprising a plurality of primitives, each primitive being associated with one or more vertex shader programs, and process the geometry data using the one or more vertex shader programs to generate one or more processed primitives;

a tiling unit configured to determine which of the processed primitives are located within each tile;

a data characterization unit configured to store, in memory, for at least one selected tile of the plurality of tiles, i) a representation of per-tile vertex shader data that identifies the one or more vertex shader programs used to generate the processed primitives located in the tile, and ii) a representation of per-tile rendering data that is usable to render the processed primitives within the tile in a subsequent stage of the graphics pipeline; and

a test unit configured to determine, for the or each selected tile, whether an output of a previous rendering of the tile can be used as the output of the rendering by comparing the per-tile vertex shader data of the tile with vertex shader data of the previous rendering before comparing the per-tile rendering data of the tile with per-tile rendering data of the previous rendering.

14. The graphics processing system of claim 13, wherein to determine whether the output of the previous rendering of the tile is capable of being used as the output of the rendering, the test unit is further configured to:

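determine whether the per-tile vertex shader data matches corresponding per-tile vertex shader data of the previous rendering;

in response to determining that the per-tile vertex shader data matches, determine whether the per-tile rendering data of the tile matches the corresponding per-tile rendering data of the previous rendering; and

in response to determining that the per-tile rendering data matches, use the output of the previous rendering of the tile as the output of the rendering.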

15. The graphics processing system of claim 14, wherein to determine whether the output of the previous rendering of the tile is capable of being used as the output of the rendering, the test unit is further configured to: cause the graphics pipeline to render the tile in response to determining that the per-tile vertex shader data does not match.

16. The graphics processing system of claim 14 or 15, wherein to determine whether the output of the previous rendering of the tile is capable of being used as the output of the rendering, the test unit is further configured to: cause the graphics pipeline to render the tile in response to determining that the per-tile rendering data does not match.

17. The graphics processing system of any of claims 13 to 16, wherein the data characterization unit is further configured to store rendering range data indicative of one or more characteristics of the rendering, and to use the rendering range data, prior to the determination of whether the output of a previous rendering of the tile can be used as the output of the rendering, to check whether to skip the per-tile vertex shader data and per-tile rendering data comparison and cause the graphics pipeline to render the tile.

18. The graphics processing system of any of claims 13 to 17, wherein the data characterization unit is further configured to store, for the at least one selected tile of the plurality of tiles, iii) per-tile validity data indicating whether to skip the per-tile vertex shader data and per-tile rendering data comparison and render the primitives located within the tile, wherein the per-tile validity data is set based on a number of processed primitives located within the tile.

19. The graphics processing system of any one of claims 13 to 18, wherein the graphics processing system is embodied in hardware on an integrated circuit.

20. A method of manufacturing a graphics processing system as claimed in any one of claims 13 to 19 using an integrated circuit manufacturing system.

21. A computer readable storage medium having computer readable code encoded thereon, the computer readable code configured to cause the method of any of claims 1 to 12 to be performed when the code is run.

22. A computer readable storage medium having stored thereon an integrated circuit definition data set that, when processed in an integrated circuit manufacturing system, configures the integrated circuit manufacturing system to manufacture a graphics processing system as claimed in any one of claims 13 to 19.
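
By way of illustration only, the ordering of checks recited in claims 1 to 7, 9 and 11 might be sketched in software as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the names `RenderRangeData`, `TileRecord`, `hash_render_data` and `can_reuse_previous_output` are hypothetical, the choice of BLAKE2b as the hash is an assumption, and a valid flag of `False` is assumed to be the value that causes the comparison to be skipped and the tile to be rendered.

```python
from dataclasses import dataclass
from hashlib import blake2b
from typing import FrozenSet, Optional, Sequence, Tuple


@dataclass(frozen=True)
class RenderRangeData:
    """Hypothetical rendering range data stored once per rendering."""
    valid: bool                               # assumed: False means "skip comparison"
    color: Tuple[float, float, float, float]  # color recorded for the rendering


@dataclass(frozen=True)
class TileRecord:
    """Hypothetical per-tile data stored for a selected tile."""
    shader_ids: FrozenSet[int]  # identifies the vertex shader programs used
    render_hash: bytes          # hash of vertex coordinates and vertex state
    valid: bool = True          # per-tile validity data


def hash_render_data(primitives: Sequence[tuple]) -> bytes:
    """Hash (coordinates, state) pairs of the primitives located in a tile."""
    h = blake2b(digest_size=16)
    for coords, state in primitives:
        h.update(repr((coords, state)).encode("utf-8"))
    return h.digest()


def can_reuse_previous_output(
    curr_range: RenderRangeData,
    prev_range: Optional[RenderRangeData],
    curr: TileRecord,
    prev: Optional[TileRecord],
) -> bool:
    """Return True if the previous output of the tile may be reused."""
    # Rendering-wide gate: skip the per-tile comparison entirely if needed.
    if prev_range is None or not curr_range.valid:
        return False
    if curr_range.color != prev_range.color:
        return False
    # Per-tile gate: no previous record, or the tile is flagged as not comparable.
    if prev is None or not curr.valid:
        return False
    # Stage 1: compare the vertex shader data first.
    if curr.shader_ids != prev.shader_ids:
        return False
    # Stage 2: only then compare the per-tile rendering data.
    return curr.render_hash == prev.render_hash
```

In this sketch a return value of `False` corresponds to causing the graphics pipeline to render the tile, and `True` corresponds to using the output of the previous rendering. Comparing the shader identifiers before the rendering data hash reflects the ordering recited in claim 1; the sketch does not attempt to model how the previous output is stored or retrieved.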