CN111062858A - Efficient rendering-ahead method, device and computer storage medium


Info

Publication number: CN111062858A (application CN201911380883.2A; granted as CN111062858B)
Authority: CN (China)
Prior art keywords: rendering, primitive, tile, rendering core
Other languages: Chinese (zh)
Inventors: 张竞丹, 李洋, 樊良辉, 陈成
Assignee: Xi'an Xintong Semiconductor Technology Co., Ltd.
Legal status: Active (granted)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00: General purpose image data processing
    • G06T1/20: Processor architectures; Processor configuration, e.g. pipelining

Abstract

The embodiments of the invention disclose an efficient render-ahead method, device, and computer storage medium. The method may include the following steps: after the clipping and division module divides each primitive according to tile size, the divided primitive is passed immediately to the rasterization module for rasterization; for a primitive whose rasterization is complete, a scheduler schedules, from the rendering core array, a general-purpose rendering core to render it, according to the working states of the general-purpose rendering cores in the array and the tiles covered by the primitive; and a fragment shader module, implemented on the scheduled general-purpose rendering core, performs fragment shading on the primitive for each tile it covers.

Description

Efficient rendering-ahead method, device and computer storage medium
Technical Field
The embodiments of the invention relate to the technical field of Graphics Processing Units (GPUs), and in particular to an efficient render-ahead method, device, and computer storage medium.
Background
Generally, a GPU is a dedicated graphics rendering device for processing and displaying computerized graphics. GPUs have a highly parallel structure that processes a range of complex algorithms more efficiently than a typical general-purpose Central Processing Unit (CPU). For example, the complex algorithms may correspond to representations of two-dimensional (2D) or three-dimensional (3D) computerized graphics.
GPUs generally render graphics with one of two schemes: Immediate Mode Rendering (IMR) or Tile-Based Rendering (TBR). In the IMR scheme, as soon as the GPU generates a command to draw a primitive while rendering a frame, it runs the full graphics rendering pipeline on that primitive (typically vertex shading, primitive assembly, clipping, rasterization, fragment shading, depth testing, blending, and so on, in order), writes the result directly back to the frame buffer, and then processes the next primitive. To reduce frame-buffer traffic, a high-bandwidth on-chip cache is added inside the GPU. However, the GPU's physical area constrains the size of this cache, so when it cannot hold an entire frame, the frame is split into image blocks (tiles) small enough that each tile fits in the on-chip cache. For example, if the on-chip cache can store 512 kB of data, the frame is divided into tiles such that each tile's pixel data is at most 512 kB. A scene is then rendered by dividing the frame into cache-sized tiles, rendering each tile of the scene individually into the on-chip cache, writing the rendered tile out to the frame buffer, and repeating for every tile of the frame. Accordingly, the frame is rendered tile by tile. This technique is the TBR scheme.
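The tile-sizing arithmetic in this example can be sketched as follows (square power-of-two tiles and 4 bytes per pixel are assumptions for illustration, not taken from the patent):

```python
def tile_grid(screen_w, screen_h, cache_bytes, bytes_per_pixel=4):
    """Pick the largest power-of-two square tile whose pixel data fits in
    the on-chip cache, then return the tile side and the tile-grid size."""
    side = 1
    while (side * 2) ** 2 * bytes_per_pixel <= cache_bytes:
        side *= 2
    # Ceiling division: partial tiles at the right and bottom edges count.
    tiles_x = -(-screen_w // side)
    tiles_y = -(-screen_h // side)
    return side, tiles_x, tiles_y
```

With the 512 kB cache from the example and 4-byte pixels, `tile_grid(1920, 1080, 512 * 1024)` selects 256x256 tiles (256 kB each; a 512x512 tile would need 1 MB) arranged in an 8x5 grid.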
It is understood that the TBR scheme is a form of deferred rendering and, thanks to its low power consumption, is widely used in mobile devices, where power and system bandwidth are at a premium.
In the TBR scheme, the rasterization module and the general-purpose rendering cores must wait for every tile to finish building its primitive list (that is, for each tile to record which primitives in the frame cover its pixel region) before the rasterization module starts processing primitives, and the general-purpose rendering cores begin rendering each tile's primitives only after rasterization completes. In other words, the rasterization module and the general-purpose rendering cores sit idle while the primitive lists are built; and once the lists are built, each primitive or tile must wait for the rasterization module or the rendering core to finish the previous one before it can proceed. The current TBR scheme therefore suffers from load imbalance.
Disclosure of Invention
In view of the above, embodiments of the present invention provide an efficient render-ahead method, device, and computer storage medium that balance the load inside the GPU and improve rendering efficiency.
The technical scheme of the embodiment of the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides an efficient render-ahead method, where the method includes:
after each primitive is divided according to tile size by the clipping and division module, immediately passing the divided primitive to the rasterization module for rasterization;
for a primitive whose rasterization is complete, scheduling, by a scheduler, a general-purpose rendering core from the rendering core array to render it, according to the working states of the general-purpose rendering cores in the array and the tiles covered by the primitive;
and performing, by a fragment shader module implemented on the scheduled general-purpose rendering core, fragment shading on the primitive for each tile it covers.
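The three claimed steps can be reduced to a single per-primitive loop (a minimal sketch; the function names, the dictionary shapes, and the bare idle-flag scheduling policy are illustrative assumptions, not the patented implementation):

```python
def render_ahead(primitives, cores, rasterize, fragment_shade):
    """Render-ahead sketch: each primitive is rasterized and fragment-shaded
    as soon as it is divided; there is no global binning barrier."""
    results = []
    for prim in primitives:
        # Step 1: the divided primitive goes straight to rasterization.
        frags = rasterize(prim)
        # Step 2: schedule an idle general-purpose rendering core for it
        # (working state here is a bare "idle" flag; a real scheduler also
        # weighs which tiles the primitive covers).
        core = next(c for c in cores if c["idle"])
        core["idle"] = False
        # Step 3: fragment shading, once per tile the primitive covers.
        for tile in prim["tiles"]:
            results.append(fragment_shade(core["id"], prim["id"], tile, frags))
        core["idle"] = True
    return results
```

Unlike the conventional TBR flow described in the background, no primitive waits for the others to finish being divided before rasterization and shading begin.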
In a second aspect, an embodiment of the present invention provides a GPU, including a clipping and division module, a rasterization module, a scheduler, and a general-purpose rendering core, wherein:
the clipping and division module is configured to divide each primitive according to tile size and then immediately pass the divided primitive to the rasterization module;
the rasterization module is configured to rasterize the incoming primitive and, once rasterization completes, notify the general-purpose rendering core to perform fragment shading on the primitive;
the scheduler is configured to schedule, from the rendering core array, the general-purpose rendering core that will render the rasterized primitive, according to the working states of the general-purpose rendering cores in the array and the tiles covered by the primitive;
the fragment shader module is configured to perform, under the scheduler's control, fragment shading on the rasterized primitive for each tile it covers.
In a third aspect, an embodiment of the present invention provides a computer storage medium storing an efficient render-ahead program, which when executed by at least one processor implements the steps of the efficient render-ahead method according to the first aspect.
The embodiments of the invention provide an efficient render-ahead method, device, and computer storage medium. As soon as the clipping and division module finishes dividing a primitive, the divided primitive is passed to the rasterization module, and fragment shading follows as soon as rasterization completes. There is thus no need to wait until all primitives have been divided before invoking the rasterization module and then the fragment shader module; this raises the utilization of each rendering core in the rendering core array of the GPU's graphics rendering pipeline and balances the load across the rendering core array.
Drawings
FIG. 1 is a block diagram of a computing device that may implement one or more aspects of an embodiment of the invention;
FIG. 2 is a block diagram of a GPU that may implement one or more aspects of an embodiment of the present disclosure;
FIG. 3 is a block diagram of a graphics processing pipeline formed by the GPU architecture of FIG. 2;
FIG. 4 is a schematic diagram of a graphic to be rendered according to an embodiment of the present invention;
FIG. 5 is a schematic flowchart of an efficient render-ahead method according to an embodiment of the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
Referring to FIG. 1, a computing device 100 is configured to implement one or more aspects of embodiments of the invention. The computing device 100 may include, but is not limited to: wireless devices, mobile or cellular telephones (including so-called smartphones), Personal Digital Assistants (PDAs), video game consoles (including video displays, mobile video game devices, and mobile video conferencing units), laptop computers, desktop computers, television set-top boxes, tablet computing devices, electronic book readers, fixed or mobile media players, and the like. In the example of FIG. 1, computing device 100 may include a Central Processing Unit (CPU) 102 and a system memory 104 that communicate via an interconnection path that may include a memory bridge 105. The memory bridge 105, which may be, for example, a north bridge chip, is connected via a bus or other communication path 106 (such as a HyperTransport link) to an I/O (input/output) bridge 107. I/O bridge 107, which may be, for example, a south bridge chip, receives user input from one or more user input devices 108 (e.g., a keyboard, mouse, trackball, a touch screen incorporated into display device 110, or another type of input device) and forwards the input to CPU 102 via path 106 and memory bridge 105. GPU 112 is coupled to memory bridge 105 via a bus or other communication path 113 (e.g., PCI Express, Accelerated Graphics Port, or HyperTransport link); in one embodiment, GPU 112 may be a graphics subsystem that delivers pixels to display device 110 (e.g., a conventional CRT- or LCD-based monitor). System disk 114 is also connected to I/O bridge 107. Switch 116 provides a connection between I/O bridge 107 and other components, such as network adapter 118 and various add-in cards 120 and 121.
Other components (not explicitly shown), including USB or other port connections, CD drives, DVD drives, film recording devices, and the like, may also be connected to I/O bridge 107. Communication paths interconnecting the various components in fig. 1 may be implemented using any suitable protocols, such as PCI (peripheral component interconnect), PCI-Express, AGP (accelerated graphics port), hypertransport, or any other bus or point-to-point communication protocol, and connections between different devices may use different protocols as is known in the art.
In one embodiment, GPU112 includes circuitry optimized for graphics and video processing, including, for example, video output circuitry. In another embodiment, GPU112 includes circuitry optimized for general purpose processing while preserving the underlying (underlying) computing architecture. In yet another embodiment, GPU112 may be integrated with one or more other system elements, such as memory bridge 105, CPU 102, and I/O bridge 107, to form a system on a chip (SoC).
It will be appreciated that the system shown herein is exemplary and that variations and modifications are possible. The connection topology, including the number and arrangement of bridges, the number of CPUs 102, and the number of GPUs 112, may be modified as desired. For example, in some embodiments, system memory 104 is directly connected to CPU 102 rather than through a bridge, and other devices communicate with system memory 104 via memory bridge 105 and CPU 102. In other alternative topologies, GPU112 is connected to I/O bridge 107 or directly to CPU 102, rather than to memory bridge 105. While in other embodiments, I/O bridge 107 and memory bridge 105 may be integrated onto a single chip. Numerous embodiments may include two or more CPUs 102 and two or more GPUs 112. The particular components shown herein are optional; for example, any number of add-in cards or peripherals may be supported. In some embodiments, switch 116 is eliminated and network adapter 118 and add-in cards 120, 121 are directly connected to I/O bridge 107.
Based on the computing device 100 shown in FIG. 1, FIG. 2 is a schematic block diagram of a GPU 112 that may implement one or more aspects of embodiments of the present invention, in which a graphics memory 204 may be part of GPU 112. Thus, GPU 112 may read data from and write data to graphics memory 204 without using a bus. In other words, GPU 112 may process data locally using local storage instead of off-chip memory; such graphics memory 204 may be referred to as on-chip memory. This allows GPU 112 to operate more efficiently by eliminating the need to read and write data via a bus, which may experience heavy traffic. In some cases, however, GPU 112 may not include separate memory, but instead utilize system memory 104 via a bus. Graphics memory 204 may include one or more volatile or non-volatile memories or storage devices, such as Random Access Memory (RAM), Static RAM (SRAM), Dynamic RAM (DRAM), Erasable Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), flash memory, magnetic data media, or optical storage media.
Based on this, GPU112 may be configured to perform various operations related to: generate pixel data from graphics data provided by CPU 102 and/or system memory 104 via memory bridge 105 and bus 113, interact with local graphics memory 204 (e.g., a general frame buffer) to store and update pixel data, transfer pixel data to display device 110, and so on.
In operation, CPU 102 is the main processor of computing device 100, controlling and coordinating the operation of other system components. Specifically, CPU 102 issues commands that control the operation of GPU 112. In some embodiments, CPU 102 writes command streams for GPU112 into data structures (not explicitly shown in fig. 1 or 2) that may be located in system memory 104, graphics memory 204, or other storage locations accessible to both CPU 102 and GPU 112. A pointer to each data structure is written to a pushbuffer to initiate processing of the command stream in the data structure. GPU112 reads the command stream from one or more pushbuffers and then executes the commands asynchronously with respect to the operation of CPU 102. Execution priority may be specified for each pushbuffer to control scheduling of different pushbuffers.
As particularly depicted in FIG. 2, the GPU112 includes an I/O (input/output) unit 205 that communicates with the rest of the computing device 100 via a communication path 113 that is connected to the memory bridge 105 (or, in an alternative embodiment, directly to the CPU 102). The connection of the GPU112 to the rest of the computing device 100 may also vary. In some embodiments, GPU112 may be implemented as an add-in card that may be inserted into an expansion slot of computer system 100. In other embodiments, GPU112 may be integrated on a single chip with a bus bridge, such as memory bridge 105 or I/O bridge 107. While in other embodiments some or all of the elements of GPU112 may be integrated with CPU 102 on a single chip.
In one embodiment, communication path 113 can be a PCI-EXPRESS link in which a dedicated channel is allocated to GPU112 as is known in the art. The I/O unit 205 generates data packets (or other signals) for transmission over the communication path 113 and also receives all incoming data packets (or other signals) from the communication path 113, directing the incoming data packets to the appropriate components of the GPU 112. For example, commands related to processing tasks may be directed to scheduler 207, while commands related to memory operations (e.g., reads or writes to graphics memory 204) may be directed to graphics memory 204.
GPU 112 may include a rendering core array 230, which may contain C general-purpose rendering cores 208, where C > 1, and D fixed-function rendering cores 209. Through the general-purpose rendering cores 208 in array 230, GPU 112 can perform a large number of program or compute tasks concurrently. For example, each rendering core may be programmed to perform processing tasks related to a wide variety of programs, including, but not limited to, linear and non-linear data transformations, video and/or audio data filtering, modeling operations (e.g., applying the laws of physics to determine the position, velocity, and other attributes of objects), graphics rendering operations (e.g., tessellation, vertex, geometry, and/or fragment shader programs), and so on.
In some examples, fixed-function rendering cores 209 may include, for example, a processing unit that performs primitive assembly; a processing unit that performs clipping and division operations, including culling the assembled primitives and dividing them by tile size; a processing unit that performs rasterization, including transforming primitives and outputting fragments to the fragment shader; and a processing unit that performs per-fragment operations including, for example, depth testing, scissor testing, and alpha blending. The pixel data output by these operations may be composited and output for display on display device 110.
In addition, rendering core array 230 may receive processing tasks to be performed from scheduler 207. Scheduler 207 may independently schedule the tasks for execution by resources of GPU112, such as one or more rendering cores 208, 209 in rendering core array 230. In one example, scheduler 207 may be a hardware processor. In the example shown in fig. 2, scheduler 207 may be included in GPU 112. In other examples, scheduler 207 may also be a separate unit from CPU 102 and GPU 112. Scheduler 207 may also be configured as any processor that receives a stream of commands and/or operations.
Scheduler 207 may process one or more command streams, scheduling the operations they contain for execution by GPU 112. Specifically, scheduler 207 may process the command streams and schedule their operations for execution by rendering core array 230. In operation, CPU 102, through the GPU driver 103 stored in system memory 104 in FIG. 1, may send scheduler 207 a command stream comprising a series of operations to be performed by GPU 112. Scheduler 207 may receive the command stream through I/O unit 205, process its operations sequentially in the order they appear in the stream, and schedule them for execution by one or more rendering cores in rendering core array 230.
Tile cache 232 is a small amount of very high-bandwidth memory located on-chip with GPU 112. Its size is too small to hold an entire frame's graphics data, so rendering core array 230 must perform multiple rendering passes to render it all; for example, rendering core array 230 may perform one pass per tile of a frame. Specifically, tile cache 232 may include one or more volatile or non-volatile memories or storage devices, such as Random Access Memory (RAM), Static RAM (SRAM), or Dynamic RAM (DRAM). In some examples, tile cache 232 may be an on-chip buffer, i.e., a buffer formed on, positioned on, and/or disposed on the same microchip, integrated circuit, and/or die as GPU 112. When tile cache 232 is implemented on the same chip as GPU 112, GPU 112 need not access it via communication path 113, but can instead use an internal communication interface (e.g., a bus) on the same chip; because this interface is on-chip, it can operate at a higher bandwidth than communication path 113. Therefore, although tile cache 232 has limited capacity, adds hardware overhead, and can cache the data of only one or a few small rectangles at a time, it avoids the cost of repeatedly accessing video memory, reducing bandwidth and saving power.
Based on the above description of FIG. 1 and FIG. 2, FIG. 3 shows an example of the graphics rendering pipeline 80 formed by the GPU 112 structure shown in FIG. 2. Note that the core of graphics rendering pipeline 80 is a logical structure formed by cascading the general-purpose rendering cores 208 and fixed-function rendering cores 209 of rendering core array 230; the scheduler 207, graphics memory 204, tile cache 232, and I/O unit 205 of GPU 112 are peripheral circuits or devices supporting this logical structure. Graphics rendering pipeline 80 generally comprises programmable stages (the rounded blocks in FIG. 3) and fixed-function stages (the square blocks in FIG. 3): the programmable stages may be executed by the general-purpose rendering cores 208 of rendering core array 230, and the fixed-function stages may be implemented by the fixed-function rendering cores 209. As shown in FIG. 3, graphics rendering pipeline 80 includes the following stages in order:
vertex fetch module 82, shown in the example of FIG. 3 as a fixed function stage, is generally responsible for supplying graphics data (triangles, lines, and dots) to graphics rendering pipeline 80. For example, vertex crawling module 82 may collect vertex data for high-order surfaces, primitives, and the like, and output vertex data and attributes to vertex shader module 84.
Vertex shader module 84, shown as a programmable stage in FIG. 3, is responsible for processing the received vertex data and attributes, performing a set of operations once per vertex.
Primitive assembly module 86, shown in FIG. 3 as a fixed function stage, is responsible for collecting the vertices output by vertex shader module 84 and assembling the vertices into geometric primitives. For example, primitive assembly module 86 may be configured to group every three consecutive vertices into a geometric primitive (i.e., a triangle). In some embodiments, a particular vertex may be repeated for consecutive geometric primitives (e.g., two consecutive triangles in a triangle strip may share two vertices).
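The two assembly behaviors described here can be sketched as follows (function names are illustrative; the winding-order correction that real strip assembly applies to alternating triangles is omitted):

```python
def assemble_triangles(vertices):
    """Independent triangles: every three consecutive vertices form one primitive."""
    return [tuple(vertices[i:i + 3]) for i in range(0, len(vertices) - 2, 3)]

def assemble_strip(vertices):
    """Triangle strip: each consecutive triangle shares two vertices
    with the previous one."""
    return [tuple(vertices[i:i + 3]) for i in range(len(vertices) - 2)]
```

For the same four vertices, independent assembly yields one triangle while strip assembly yields two, which is why strips reduce vertex traffic.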
Clipping and division module 88, shown in FIG. 3 as a fixed-function stage, clips and culls the assembled primitives and then divides them according to tile size.
rasterization module 90 is typically a fixed function stage responsible for preparing the primitives for fragment shader module 92. For example, rasterization module 90 may generate fragments for shading by fragment shader module 92.
A fragment shader module 92, shown in FIG. 3 as a programmable stage, receives fragments from rasterization module 90 and generates per-pixel data such as color. Fragment shader module 92 may also perform per-pixel processing such as texture blending and lighting model calculations.
The output merger module 94, shown in FIG. 3 as a fixed functional stage, is generally responsible for performing various operations on the pixel data, such as performing transparency tests (alpha test), stencil tests (stencil test), and blending the pixel data with other pixel data corresponding to other segments associated with the pixel. When the output merger module 94 has finished processing the pixel data (i.e., the output data), the processed pixel data may be written to a render target to produce a final result.
In a conventional TBR scheme, the screen area is usually divided into tiles of equal size. For one frame, after the primitive assembly stage, GPU 112 computes from each primitive's extent which screen tiles it covers and builds a primitive list for each tile; whenever a tile is covered by a primitive, that tile's primitive list records the corresponding primitive information, until all primitives have been collected. In the subsequent rasterization stage, GPU 112 traverses each tile's primitive list (a tile may be covered by multiple primitives); as each primitive in the list is rendered, the tile's data is written into the on-chip cache, and once every primitive in the list has been processed, the tile's final data is written to video memory. Specifically, in terms of the graphics rendering pipeline 80 of FIG. 3, the conventional TBR scheme comprises the following steps: 1. vertex shader module 84 executes the vertex shading program on the vertices; 2. primitive assembly module 86 assembles primitives, and clipping and division module 88 performs clipping and division; 3. step 2 repeats until all primitives have been divided into tiles; 4. each tile is traversed, and rasterization module 90 rasterizes each primitive in the tile's primitive list; 5. fragment shader module 92 executes the fragment shading program on each primitive's pixels; 6. output merger module 94 performs depth testing, blending, and similar operations on each primitive's pixels; 7. as each primitive finishes, its results are written back to the on-chip cache, and once all primitives in a tile have been processed, the tile is written back to system memory 104.
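The seven steps above can be reduced to a sketch that makes the binning barrier explicit (hypothetical function names and a much-simplified data model; write-back to cache and memory is omitted):

```python
def conventional_tbr(primitives, num_tiles, rasterize, shade):
    """Conventional TBR sketch: steps 2-3 bin every primitive into per-tile
    lists before any tile is rasterized or shaded (the binning barrier)."""
    primitive_lists = [[] for _ in range(num_tiles)]
    for prim in primitives:                    # steps 2-3: build the lists
        for tile in prim["tiles"]:
            primitive_lists[tile].append(prim)
    out = []                                   # steps 4-6: per-tile rendering
    for tile, plist in enumerate(primitive_lists):
        for prim in plist:
            out.append(shade(tile, rasterize(prim)))
    return out
```

The first loop must run to completion before the second starts, which is exactly the idle period for the rasterizer and the general-purpose rendering cores that the next paragraph describes.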
From the above process it can be seen that during steps 1, 2, and 3, the fixed-function rendering core 209 responsible for rasterization performs no work, and the general-purpose rendering cores 208 perform no work during steps 2, 3, and 4.
As can be seen from the above, in the common unified rendering architecture most rendering cores sit idle while the primitive lists are built, causing load imbalance and poor rendering efficiency. The embodiments of the invention therefore provide a technique that, when rendering with the TBR scheme, balances the load across rendering cores and improves rendering efficiency: once the tiles covered by an assembled primitive are determined, the subsequent rendering operations (rasterization, fragment shading, test and blend, and so on) are performed on that primitive immediately, without waiting for all primitives to be assembled and divided. This improves rendering efficiency and rendering core utilization and balances the load on the GPU's rendering cores.
In some examples, clipping and division module 88 is configured to pass each primitive to rasterization module 90 immediately after dividing it by tile size; rasterization module 90 is configured to rasterize the incoming primitive and, when rasterization completes, notify the fragment shader module 92 (implemented on a general-purpose rendering core 208) to perform fragment shading on the primitive; scheduler 207 is configured to schedule, from rendering core array 230, the general-purpose rendering core 208 that will render the rasterized primitive, according to the working states of the general-purpose rendering cores 208 in the array and the tiles the primitive covers; and fragment shader module 92 is configured to perform, under scheduler 207's control, fragment shading on the rasterized primitive for each tile it covers.
According to the above example, compared with the conventional TBR scheme, each primitive is passed to the rasterization module 90 as soon as the clipping and dividing module 88 has divided it, and fragment shading follows as soon as rasterization completes; there is no need to wait for all primitives to be divided before calling the rasterization module 90 and then the general rendering core 208. This improves the utilization of the rendering cores 208 and 209 in the rendering core array 230 of the graphics rendering pipeline 80 in the GPU 112 and balances the load of the rendering core array 230.
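The ordering difference between the two schemes can be sketched in a few lines. This is a minimal illustration, not the patented hardware; the stage names and the list-of-tuples log are assumptions made for the sketch.

```python
def eager_flow(primitives):
    """Eager scheme described above: each primitive is rasterized and
    fragment-shaded as soon as it is divided into tiles, before later
    primitives have even been binned."""
    log = []
    for p in primitives:
        log += [("bin", p), ("raster", p), ("shade", p)]
    return log


def classic_tbr_flow(primitives):
    """Classic TBR: every primitive must be binned before any
    rasterization or shading starts, leaving shading cores idle."""
    return ([("bin", p) for p in primitives]
            + [("raster", p) for p in primitives]
            + [("shade", p) for p in primitives])
```

With two primitives, `eager_flow` reaches the shading of primitive 1 at the third pipeline event, while `classic_tbr_flow` only reaches it at the fifth; the same work is done either way, only the ordering (and hence core utilization) differs.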
In the above example, preferably, the clipping and dividing module 88 is further configured to create a primitive relationship table for each primitive once that primitive has been divided according to the tile size, and to store the primitive relationship table in the system memory 104. In a specific implementation, the primitive relationship table includes at least all state information required for rendering the divided primitive, the tiles covered by the divided primitive, a first flag bit for each covered tile identifying whether that tile of the divided primitive has been processed by a general rendering core 208, and a second flag bit identifying whether the divided primitive has been fully rendered.
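One possible in-memory shape of such a table entry is sketched below; the field names are illustrative assumptions, not the patent's actual layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PrimitiveRecord:
    """Hypothetical entry of the primitive relationship table."""
    prim_id: int
    render_state: dict                       # all state needed to render
    covered_tiles: List[Tuple[int, int]]     # tiles the primitive covers
    # one "first flag bit" per covered tile: processed by a core yet?
    tile_processed: Dict[Tuple[int, int], bool] = field(default_factory=dict)
    fully_rendered: bool = False             # the "second flag bit"

    def __post_init__(self):
        # initialize every per-tile flag to "not yet processed"
        self.tile_processed = {t: False for t in self.covered_tiles}
```

A record created for a primitive covering two tiles starts with both per-tile flags cleared and the rendered flag unset, matching the description above.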
In the above example, preferably, the scheduler 207 is configured to read the primitive relationship table corresponding to the rasterized primitive from the system memory 104, allocate general rendering cores 208 according to the tiles covered by the rasterized primitive as recorded in the primitive relationship table, and update a scheduling table based on that allocation. The scheduling table records, based on the correspondence between general rendering cores 208 and tiles, the identifier of each general rendering core 208 in the rendering core array 230, its working state, the tile identifier it holds, and the identifier of the primitive covering that tile. The scheduler 207 then schedules the general rendering cores 208 according to the scheduling table and their current working states to perform the fragment shading operation on the rasterized primitive.
For the above preferred example, specifically, the scheduler 207 is configured to check the working states of all the general rendering cores 208 in the rendering core array 230 after reading the primitive relationship table; for a general rendering core 208 in the idle state, to schedule, among the tiles covered by the rasterized primitive, the tiles whose tile identifier matches the one held by that idle core onto it for fragment shading; and, for a general rendering core 208 in the busy state, to wait until that core becomes idle and then schedule onto it the tiles, among those covered by the rasterized primitive, whose tile identifier matches the one held by that core, for fragment shading.
For the above specific description, the scheduler 207 is further configured to determine, according to a set maintenance policy, whether to perform an eviction operation or a maintenance operation when all general rendering cores 208 are busy after some of the tiles covered by the rasterized primitive have already been scheduled to their corresponding cores. The eviction operation removes an already-processed tile from a busy general rendering core 208 so that the core can render the remaining tiles covered by the rasterized primitive; the maintenance operation keeps the current state unchanged and waits for a busy general rendering core 208 to become idle before scheduling it.
For the above example, take the graph shown in fig. 4 as an example. The size of a single tile is set to 32 × 32; the rendering core array 230 in the GPU 112 includes 32 general rendering cores 208 to perform vertex shading and fragment shading, and each general rendering core 208 has at least one on-chip cache of 32 × 32 × 3 bytes, that is, each on-chip cache can store at least the color, stencil, and depth values of one tile. The screen size of the display device 110 is 1080 × 1920, so the screen can be divided into 60 × 34 tiles in total. Suppose that 3 triangle primitives are currently to be drawn in a frame, corresponding to triangle primitive No. 1 in fig. 4 (the solid-line transparent triangle), triangle primitive No. 2 (the solid-line gray-filled triangle), and triangle primitive No. 3; for the two cases described later, triangle primitive No. 3 is either triangle primitive No. 3A (shown as the dashed-line transparent triangle in fig. 4) or triangle primitive No. 3B (shown as the dashed-line gray-filled triangle in fig. 4). The tiles covered by these triangle primitives are shown in fig. 4. With reference to fig. 1, fig. 2, and fig. 3, if the graph shown in fig. 4 is rendered by the above exemplary technical solution, the specific process may include the following steps:
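The tile counts in this setup follow directly from the stated numbers; a quick check (only values taken from the text above):

```python
import math

TILE = 32                         # a single tile is 32 x 32 pixels
SCREEN_W, SCREEN_H = 1920, 1080   # the 1080 x 1920 display

tiles_x = math.ceil(SCREEN_W / TILE)   # 1920 / 32 = 60
tiles_y = math.ceil(SCREEN_H / TILE)   # 1080 / 32 = 33.75, rounded up to 34

cache_bytes = 32 * 32 * 3              # per-core on-chip cache size stated above
```

This reproduces the 60 × 34 tile grid and the 3072-byte per-core on-chip cache of the example.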
Step 1: after the CPU 102 finishes preparing the vertex data, the GPU driver 203 controls the general rendering core 208 to perform vertex shading on the 9 vertices and write the processed vertex data back to the system memory 104.
Step 2: the vertex fetch module 82 included in the graphics rendering pipeline 80 retrieves the processed vertex data from the system memory 104; the primitive assembly module 86 assembles every three vertices into a triangle according to the primitive information;
Step 3: the clipping and dividing module 88 receives the first assembled triangle and performs the clipping operation on it;
Step 4: the clipping and dividing module 88 divides the first clipped triangle into tiles; after the division is completed, the triangle is added to the created primitive list. Specifically, the primitive list needs to include all state information required for rendering the triangle, plus two flag bits that respectively mark whether the current tile has been processed by a general rendering core 208 and whether all primitives in the list have been processed. Finally, the primitive list is written back to the system memory 104;
Step 5: the vertex fetch module 82, the primitive assembly module 86, and the clipping and dividing module 88 process the second triangle according to the flow of steps 1 to 4 above; at the same time, the rasterization module 90 starts rasterizing the first triangle. When that rasterization finishes, the fragment shader module 92 implemented by the general rendering core 208 is notified to process the tiles covered by the first triangle, and the second triangle, which has gone through the above steps, is then rasterized.
Step 6: after receiving the processing request, the fragment shader module 92 obtains the primitive list constructed in step 4 from the system memory 104, allocates the tiles covered by the first triangle, and records the general rendering core ID and the triangle ID corresponding to each tile. It then processes the request sent for the second triangle.
Step 7: the fragment shader module 92 starts processing the tiles covered by the first triangle; since each general rendering core 208 processes one tile, the 32 general rendering cores 208 can process 32 tiles at the same time.
For step 7, the scheduler 207 maintains a scheduling table, an exemplary template of which is shown in Table 1: the first column indicates the ID of the general rendering core, the second column indicates whether that core is currently busy, the third column indicates the ID of the tile covered by the triangle, and the last column indicates the primitive to which the tile currently being executed by the general rendering core 208 belongs.
General rendering core ID | General rendering core state | Tile ID | Triangle ID
TABLE 1
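An in-memory rendition of the Table 1 template can make the scheduler's bookkeeping concrete. The real table is scheduler state in hardware, so this dict layout and its key names are assumptions for illustration only.

```python
NUM_CORES = 32   # 32 general rendering cores in the running example

# One row per general rendering core, mirroring Table 1's four columns:
# core ID (the key), busy state, held tile ID, and owning triangle ID.
schedule_table = {
    core_id: {"busy": False, "tile": None, "triangle": None}
    for core_id in range(NUM_CORES)
}
```

Initially every row is idle and holds no tile, which is the state before the fragment shader module 92 begins processing triangle No. 1.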
Based on the schedule template shown in Table 1, when the fragment shader module 92 starts processing the first triangle, such as triangle 1 in FIG. 4, the schedule is updated as shown in Table 2:
TABLE 2 (the populated scheduling table appears as an image in the original publication)
After the fragment shader module 92 finishes processing triangle No. 1, the color, depth, and stencil values of the pixels in each tile are written back to the on-chip caches. When triangle No. 2 arrives, the scheduler 207 first checks the state of each general rendering core 208. A tile covered by triangle No. 2 whose tile ID matches the tile already held by a core is allocated to that same core; a core that is idle but already holds a different tile must not be allocated the new tile, otherwise an error occurs; a tile with a new ID may otherwise go to any idle, unoccupied core. If the matching general rendering core 208 is busy, the scheduler must wait for it to change from busy to idle, as with general rendering cores 9, 10, and 11. At this point the table is updated as shown in Table 3:
TABLE 3 (the updated scheduling table appears as an image in the original publication)
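The allocation rule just described — a matching tile ID goes back to the core that holds it, an idle core occupied by a different tile must not be reused, and a busy holder forces a wait — can be sketched as follows. The table layout and the function name are assumptions carried over from the scheduling-table sketch, not the patent's hardware interface.

```python
def assign_tile(table, tile_id, tri_id):
    """Return the core ID a tile is assigned to, or None if the caller
    must wait (busy holder, or every core occupied by another tile)."""
    # Reuse the core that already holds this tile ID: its on-chip
    # color/depth/stencil data belongs to this very tile.
    holder = next((c for c, r in table.items() if r["tile"] == tile_id), None)
    if holder is not None:
        if table[holder]["busy"]:
            return None                       # wait for it to go idle
        table[holder].update(busy=True, triangle=tri_id)
        return holder
    # Otherwise only a core holding no tile at all may take the new tile;
    # reusing an idle core that holds a different tile would corrupt it.
    free = next((c for c, r in table.items()
                 if r["tile"] is None and not r["busy"]), None)
    if free is not None:
        table[free].update(busy=True, tile=tile_id, triangle=tri_id)
        return free
    return None                               # all cores hold other tiles
```

With a two-core table where core 0 idly holds tile (1,1), a new triangle covering (1,1) is routed back to core 0, a fresh tile takes the empty core 1, and any further tile must wait.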
When triangle No. 3 arrives, say triangle No. 3A, the scheduler 207 again checks for free general rendering cores 208; from Table 3 it can be seen that only 2 general rendering cores 208 are free and not occupied by any tile. As can be seen from fig. 4, the number of tiles covered by triangle No. 3A is greater than 2, so two of its tiles are first allocated to general rendering cores 30 and 31. As described above, all general rendering cores 208 are now occupied, so the scheduler 207 analyzes the current situation and decides whether to evict tiles from the table, freeing general rendering cores 208 to process the remaining tiles of triangle No. 3A, or to keep the table as it is. In the embodiment of the present invention, the scheduler 207 may maintain the scheduling table according to three principles: 1. preferentially process tiles covered by a large number of primitives; 2. preferentially process regions where triangles are dense; 3. preferentially discard tiles at the boundary. Based on these principles, the scheduler 207 finds that triangle No. 3A is far from the triangles already in the table, so it decides to keep the table unchanged and not process the remaining tiles of triangle No. 3A.
If, however, the current last triangle is triangle No. 3B, then tiles (0,0), (0,1), (0,2), (0,3), and (0,4) will be discarded according to the above maintenance principles; accordingly, the on-chip caches of the general rendering cores 208 occupied by these tiles are cleared, and the scheduler 207 also updates the relevant flag bits in the primitive list.
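The three maintenance principles can be read as an ordering over the occupied tiles: boundary tiles covered by few primitives are evicted first, while dense, heavily covered tiles are kept. A hedged sketch follows; the helper names `is_boundary` and `prim_count` are assumed to be supplied by the caller, and the sort key is one plausible encoding of the principles, not the patent's exact policy.

```python
def pick_eviction_victims(occupied_tiles, needed, is_boundary, prim_count):
    """Order occupied tiles by eviction preference and take the first
    `needed` ones: boundary tiles with the fewest covering primitives
    go first; interior, heavily covered tiles are evicted last."""
    victims = sorted(occupied_tiles,
                     key=lambda t: (not is_boundary(t), prim_count(t)))
    # The caller may still decide to "maintain" instead of evicting,
    # as in the triangle No. 3A case described above.
    return victims[:needed]
```

For a table holding a boundary tile with 1 primitive, another boundary tile with 2, and an interior tile with 3, asking for two victims returns the two boundary tiles — consistent with the triangle No. 3B case, where the column-0 boundary tiles are discarded.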
Step 8: after all tiles have been processed as per step 7, a depth test and blending operation is performed on each tile by the output merger module 94 of the graphics rendering pipeline 80, and the result is finally written back to the system memory 104.
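A toy stand-in for the per-pixel work of the output merger in step 8 is sketched below. The patent only says a depth test or blending operation is performed, so the specific choices here — a less-than depth test followed by source-over alpha blending, and the dict-based pixel layout — are assumptions for illustration.

```python
def merge_fragment(dst, src):
    """Apply a less-than depth test, then source-over blend the passing
    fragment's color into the destination pixel (both rules assumed)."""
    if src["depth"] < dst["depth"]:            # depth test passes
        a = src["alpha"]
        dst["color"] = tuple(a * s + (1 - a) * d
                             for s, d in zip(src["color"], dst["color"]))
        dst["depth"] = src["depth"]
    return dst
```

A nearer opaque fragment replaces the pixel's color and depth; a farther fragment is rejected and leaves the pixel untouched.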
For the foregoing description, it can be understood that, compared with the conventional TBR scheme, by using the scheme for improving rendering efficiency provided by the embodiment of the present invention, after all primitive lists are created, tiles with a larger number of covered primitives are already rendered, so that the rendering efficiency of the TBR is improved, and the efficiency is improved more significantly along with the increase of the number of general rendering cores and the increase of on-chip storage.
Therefore, referring to fig. 5, it illustrates an efficient render-ahead method provided by an embodiment of the present invention, which may be applied to the GPUs shown in fig. 1, fig. 2, and fig. 3, and the method may include:
s501: after the clipping and dividing module 88 divides each primitive according to the tile size, the divided primitive is immediately passed to the rasterization module 90 for rasterization;
s502: for the primitive which has completed the rasterization operation, scheduling a general rendering core 208 for rendering the primitive which has completed the rasterization operation from the rendering core array through a scheduler 207 according to the working state of the general rendering core in the rendering core array and the tile covered by the primitive which has completed the rasterization operation;
s503: the rasterized primitive is fragment-shaded, for the tiles it covers, by the fragment shader module 92 implemented by the general rendering core, based on the scheduling of the scheduler.
In some examples, the method further comprises: after the clipping and dividing module 88 finishes dividing each primitive according to the size of tile, a primitive relation table is created according to the divided primitives, and the primitive relation table is stored in the system memory 104; the primitive relationship table at least includes all state information required for rendering the primitive subjected to division, tiles covered by the primitive subjected to division, a first flag bit corresponding to each tile for identifying whether the primitive subjected to division is processed by the general rendering core 208, and a second flag bit for identifying whether the primitive subjected to division is rendered.
In some examples, for the primitive that has completed the rasterization operation, the scheduler 207 schedules, from the rendering core array, the generic rendering core 208 for rendering the primitive that has completed the rasterization operation according to the working state of the generic rendering core in the rendering core array and the tile covered by the primitive that has completed the rasterization operation, including:
reading a primitive relation table corresponding to the primitive which has completed the rasterization operation from the system memory 104 through the scheduler 207;
distributing a general rendering core 208 by the scheduler 207 according to the tile covered by the recorded primitive which has completed the rasterization operation in the primitive relation table, and updating a scheduling table based on the distribution of the general rendering core 208, where the scheduling table is based on the corresponding relation between the general rendering core 208 and the tile and is used to represent the identifier of the general rendering core 208, the working state of the general rendering core 208, the tile identifier covered by the primitive, and the primitive identifier covered by the tile, which are included in the rendering core array 230;
and scheduling the general rendering core 208 by the scheduler 207 according to the scheduling table and the working state of the current general rendering core 208 so as to perform the fragment shading operation on the primitive which has completed the rasterization operation.
In some examples, the scheduling, by the scheduler 207, the generic rendering core 208 to perform the fragment shading operation on the primitive that has completed the rasterization operation according to the scheduling table and the working state of the current generic rendering core 208 includes:
after the scheduler 207 reads the primitive relation table, checking the working states of all the general rendering cores 208 in the rendering core array 230;
for a general rendering core 208 in the idle state, the scheduler 207 schedules, among the tiles covered by the rasterized primitive, the tiles whose tile identifier matches the one held by that idle core onto it for fragment shading;
and for a general rendering core 208 in the busy state, after the scheduler 207 waits for that core to become idle, scheduling onto it the tiles, among those covered by the rasterized primitive, whose tile identifier matches the one held by that core, for fragment shading.
In some examples, the method further comprises:
determining, by the scheduler 207, a removal operation or a maintenance operation according to a set maintenance policy in response to a situation that all the general rendering cores 208 are in a busy state after a part of tiles covered by the primitives that have completed the rasterization operation have been scheduled to the corresponding general rendering cores 208; the removing operation comprises removing the processed tile from the universal rendering core 208 in a busy state so that the universal rendering core 208 renders the remaining tiles in the tiles covered by the primitive which has completed the rasterization operation; the maintenance operation includes maintaining the current state unchanged, and scheduling the general rendering core 208 which is in the idle state after the general rendering core 208 which is waiting for the busy state is in the idle state.
For the above example, the maintenance policy includes at least one of:
preferentially processing tiles with a large number of primitives;
preferentially processing the areas with dense triangles;
tiles at boundaries are preferentially discarded.
In one or more of the examples above, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media may include computer data storage media or communication media, including any medium that facilitates transfer of a computer program from one place to another. A data storage medium may be any available medium that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementing the techniques described in this disclosure. By way of example, and not limitation, such computer-readable media can comprise a USB flash disk, a removable hard disk, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The code may be executed by one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or other equivalent programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. Accordingly, the terms "processor" and "processing unit" as used herein may refer to any of the foregoing structures or any other structure suitable for implementation of the techniques described herein. In addition, in some aspects, the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Also, the techniques may be fully implemented in one or more circuits or logic elements.
The techniques of embodiments of this disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an Integrated Circuit (IC), or a set of ICs (i.e., a chipset). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but do not necessarily require realization by different hardware units. Indeed, as described above, the various units may be combined in a codec hardware unit, in conjunction with suitable software and/or firmware, or provided by a collection of interoperative hardware units, including one or more processors as described above.
Various aspects of the present invention have been described. These and other embodiments are within the scope of the following claims. It should be noted that: the technical schemes described in the embodiments of the present invention can be combined arbitrarily without conflict.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (12)

1. An efficient method of rendering ahead, the method comprising:
after each primitive is divided according to the size of tile by the cutting and dividing module, immediately transmitting the divided primitives into the rasterization module to perform rasterization operation;
for the graphic elements which are subjected to the rasterization operation, scheduling a general rendering core for rendering the graphic elements subjected to the rasterization operation from a rendering core array through a scheduler according to the working state of the general rendering core in the rendering core array and tiles covered by the graphic elements subjected to the rasterization operation;
and performing, by a fragment shader module implemented by the general rendering core, fragment shading processing on the primitive which has completed the rasterization operation, for the tiles covered by that primitive, based on the scheduling of the scheduler.
2. The method of claim 1, further comprising:
after the cutting and dividing module divides each primitive according to the size of tile, creating a primitive relation table according to the divided primitives, and storing the primitive relation table into a system memory; the primitive relation table at least comprises all state information required for rendering the divided primitives, tiles covered by the divided primitives, a first flag bit corresponding to each tile and used for identifying whether the divided primitives are processed by a universal rendering core and a second flag bit used for identifying whether the divided primitives are rendered completely.
3. The method of claim 2, wherein the scheduling, by the scheduler, the general purpose rendering core for rendering the primitive that has completed the rasterization operation from the rendering core array according to the working state of the general purpose rendering core in the rendering core array and the tile covered by the primitive that has completed the rasterization operation, comprises:
reading a primitive relation table corresponding to the primitive which has completed the rasterization operation from the system memory through the scheduler;
distributing a general rendering core through the scheduler according to the tile covered by the recorded primitive which completes the rasterization operation in the primitive relation table, and updating a scheduling table based on the distribution of the general rendering core, wherein the scheduling table is based on the corresponding relation between the general rendering core and the tile and is used for representing the identifier of the general rendering core, the working state of the general rendering core, the tile identifier covered by the primitive and the primitive identifier covered by the tile in a rendering core array;
and scheduling the general rendering core by the scheduler according to the scheduling table and the working state of the current general rendering core so as to perform fragment coloring operation on the primitive which is subjected to rasterization operation.
4. The method of claim 3, wherein scheduling, by the scheduler, a generic rendering core to perform a fragment shading operation on the primitive that has completed the rasterization operation according to the scheduling table and a working state of a current generic rendering core comprises:
after the graphics primitive relation table is read, the working states of all the general rendering cores in the rendering core array are checked through the scheduler;
corresponding to the universal rendering core in the idle state, scheduling the tile with the same tile mark as the tile mark corresponding to the universal rendering core in the idle state in the tiles covered by the graphics primitives which finish the rasterization operation through the scheduler to perform fragment coloring processing;
and after waiting, through the scheduler, for the busy-state general rendering core to be converted into an idle state, scheduling the tiles, among the tiles covered by the graphics primitives which have completed the rasterization operation, with the same tile mark as the tile mark corresponding to the general rendering core converted into the idle state, to that general rendering core for fragment shading processing.
5. The method of claim 4, further comprising:
determining a rejection operation or a maintenance operation according to a set maintenance strategy by the scheduler under the condition that all the universal rendering cores are in a busy state after a part of tiles covered by the graphics primitives which finish the rasterization operation are scheduled to the corresponding universal rendering cores; the elimination operation comprises eliminating the processed tile from the universal rendering core in a busy state so that the universal rendering core can render the rest tiles in the tiles covered by the graphic element which is subjected to the rasterization operation; and the maintenance operation comprises maintaining the current state unchanged, and scheduling the universal rendering core which is transferred into the idle state after waiting for the universal rendering core in the busy state to be transferred into the idle state.
6. The method of claim 5, wherein the maintenance policy includes at least one of:
preferentially processing tiles with a large number of primitives;
preferentially processing the areas with dense triangles;
tiles at boundaries are preferentially discarded.
7. A Graphics Processor (GPU), the GPU comprising: a cutting and dividing module, a rasterization module, a scheduler and a general rendering core; wherein:
the cutting and dividing module is configured to divide each primitive according to the size of tile, and then immediately transmit the divided primitives into the rasterization module;
the rasterization module is configured to perform rasterization operation on the transmitted primitive and inform the general rendering core to perform fragment coloring processing on the primitive which is subjected to rasterization operation after the rasterization operation is completed;
the scheduler is configured to schedule the general rendering core for rendering the primitive which completes the rasterization operation from the rendering core array according to the working state of the general rendering core in the rendering core array and the tile covered by the primitive which completes the rasterization operation;
the general-purpose rendering core is configured to perform fragment shading processing on the primitive which is subjected to the rasterization operation aiming at the tile covered by the primitive which is subjected to the rasterization operation based on the scheduling of the scheduler.
8. The GPU of claim 7, wherein the clipping and partitioning module is further configured to create a primitive relationship table from each primitive that is partitioned according to tile size, and store the primitive relationship table in a system memory; the primitive relation table at least comprises all state information required for rendering the divided primitives, tiles covered by the divided primitives, a first flag bit corresponding to each tile and used for identifying whether the divided primitives are processed by a universal rendering core and a second flag bit used for identifying whether the divided primitives are rendered completely.
9. A GPU as claimed in claim 8, wherein the scheduler is configured to:
reading a primitive relation table corresponding to the primitive which completes the rasterization operation from the system memory; and
distributing the general rendering core according to the tile covered by the recorded primitive which has completed the rasterization operation in the primitive relation table, and updating a scheduling table based on the distribution of the general rendering core, wherein the scheduling table is based on the corresponding relation between the general rendering core and the tile and is used for representing the identifier of the general rendering core, the working state of the general rendering core, the tile identifier covered by the primitive and the primitive identifier covered by the tile included in the rendering core array; and
scheduling the general rendering core according to the scheduling table and the working state of the current general rendering core to perform the fragment shading operation on the primitive which has completed the rasterization operation.
10. A GPU as claimed in claim 9, wherein the scheduler is configured to:
after reading the primitive relation table, check the working states of all the general-purpose rendering cores in the rendering core array; and
for a general-purpose rendering core in the idle state, dispatch, among the tiles covered by the primitive that has completed the rasterization operation, the tile whose tile identifier matches the tile identifier corresponding to that idle general-purpose rendering core to that core for fragment shading processing; and
for a general-purpose rendering core in the busy state, wait until the busy general-purpose rendering core transitions to the idle state, and then dispatch, among the tiles covered by the primitive that has completed the rasterization operation, the tile whose tile identifier matches the tile identifier corresponding to the general-purpose rendering core that has transitioned to the idle state to that general-purpose rendering core 208 for fragment shading processing.
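The dispatch logic of claims 9 and 10 can be sketched as follows. This is a minimal illustration under assumed data shapes, not the patent's implementation: each schedule-table row is assumed to carry a core identifier, a working state, and the tile identifier the core is bound to, and a tile is dispatched only to the idle core whose tile identifier matches.

```python
IDLE, BUSY = "idle", "busy"

def dispatch(schedule_table, covered_tiles):
    """Dispatch tiles of a rasterized primitive to matching idle cores.

    schedule_table: list of rows, each {"core_id", "state", "tile_id"} (assumed shape).
    Returns ({core_id: tile_id} dispatched now, [tile_id] left waiting).
    """
    dispatched, waiting = {}, []
    for tile in covered_tiles:
        # find the core whose tile identifier matches this tile's identifier
        core = next((c for c in schedule_table if c["tile_id"] == tile), None)
        if core is not None and core["state"] == IDLE:
            core["state"] = BUSY              # core begins fragment shading for this tile
            dispatched[core["core_id"]] = tile
        else:
            waiting.append(tile)              # wait until the matching core turns idle
    return dispatched, waiting
```

A tile whose matching core is busy is simply queued; claim 10 has the scheduler dispatch it once that core transitions back to the idle state.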
11. The GPU of claim 10, wherein the scheduler is further configured to:
determine an eviction operation or a maintenance operation according to a set maintenance strategy when, after some of the tiles covered by the primitive that has completed the rasterization operation have been dispatched to the corresponding general-purpose rendering cores, all the general-purpose rendering cores are in the busy state; wherein the eviction operation comprises evicting processed tiles from a busy general-purpose rendering core so that the general-purpose rendering core can render the remaining tiles among the tiles covered by the primitive that has completed the rasterization operation; and the maintenance operation comprises keeping the current state unchanged and, after waiting for a busy general-purpose rendering core to transition to the idle state, scheduling the general-purpose rendering core that has transitioned to the idle state.
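The all-cores-busy decision of claim 11 can be sketched as a single policy function. The concrete maintenance strategy is left open by the claim, so the boolean switch `evict_when_all_busy` and the data shape below are assumptions for illustration only.

```python
def maintain(cores, evict_when_all_busy=True):
    """Claim 11 sketch: decide between eviction and maintenance when all cores are busy.

    cores: list of rows, each {"state", "finished_tiles"} (assumed shape).
    """
    if any(c["state"] == "idle" for c in cores):
        return "dispatch"                     # an idle core exists; normal scheduling applies
    if evict_when_all_busy:
        for c in cores:
            if c["finished_tiles"]:
                c["finished_tiles"].clear()   # evict processed tiles so remaining tiles can render
                return "evicted"
    return "wait"                             # maintain current state until a core turns idle
```

Eviction frees a busy core for the remaining tiles at the cost of extra bookkeeping; the maintenance path trades latency for simplicity by waiting for an idle transition.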
12. A computer storage medium storing an efficient rendering-ahead program that, when executed by at least one processor, implements the steps of the efficient rendering-ahead method of any one of claims 1-7.
CN201911380883.2A 2019-12-27 2019-12-27 Efficient rendering-ahead method, device and computer storage medium Active CN111062858B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911380883.2A CN111062858B (en) 2019-12-27 2019-12-27 Efficient rendering-ahead method, device and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911380883.2A CN111062858B (en) 2019-12-27 2019-12-27 Efficient rendering-ahead method, device and computer storage medium

Publications (2)

Publication Number Publication Date
CN111062858A true CN111062858A (en) 2020-04-24
CN111062858B CN111062858B (en) 2023-09-15

Family

ID=70304206

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911380883.2A Active CN111062858B (en) 2019-12-27 2019-12-27 Efficient rendering-ahead method, device and computer storage medium

Country Status (1)

Country Link
CN (1) CN111062858B (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180137677A1 (en) * 2016-11-17 2018-05-17 Samsung Electronics Co., Ltd. Tile-based rendering method and apparatus
CN110544290A (en) * 2019-09-06 2019-12-06 广东省城乡规划设计研究院 data rendering method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yu Ping: "Research and Application of a GPU-Accelerated Radiosity Illumination Algorithm" *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112446816A (en) * 2021-02-01 2021-03-05 成都点泽智能科技有限公司 Video memory dynamic data storage method and device and server
CN112801855A (en) * 2021-04-14 2021-05-14 南京芯瞳半导体技术有限公司 Method and device for scheduling rendering task based on graphics primitive and storage medium
CN113837920A (en) * 2021-08-18 2021-12-24 荣耀终端有限公司 Image rendering method and electronic equipment
CN116263982A (en) * 2022-04-20 2023-06-16 象帝先计算技术(重庆)有限公司 Graphics processor, system, method, electronic device and apparatus
CN116263982B (en) * 2022-04-20 2023-10-20 象帝先计算技术(重庆)有限公司 Graphics processor, system, method, electronic device and apparatus
WO2023202367A1 (en) * 2022-04-20 2023-10-26 象帝先计算技术(重庆)有限公司 Graphics processing unit, system, apparatus, device, and method
WO2024040815A1 (en) * 2022-08-23 2024-02-29 芯动微电子科技(珠海)有限公司 Graphic processing method and system
CN116385253A (en) * 2023-01-06 2023-07-04 格兰菲智能科技有限公司 Primitive drawing method, device, computer equipment and storage medium
CN116188244A (en) * 2023-04-25 2023-05-30 摩尔线程智能科技(北京)有限责任公司 Method, device, equipment and storage medium for distributing image blocks
CN116681575A (en) * 2023-07-27 2023-09-01 南京砺算科技有限公司 Graphics processing unit, graphics rendering method, storage medium, and terminal device
CN116681575B (en) * 2023-07-27 2023-12-19 南京砺算科技有限公司 Graphics processing unit, graphics rendering method, storage medium, and terminal device

Also Published As

Publication number Publication date
CN111062858B (en) 2023-09-15

Similar Documents

Publication Publication Date Title
CN111062858B (en) Efficient rendering-ahead method, device and computer storage medium
US7612783B2 (en) Advanced anti-aliasing with multiple graphics processing units
CN112801855B (en) Method and device for scheduling rendering task based on graphics primitive and storage medium
KR102475212B1 (en) Foveated rendering in tiled architectures
KR101813429B1 (en) Shader pipeline with shared data channels
KR100908779B1 (en) Frame buffer merge
CN108027955B (en) Techniques for storage of bandwidth-compressed graphics data
US20080055321A1 (en) Parallel physics simulation and graphics processing
CN103793893A (en) Primitive re-ordering between world-space and screen-space pipelines with buffer limited processing
KR20180056316A (en) Method and apparatus for performing tile-based rendering
US9679530B2 (en) Compressing graphics data rendered on a primary computer for transmission to a remote computer
EP2926321A1 (en) Graphics memory load mask for graphics processing
CN111127299A (en) Method and device for accelerating rasterization traversal and computer storage medium
CN111080761A (en) Method and device for scheduling rendering tasks and computer storage medium
CN111080505B (en) Method and device for improving graphic element assembly efficiency and computer storage medium
CN112991143A (en) Method and device for assembling graphics primitives and computer storage medium
CN116263982B (en) Graphics processor, system, method, electronic device and apparatus
CN110928610A (en) Method, device and computer storage medium for verifying shader function
JP2008262493A (en) Information processor and information processing method, program, and record medium
US11954038B2 (en) Efficient evict for cache block memory
CN114461406A (en) DMA OpenGL optimization method
CN114037795A (en) Invisible pixel eliminating method and device and storage medium
WO2023202366A1 (en) Graphics processing unit and system, electronic apparatus and device, and graphics processing method
US20230377086A1 (en) Pipeline delay elimination with parallel two level primitive batch binning
CN111179403A (en) Method and device for parallel generation of texture mapping Mipmap image and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room 301, Building D, Yeda Science and Technology Park, No. 300 Changjiang Road, Yantai Area, China (Shandong) Pilot Free Trade Zone, Yantai City, Shandong Province

Patentee after: Xi'an Xintong Semiconductor Technology Co.,Ltd.

Address before: Room 21101, 11 / F, unit 2, building 1, Wangdu, No. 3, zhangbayi Road, Zhangba Street office, hi tech Zone, Xi'an City, Shaanxi Province

Patentee before: Xi'an Xintong Semiconductor Technology Co.,Ltd.

CP02 Change in the address of a patent holder

Address after: Room 301, Building D, Yeda Science and Technology Park, No. 300 Changjiang Road, Yantai Area, China (Shandong) Pilot Free Trade Zone, Yantai City, Shandong Province, 265503

Patentee after: Xi'an Xintong Semiconductor Technology Co.,Ltd.

Address before: Room 301, Building D, Yeda Science and Technology Park, No. 300 Changjiang Road, Yantai Area, China (Shandong) Pilot Free Trade Zone, Yantai City, Shandong Province

Patentee before: Xi'an Xintong Semiconductor Technology Co.,Ltd.
