CN116188244A - Method, device, equipment and storage medium for distributing image blocks

Method, device, equipment and storage medium for distributing image blocks

Info

Publication number
CN116188244A
CN116188244A (application CN202310457192.8A)
Authority
CN
China
Prior art keywords: block, tile, load level, TBR, end module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310457192.8A
Other languages
Chinese (zh)
Other versions
CN116188244B (en)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Moore Threads Technology Co Ltd
Original Assignee
Moore Threads Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Moore Threads Technology Co Ltd
Priority to CN202310457192.8A
Publication of CN116188244A
Application granted
Publication of CN116188244B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Generation (AREA)
  • Computer And Data Communications (AREA)

Abstract

The embodiment of the application discloses a tile distribution method, a device, equipment and a storage medium, wherein the method comprises the following steps: determining a load level corresponding to each of a plurality of image blocks through a front end module of the TBR architecture; the load level is used to characterize the number of primitives present in the tile; transmitting the load level corresponding to each block to a back-end module of the TBR architecture; for each of the tiles, determining, by a back-end module of the TBR architecture, a target processor core corresponding to the tile from the at least two processor cores based on a status indicator corresponding to each of the processor cores in a set of status indicators corresponding to the tile; the arrangement sequence of the state indicators of the state indicator group corresponding to the image block is related to the load level of the image block.

Description

Method, device, equipment and storage medium for distributing image blocks
Technical Field
The present disclosure relates to, but is not limited to, the field of image processing technologies, and in particular to a tile distribution method, apparatus, device, and storage medium.
Background
Graphics processors (Graphics Processing Unit, GPUs) are specialized graphics rendering devices for processing and displaying computerized graphics. GPUs are constructed in a highly parallel architecture that provides more efficient processing than a typical general purpose central processing unit (Central Processing Unit, CPU) for a range of complex algorithms. For example, the complex algorithm may correspond to a representation of a two-dimensional or three-dimensional computerized graphic.
When a GPU renders graphics, especially under power and system bandwidth constraints, a tile-based rendering (Tile Based Rendering, TBR) scheme is typically employed, which splits a picture into tiles (also referred to as blocks) so that each tile can fit in the on-chip cache. For example, if the on-chip cache is capable of storing 512 kB of data, the picture may be divided into tiles such that the pixel data contained in each tile is less than or equal to 512 kB. In this way, a scene may be rendered by dividing the picture into tiles that fit the on-chip cache, rendering each tile of the scene into the on-chip cache individually, storing the rendered tile from the on-chip cache into a frame buffer, and repeating the rendering and storing for each tile of the picture. Thus, a picture is rendered tile by tile until every tile of the scene has been rendered. It can be appreciated that the TBR scheme is a form of deferred rendering and is widely used in mobile devices due to its low power consumption.
However, in the rendering process of the conventional TBR architecture, the workload is not always distributed evenly among the processor cores, which results in lower overall rendering performance.
Disclosure of Invention
In view of this, embodiments of the present application at least provide a tile distribution method, apparatus, device, and storage medium.
The technical scheme of the embodiment of the application is realized as follows:
In one aspect, an embodiment of the present application provides a tile distribution method applied to a graphics processor including at least two processor cores, the graphics processor performing a tile distribution process based on a tile-based rendering (TBR) architecture, the method comprising: determining a load level corresponding to each of a plurality of image blocks through a front-end module of the TBR architecture; the load level is used to characterize the number of primitives present in the tile; transmitting the load level corresponding to each block to a back-end module of the TBR architecture; for each of the tiles, determining, by the back-end module of the TBR architecture, a target processor core (Core) corresponding to the tile from the at least two processor cores based on a status indicator corresponding to each of the processor cores in a set of status indicators corresponding to the tile; the arrangement sequence of the state indicators of the state indicator group corresponding to the image block is related to the load level of the image block.
In some embodiments, the order of the status indicators includes a bit order of a corresponding status indicator for each of the processor cores; for each bit order, the number of the processor cores in the processor core set corresponding to the bit order is the same, and the processor core set corresponding to the bit order includes the processor cores corresponding to the bit order in the state indicator set corresponding to each load level.
In some embodiments, the determining, by the front-end module of the TBR architecture, a load level corresponding to each of a plurality of tiles includes: for each block, determining, by a front-end module of the TBR architecture, the number of primitives that fall within a block range of the block based on the locations of the primitives and the block range of the block; and determining the load level corresponding to each block based on the number of the primitives corresponding to each block.
In some embodiments, the determining the load level corresponding to each tile based on the number of primitives corresponding to each tile includes: acquiring a plurality of preset levels and a number interval corresponding to each preset level; and, for each tile, taking the preset level corresponding to the number interval into which the number of primitives corresponding to the tile falls as the load level corresponding to the tile.
In some embodiments, the obtaining a plurality of preset levels and a number interval corresponding to each preset level includes: acquiring rendering condition parameters of a current rendering environment; the rendering condition parameters include at least one of: a hardware parameter for characterizing a hardware performance of the graphics processor and a rendering target parameter for characterizing a computation amount of a rendering object; determining the number of the plurality of preset levels based on the rendering condition parameters; and acquiring the preset grades and a number interval corresponding to each preset grade based on the number of the preset grades.
In some embodiments, the hardware parameters include at least one of: the number of processor cores and the read-write speed of the memory; the rendering target parameters include at least one of: the size of the rendered object and the number of tiles.
In some embodiments, the transferring the load level corresponding to each of the tiles to the back-end module of the TBR architecture includes: writing the load level corresponding to each block into the block header information of the corresponding block information in the process that the front end module of the TBR architecture writes the block information of each block into the system memory; for each block, in response to a rendering event for the block, block header information of block information corresponding to the block is read from the system memory through a back-end module of the TBR architecture, and a load level corresponding to the block is obtained from the block header information.
In some embodiments, the method further comprises: encoding the load level corresponding to each block by a front-end module of the TBR architecture to obtain an encoded value of at least one bit; writing the load level corresponding to each block into the block header information of the corresponding block information, including: writing the coded value of at least one bit corresponding to each block into block header information of corresponding block information; the reading, by the back-end module of the TBR architecture, tile header information of tile information corresponding to the tile from the system memory, and obtaining a load level corresponding to the tile from the tile header information, includes: and reading the block header information of the block information corresponding to the block from the system memory through a back-end module of the TBR architecture, and decoding the coding value of at least one bit in the block header information to obtain the load level corresponding to the block.
In some embodiments, the determining, by the backend module of the TBR architecture, a target processor core corresponding to the tile from the at least two processor cores based on a status indicator corresponding to each of the processor cores in the set of status indicators corresponding to the tile comprises: traversing each state indicator through a back-end module of the TBR architecture according to the arrangement sequence of each state indicator corresponding to the block; and taking the processor core corresponding to the state indicator with the first value as the target processor core.
In some embodiments, the method further comprises: distributing rendering tasks corresponding to the tiles to the target processor cores; and in response to assigning the rendering task corresponding to the tile to the target processor core, updating a status indicator corresponding to the target processor core in the set of status indicators corresponding to the tile to the second value.
In some embodiments, the method further comprises: and resetting each state indicator in the state indicator set corresponding to the block to the first value in response to each state indicator in the state indicator set corresponding to the block being the second value.
In some embodiments, the method further comprises: acquiring a state machine based on the load level corresponding to each block by a back-end module of the TBR architecture; the state machine comprises a state indicator set corresponding to each load level.
In some embodiments, the determining the load level corresponding to each tile based on the number of primitives corresponding to each tile includes: acquiring a first preset grade and a corresponding first number of intervals, a second preset grade and a corresponding second number of intervals, a third preset grade and a corresponding third number of intervals, a fourth preset grade and a corresponding fourth number of intervals; for each image block, determining a target preset level as a load level corresponding to the image block in the first preset level, the second preset level, the third preset level and the fourth preset level based on the number of the image elements corresponding to the image block; the target preset grade is a preset grade corresponding to a number interval in which the number of the graphic elements corresponding to the graphic block falls.
In some embodiments, the transferring the load level corresponding to each of the tiles to the back-end module of the TBR architecture includes: encoding the load level corresponding to each block by a front-end module of the TBR architecture to obtain two bit encoding values; writing the coded values of two bits corresponding to each block into reserved bits in block header information of block information of each block; and for each block, reading the block header information of the block from the system memory through a back-end module of the TBR architecture, and decoding the encoded values of two bits of reserved bits in the block header information to obtain the load level corresponding to the block.
In another aspect, an embodiment of the present application provides a tile distribution apparatus for use in a graphics processor including at least two processor cores, the graphics processor performing a tile distribution process based on a tile rendering TBR architecture, the apparatus comprising:
the front-end module is used for determining the load level corresponding to each of the plurality of image blocks; the load level is used to characterize the number of primitives present in the tile;
The front-end module is used for transmitting the load level corresponding to each block to the back-end module of the TBR framework;
a back-end module, configured to determine, for each of the tiles, a target processor core corresponding to the tile from the at least two processor cores based on a status indicator corresponding to each of the processor cores in a set of status indicators corresponding to the tile;
the arrangement sequence of the state indicators of the state indicator group corresponding to the image block is related to the load level of the image block.
In yet another aspect, embodiments of the present application provide a computer device including a memory and a processor, the memory storing a computer program executable on the processor, the processor implementing some or all of the steps of the above method when the program is executed.
In yet another aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, performs some or all of the steps of the above-described method.
In the embodiment of the present application, the load level of each tile is counted during the distribution process of the graphics processor based on the TBR architecture, and the load level of each tile is transferred to the back-end module, so that the target processor core for processing the current tile is determined among the at least two processor cores based on the load level. In this way, compared with the related-art schemes that use the positions of the tiles or the number of tiles as the distribution basis, a targeted tile distribution process can be realized, which balances the load of each processor core in the graphics processor. Meanwhile, in the process of determining the target processor core of the current tile based on the load level, the arrangement order of the state indicators of the state indicator set corresponding to the tile is related to the load level of the tile, which ensures that the probability of each processor core being invoked is approximately the same, further improving the load balancing capability and enhancing the rendering performance of the graphics processor as a whole.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the aspects of the present application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the technical aspects of the application.
FIG. 1 is a schematic flow diagram of an exemplary TBR pipeline provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of an implementation of a tile distributing method according to an embodiment of the present application;
fig. 3 is a second implementation flow chart of a block distribution method according to an embodiment of the present application;
fig. 4 is a third implementation flow chart of a tile distributing method according to an embodiment of the present application;
fig. 5A is a fourth schematic implementation flowchart of a tile distribution method according to an embodiment of the present application;
fig. 5B is a fifth schematic implementation flowchart of a tile distribution method according to an embodiment of the present application;
fig. 6 is a sixth schematic implementation flowchart of a tile distribution method according to an embodiment of the present application;
fig. 7 is a seventh schematic implementation flowchart of a tile distribution method according to an embodiment of the present application;
Fig. 8 is a schematic diagram of a conventional TBR architecture according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a block distribution process in the related art according to an embodiment of the present application;
fig. 10 is a schematic diagram of an actual rendering scene according to an embodiment of the present application;
FIG. 11 is a schematic diagram illustrating execution time of each processor core in an actual rendering scene according to an embodiment of the present disclosure;
FIG. 12 is a schematic diagram of primitive coverage provided in an embodiment of the present application;
fig. 13 is a schematic diagram of load interval division provided in the embodiment of the present application;
FIG. 14 is a block diagram of a state machine provided in an embodiment of the present application;
FIG. 15 is a block distribution process schematic provided in an embodiment of the present application;
fig. 16 is a schematic structural diagram of a block distribution device according to an embodiment of the present disclosure;
fig. 17 is a schematic hardware entity diagram of a computer device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application are further elaborated below in conjunction with the accompanying drawings and examples, which should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making inventive efforts are within the scope of protection of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict. The term "first/second/third" is merely to distinguish similar objects and does not represent a specific ordering of objects, it being understood that the "first/second/third" may be interchanged with a specific order or sequence, as permitted, to enable embodiments of the present application described herein to be practiced otherwise than as illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing the present application only and is not intended to be limiting of the present application.
Tile-based rendering is the process of subdividing a computer graphics image with a regular grid in image space and rendering each part of the grid, or tile, separately. The advantage of this design is that it reduces the consumption of memory and bandwidth compared with an immediate-mode rendering system that draws the entire frame at once. This makes tile rendering systems popular in low-power hardware devices. Tile-based rendering is sometimes also referred to as a sort-middle architecture because it performs geometry sorting in the middle of the graphics pipeline rather than near the end. TBR is the most common architecture for mobile GPUs and has significant advantages in reducing power consumption.
A typical TBR pipeline flow is shown in FIG. 1, and is divided into a front end module 110 and a back end module 120, wherein the front end module 110 includes a vertex processing module 111, a graphics processing module 112, and a blocking (Tiling) module 113; the back-end module 120 includes a rasterization module 121, a hidden surface removal (Hidden Surface Removal, HSR) module 122, a pixel shading module 123, and an output merging module 124.
The front-end module 110 may perform vertex and primitive transformation (vertex processing) and graphics processing (including clipping/culling, etc.), then complete screen division in the tiling stage, record the graphics data covering each tile, and write the generated information into the system memory 130. In this way, the system memory 130 may store tile information (PrimitiveList) and vertex information (Vertex Data). The PrimitiveList is a fixed-length array whose length equals the number of tiles; each element of the array is a linked list storing pointers for all the triangles that intersect the current tile, where the pointers point into the Vertex Data. The Vertex Data holds vertex positions and vertex attribute data.
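To make this layout concrete, the following is a minimal C++ sketch of how a per-tile primitive list and the vertex data it references could be organized in system memory; the type names, field layout, and attribute sizes are illustrative assumptions and do not reflect the actual hardware format.

```cpp
#include <cstdint>
#include <vector>

// Vertex position and attribute data written by the front end (illustrative layout).
struct Vertex {
    float position[4];    // clip-space position
    float attributes[8];  // e.g. color, texture coordinates
};

// One node of the linked list held by a tile's PrimitiveList entry:
// it references the three vertices of a triangle that covers the tile.
struct PrimitiveNode {
    uint32_t vertexIndex[3];  // indices into the Vertex Data buffer
    PrimitiveNode* next;      // next primitive covering the same tile
};

// PrimitiveList: a fixed-length array with one entry per tile on the screen;
// each entry is the head of a linked list of primitives covering that tile.
struct PrimitiveList {
    std::vector<PrimitiveNode*> tileHeads;  // length == number of tiles
};

// Vertex Data: the vertex buffer that the primitive nodes point into.
using VertexData = std::vector<Vertex>;
```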
The back-end module 120 mainly performs rasterization (Raster), depth Test, pixel shading, etc., and finally outputs to a Render target (Render target). For each tile, because the data size is not large, depth data (depth), texture data (texture) or color data (color) required by the tile can be loaded into a static random-Access Memory (SRAM) of the GPU, namely, an on-chip Memory 140 in the figure; for example, the hidden surface removal module 122 may store depth data (depth) into a depth buffer (depth buffer) in the on-chip memory 140, the pixel shading module 123 may store texture data (texture) into a texture buffer in the on-chip memory 140, and the output merging module 124 may store color data (color) into a color buffer (color buffer) in the on-chip memory 140.
In the rendering process, the rendering object (picture) is divided into a plurality of tiles (tile), so that the on-chip memory 140 can hold all the data of the tile. After at least one drawing instruction reaches the GPU, the front end module 110 processes each drawing instruction in turn, and stores the corresponding tile information and vertex information into the system memory 130 until the data stored in the system memory 130 reaches a preset threshold or at least one drawing instruction is processed. The back-end module 120 will read the corresponding vertex information from the system memory 130 in units of tiles and perform subsequent processing. In this way, since the access of the back-end module 120 to the system memory 130 is changed to the access of the back-end module 120 to the on-chip memory 140, the rendering efficiency can be improved.
For a GPU with a TBR architecture, general-purpose rendering cores are typically used to perform the processing of the fragment shading stage. Specifically, each general-purpose rendering core is responsible for the fragment shading task of one small block (tile) of the screen. Since each tile has a corresponding primitive list that records which primitives in the frame cover the area of the tile, the size of the primitive list corresponding to a tile determines the workload of that tile's rendering task. However, within a complete picture, the sizes of the primitive lists corresponding to different tiles differ, so the workload among the general-purpose rendering cores is unbalanced.
In view of this, the present embodiments provide a tile distribution method that may be performed by a processor of a computer device. The computer device may be a device with data processing capability, such as a server, a notebook computer, a tablet computer, a desktop computer, a smart television, a set-top box, a mobile device (e.g., a mobile phone, a portable video player, a personal digital assistant, a dedicated messaging device, and a portable game device).
Fig. 2 is a schematic flowchart of an implementation flow of a tile distributing method according to an embodiment of the present application, where the method may be executed by a processor of a computer device, and will be described with reference to the steps shown in fig. 2.
Step S201, determining a load level corresponding to each of a plurality of image blocks through a front-end module of the TBR architecture; the load level is used to characterize the number of primitives present in the tile.
In some embodiments, the front-end module of the TBR architecture may include the vertex processing module 111, the graphics processing module 112, and the blocking module 113 in fig. 1. The plurality of tiles are obtained by the front-end module dividing the screen into blocks; generally, all tiles in the screen have the same tile range, and the tile size must satisfy the storage capacity of the on-chip memory.
In some embodiments, for each tile, the load level corresponding to that tile is used to characterize the number of primitives present in that tile. The front-end module can determine the position of each primitive in the screen; meanwhile, after the division is completed, the tile range corresponding to each tile can be obtained, so the number of primitives present in each tile can be determined, and the load level corresponding to the tile can be obtained based on the number of primitives corresponding to the tile.
In some embodiments, the number of primitives falling into a tile may be directly used as the load level, for example, if 2 primitives fall into a first tile, and 5 primitives fall into a second tile, then the load level of the first tile may be directly set to 2, the load level of the second tile may be set to 5, and so on. In other embodiments, a number interval may be set for each load level, and a load level corresponding to the number interval to which the number of primitives falling into the tile belongs is taken as the load level of the tile.
Step S202, transmitting the load level corresponding to each block to the back-end module of the TBR architecture.
In some embodiments, the load level corresponding to each block may be stored in the system memory by the front-end module of the TBR architecture, and then the back-end module of the TBR architecture may obtain the load level corresponding to each block from the system memory.
Step S203, for each block, determining, by a back-end module of the TBR architecture, a target processor core corresponding to the block from the at least two processor cores based on a status indicator corresponding to each of the processor cores in the set of status indicators corresponding to the block.
In some embodiments, for each of the tiles, the set of state indicators corresponding to the tile includes a state indicator corresponding to each processor core, and the value of the state indicator corresponding to each processor core is used to characterize the operating state of that processor core. By way of example, the operating state of a processor core may include an idle state and a busy state. In determining the target processor core from the at least two processor cores, a processor core in the idle state may be selected as the target processor core based on the operating states of the respective processor cores.
In other embodiments, for each of the tiles, a fixed order of the plurality of status indicators exists within the set of status indicators corresponding to the tile. In determining the target processor core from the at least two processor cores, each state indicator may be traversed sequentially according to the fixed arrangement sequence of the plurality of state indicators, and the processor core with the first running state being the idle state is taken as the target processor core.
In some embodiments, the target processor core is configured to process a rendering task corresponding to the tile.
The arrangement sequence of the state indicators of the state indicator group corresponding to the image block is related to the load level of the image block.
In some embodiments, the status indicators corresponding to different load levels are arranged in different orders for the status indicators corresponding to the respective processor cores in the status indicator set. For example, referring to table 1, an arrangement sequence of status indicators corresponding to a plurality of status indicator sets is shown.
TABLE 1
Load level 1: processor core 1, processor core 4, processor core 3, processor core 2
Load level 2: processor core 2, processor core 1, processor core 4, processor core 3
Load level 3: processor core 3, processor core 2, processor core 1, processor core 4
Load level 4: processor core 4, processor core 3, processor core 2, processor core 1
The arrangement order of the processor cores in the state indicator set corresponding to the load level 1 is "1432", the arrangement order of the processor cores in the state indicator set corresponding to the load level 2 is "2143", the arrangement order of the processor cores in the state indicator set corresponding to the load level 3 is "3214", and the arrangement order of the processor cores in the state indicator set corresponding to the load level 4 is "4321". It can be seen that the order of the status indicators corresponding to the respective processor cores in the set of status indicators for the respective load levels is different.
In other embodiments, the status indicators corresponding to different load levels may be arranged in the same or different order. For example, referring to table 2, an arrangement sequence of status indicators corresponding to a plurality of status indicator sets is shown.
TABLE 2
Load level 1: processor core 1, processor core 2
Load level 2: processor core 2, processor core 1
Load level 3: processor core 1, processor core 2
Load level 4: processor core 2, processor core 1
The arrangement sequence of each processor core in the state indicator set corresponding to the load level 1 and the load level 3 is 12, and the arrangement sequence of each processor core in the state indicator set corresponding to the load level 2 and the load level 4 is 21. It can be seen that the order of the status indicators corresponding to the processor cores in the status indicator group for each load level may be the same or different.
In some embodiments, the method further comprises: acquiring a state machine based on the load level corresponding to each block by a back-end module of the TBR architecture; the state machine comprises a state indicator set corresponding to each load level.
The back-end module is preset with a plurality of state machines, and the quantity of the load levels corresponding to each state machine is different. After the back-end module of the TBR architecture determines the load level corresponding to each tile, a state machine corresponding to the current load level number may be obtained from the preset plurality of state machines based on the load level number.
For example, there may be a first state machine, a second state machine, and a third state machine; the first state machine corresponds to 2 load levels, the second state machine corresponds to 4 load levels, and the third state machine corresponds to 8 load levels. In case 8 tiles are obtained whose corresponding load levels are (2, 3, 1, 2, 4, 2, 3, 4), it can be seen that the number of distinct load levels among the tiles is 4; therefore the second state machine, that is, the 4 load levels and the set of state indicators corresponding to each of those 4 load levels, can be selected.
It should be noted that, in different state machines, the set of state indicators corresponding to each load level may be the same or different. That is, the set of state indicators corresponding to load level 1 in the first state machine, the set of state indicators corresponding to load level 1 in the second state machine, and the set of state indicators corresponding to load level 1 in the third state machine may be the same or different.
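A minimal C++ sketch of such a state machine is given below, using the four-level arrangement orders of Table 1; the data layout and the zero-based core indices are assumptions for illustration. Selecting among the preset state machines (for example, for 2, 4 or 8 load levels) would simply pick the one whose level count matches the number of distinct load levels observed.

```cpp
#include <cstdint>
#include <vector>

// Per-load-level set of status indicators: a traversal order over the
// processor cores plus one indicator value per core.
struct StatusIndicatorSet {
    std::vector<uint32_t> coreOrder;  // traversal order of processor core ids
    std::vector<bool> busy;           // indexed by core id; false = first value (distributable)
};

struct StateMachine {
    std::vector<StatusIndicatorSet> perLevel;  // one indicator set per load level
};

// Example: a 4-level state machine with the arrangement orders of Table 1.
// numCores is expected to be 4 here, matching the core ids used in the orders.
StateMachine MakeFourLevelStateMachine(uint32_t numCores) {
    StateMachine sm;
    const std::vector<std::vector<uint32_t>> orders = {
        {0, 3, 2, 1},  // load level 1: cores 1, 4, 3, 2
        {1, 0, 3, 2},  // load level 2: cores 2, 1, 4, 3
        {2, 1, 0, 3},  // load level 3: cores 3, 2, 1, 4
        {3, 2, 1, 0},  // load level 4: cores 4, 3, 2, 1
    };
    for (const auto& order : orders) {
        sm.perLevel.push_back(StatusIndicatorSet{order, std::vector<bool>(numCores, false)});
    }
    return sm;
}
```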
In the embodiment of the present application, the load level of each tile is counted during the distribution process of the graphics processor based on the TBR architecture, and the load level of each tile is transferred to the back-end module, so that the target processor core for processing the current tile is determined among the at least two processor cores based on the load level. In this way, compared with the related-art schemes that use the positions of the tiles or the number of tiles as the distribution basis, a targeted tile distribution process can be realized, which balances the load of each processor core in the graphics processor. Meanwhile, in the process of determining the target processor core of the current tile based on the load level, the arrangement order of the state indicators of the state indicator set corresponding to the tile is related to the load level of the tile, which ensures that the probability of each processor core being invoked is approximately the same, further improving the load balancing capability and enhancing the rendering performance of the graphics processor as a whole.
In some embodiments, the ordering of the status indicators includes a bit order of a corresponding status indicator for each of the processor cores. For each bit order, the number of the processor cores in the processor core set corresponding to the bit order is the same, and the processor core set corresponding to the bit order includes the processor cores corresponding to the bit order in the state indicator set corresponding to each load level.
For example, referring to Table 2, the arrangement order of the processor cores in the state indicator sets corresponding to load level 1 and load level 3 is "12", and the arrangement order in the state indicator sets corresponding to load level 2 and load level 4 is "21"; that is, some state indicator sets share the same arrangement order. However, for each of the two bit orders (bit order "1" and bit order "2"), the set of processor cores appearing at that bit order across the state indicator sets of all load levels is {processor core 1, processor core 2}, and each processor core appears the same number of times (twice). In this way, in the process of executing step S203, the probability that each processor core is invoked is the same, which improves the load balancing capability of the graphics processor to some extent.
Fig. 3 is a second flowchart of an alternative tile distribution method provided in an embodiment of the present application, which may be performed by a processor of a computer device. Based on fig. 2, step S201 in fig. 2 may be updated to S301 to S302, and will be described in connection with the steps shown in fig. 3.
Step S301, for each tile, determining, by a front-end module of the TBR architecture, the number of primitives falling into the tile range based on the positions of the primitives and the tile range of the tile.
In some embodiments, after the front-end module processes the geometric data to obtain the corresponding primitive data, the position of each primitive can be obtained; meanwhile, after the front-end module performs the tiling processing, the tile range corresponding to each tile can be obtained. Then, for each tile, based on the position of each primitive and the tile range of that tile, the number of primitives that fall within the tile range may be obtained. Here, the position of a primitive is expressed in the form of its three edge equations.
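As an illustration of this counting step, the following C++ sketch uses a conservative overlap test between a tile rectangle and a triangle's three edge equations (each edge equation is evaluated at the tile corner that maximizes it); the structures and the exact test are simplified assumptions rather than the hardware implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Edge equation of a triangle edge: E(x, y) = a*x + b*y + c, with E >= 0 on
// the interior side for all three edges.
struct EdgeEquation { float a, b, c; };

struct Primitive { EdgeEquation edges[3]; };

struct TileRange { float x0, y0, x1, y1; };  // tile rectangle in screen space

// Conservative test: the tile is rejected only if it lies entirely on the
// outside of some edge, checked at the corner that maximizes that edge.
static bool TileOverlapsPrimitive(const TileRange& t, const Primitive& p) {
    for (const EdgeEquation& e : p.edges) {
        const float x = (e.a >= 0.0f) ? t.x1 : t.x0;
        const float y = (e.b >= 0.0f) ? t.y1 : t.y0;
        if (e.a * x + e.b * y + e.c < 0.0f) return false;  // fully outside this edge
    }
    return true;
}

// Count, for every tile, how many primitives fall within its tile range.
std::vector<uint32_t> CountPrimitivesPerTile(const std::vector<TileRange>& tiles,
                                             const std::vector<Primitive>& prims) {
    std::vector<uint32_t> counts(tiles.size(), 0);
    for (std::size_t i = 0; i < tiles.size(); ++i)
        for (const Primitive& p : prims)
            if (TileOverlapsPrimitive(tiles[i], p)) ++counts[i];
    return counts;
}
```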
Step S302, determining a load level corresponding to each tile based on the number of primitives corresponding to each tile.
In some embodiments, the number of primitives falling into a tile may be directly taken as the load level of the tile.
In some embodiments, the determining the load level corresponding to each tile based on the number of primitives corresponding to each tile may be implemented through steps S3021 to S3022.
Step S3021, obtaining a plurality of preset levels and a number interval corresponding to each preset level.
In some embodiments, the number of levels of the plurality of preset levels is fixedly set. In other embodiments, the number of levels of the plurality of preset levels is dynamically changed and is related to the rendering condition parameters of the current rendering environment, please refer to the implementation process provided in the embodiment of fig. 4.
Step S3022, for each of the tiles, using a preset level corresponding to a number interval in which the number of primitives corresponding to the tile falls as the load level corresponding to the tile.
Illustratively, the plurality of preset levels obtained include a first preset level and a second preset level, where the number interval corresponding to the first preset level is [0, 4] and the number interval corresponding to the second preset level is (4, +∞). In the case that 2 primitives fall into the first tile and 5 primitives fall into the second tile, the load level of the first tile may be directly set to the first preset level and the load level of the second tile may be set to the second preset level.
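A minimal C++ sketch of this interval lookup is shown below, assuming the number intervals are described by their upper bounds and the last level is open-ended; the representation is an assumption for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Upper bounds of the number intervals, one per preset level except the last,
// which is open-ended. For the two-level example above, {4} means level 1
// covers [0, 4] and level 2 covers (4, +inf).
using LevelBounds = std::vector<uint32_t>;

// Map a tile's primitive count to its load level (1-based).
uint32_t LoadLevelFor(uint32_t primitiveCount, const LevelBounds& upperBounds) {
    for (std::size_t i = 0; i < upperBounds.size(); ++i)
        if (primitiveCount <= upperBounds[i]) return static_cast<uint32_t>(i) + 1;
    return static_cast<uint32_t>(upperBounds.size()) + 1;  // open-ended last level
}

// With bounds {4}: a tile containing 2 primitives gets load level 1,
// and a tile containing 5 primitives gets load level 2, as in the example above.
```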
In the embodiment of the application, after the number of the primitives in each tile is obtained, that is, after the rendering workload borne by each tile is determined, the load condition of each tile is classified based on the number of the primitives, so that the workload condition of each tile can be considered in the process of subsequently distributing the tiles to the processor core, and the load balancing capability of the graphics processor is improved.
Fig. 4 is a flowchart third alternative block distribution method provided in an embodiment of the present application, which may be performed by a processor of a computer device. Based on fig. 3, step S3021 in fig. 3 may be updated to S401 to S403, and will be described in connection with the steps shown in fig. 4.
Step S401, obtaining a rendering condition parameter of the current rendering environment.
In some embodiments, the rendering condition parameters include at least one of: a hardware parameter for characterizing a hardware performance of the graphics processor, and a rendering target parameter for characterizing a computation amount of a rendering object.
In some embodiments, the hardware parameters include at least one of: the number of processor cores and the read-write speed of the memory.
In some embodiments, the rendering target parameters include at least one of: the size of the rendered object and the number of tiles.
Step S402, determining the number of the plurality of preset levels based on the rendering condition parameters.
In some embodiments, the better the hardware performance of the graphics processor characterized by the hardware parameters, the greater the number of the plurality of preset levels; the worse the hardware performance characterized by the hardware parameters, the smaller the number of the plurality of preset levels.
The greater the number of processor cores, the better the hardware performance of the graphics processor; the faster the memory read-write speed, the better the hardware performance of the graphics processor, and correspondingly, the greater the number of the plurality of preset levels. Compared with a smaller number of levels, the increased number of preset levels brings a certain hardware load, but because the hardware performance of the graphics processor is better, the granularity of tile load division can be improved without affecting other rendering tasks, so that tiles can be distributed to the processor cores of the graphics processor more evenly.
In some embodiments, the larger the computation amount of the rendering object characterized by the rendering target parameters, the greater the number of the plurality of preset levels; the smaller the computation amount of the rendering object, the smaller the number of the plurality of preset levels.
The greater the number of tiles, the larger the computation amount of the rendering object; the larger the size of the rendering object, the larger its computation amount, and correspondingly, the greater the number of the plurality of preset levels. Compared with a scheme using fewer levels, this avoids the situation in which, when the overall computation amount of the rendering object is large, a small number of levels cannot effectively distinguish a large number of tiles/primitives and the load therefore cannot be balanced; that is, this embodiment improves the granularity of tile load division, so that tiles can be distributed to the processor cores of the graphics processor more evenly.
Step S403, based on the number of the preset levels, acquiring the preset levels and a number interval corresponding to each preset level.
In some embodiments, the number of preset levels may be 2 to the power of n, where n is a positive integer. By way of example, the number of preset levels may be 2, 4, 8, and so on.
A number interval set may be preset for each possible number of preset levels, the set including the number interval corresponding to each preset level. For example, for the case that the number of preset levels is "2", a first preset level and a second preset level corresponding to the number "2" may be preset, together with the first number interval and the second number interval corresponding to them. For the case that the number of preset levels is "4", a first preset level and a corresponding first number interval, a second preset level and a corresponding second number interval, a third preset level and a corresponding third number interval, and a fourth preset level and a corresponding fourth number interval corresponding to the number "4" may be preset.
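The patent does not specify a concrete formula for this step, so the following C++ sketch only illustrates the idea with assumed thresholds: stronger hardware and heavier rendering targets select a larger (power-of-two) number of preset levels.

```cpp
#include <cstdint>

// Illustrative only; every threshold below is an assumption.
struct RenderingConditions {
    uint32_t numProcessorCores;    // hardware parameter
    uint32_t memoryBandwidthMBps;  // hardware parameter (memory read-write speed)
    uint32_t numTiles;             // rendering target parameter
    uint64_t renderTargetPixels;   // rendering target parameter (size of rendered object)
};

uint32_t ChooseNumPresetLevels(const RenderingConditions& c) {
    uint32_t levels = 2;
    if (c.numProcessorCores >= 4 && c.numTiles >= 64) levels = 4;
    if (c.numProcessorCores >= 8 && c.memoryBandwidthMBps >= 16000 &&
        c.renderTargetPixels >= 1920ull * 1080ull) levels = 8;
    return levels;  // a power of two, e.g. 2, 4 or 8
}
```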
In the embodiment of the present application, the number of preset levels is determined by acquiring the rendering condition parameters of the current rendering environment and combining the hardware parameters and the rendering target parameters, so that the number of load levels changes dynamically. This realizes adaptive adjustment of the granularity of load level division, allows a balance to be struck between load balancing capability and rendering speed, and improves rendering efficiency as a whole.
Fig. 5A is a flowchart illustrating an alternative method for distributing tiles according to an embodiment of the present application, which may be performed by a processor of a computer device. Based on fig. 2, S202 in fig. 2 may be updated to S501 to S502, and the steps shown in fig. 5A will be described.
Step S501, in the process that the front-end module of the TBR architecture writes the tile information of each tile into the system memory, writes the load level corresponding to each tile into the tile header information of the corresponding tile information.
Step S502, for each block, in response to a rendering event for the block, reads, by a back-end module of the TBR architecture, block header information of block information corresponding to the block from the system memory, and obtains a load level corresponding to the block from the block header information.
In some embodiments, please refer to fig. 5B, which illustrates an optional fifth flowchart of the tile distribution method provided in the embodiments of the present application. Based on fig. 5A, before step S501, the method may further include step S503; accordingly, steps S501 to S502 may be updated to S504 to S505, which will be described in connection with the steps illustrated in fig. 5B.
Step S503, encoding, by a front end module of the TBR architecture, a load level corresponding to each tile, to obtain an encoded value of at least one bit.
In some embodiments, the load level may be binary coded to obtain the coded value of the at least one bit. For example, in the case that the load level includes 1 and 2, after encoding the load level, a total of 2 encoding values of 00 and 01 can be obtained respectively; in the case where the load levels include 1, 2, 3, and 4, after encoding the load levels, four encoded values of 00, 01, 10, 11 can be obtained, respectively, and so on.
Step S504, in the process that the front end module of the TBR architecture writes the tile information of each tile into the system memory, writes the encoded value of at least one bit corresponding to each tile into the tile header information of the corresponding tile information.
Step S505, for each block, in response to a rendering event for the block, reads, by a back-end module of the TBR architecture, block header information of block information corresponding to the block from the system memory, and decodes an encoded value of at least one bit in the block header information, thereby obtaining a load level corresponding to the block.
In some embodiments, the decoding the encoded value of at least one bit in the tile header information to obtain the load level corresponding to the tile is an inverse process of encoding the load level to obtain the encoded value of at least one bit. Based on the above example, in the case where the encoded values are obtained to be 00, 01, after decoding the encoded values, the load level 1 and the load level 2 can be obtained, respectively; when the encoded values are 00, 01, 10, and 11, the encoded values are decoded, and then the load level 1, the load level 2, the load level 3, and the load level 4 are obtained, respectively.
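As an illustration of this encode/decode round trip, here is a minimal C++ sketch assuming four load levels and a 32-bit tile header word whose two lowest bits are the reserved bits; the header layout and bit positions are assumptions for illustration only.

```cpp
#include <cstdint>

constexpr uint32_t kReservedBitsMask = 0x3;  // two reserved bits (assumed at bits 0..1)

// Front end: encode load level 1..4 into the 2-bit values 00, 01, 10, 11.
uint32_t EncodeLoadLevel(uint32_t loadLevel) {
    return (loadLevel - 1) & kReservedBitsMask;
}

// Front end: write the encoded value into the reserved bits of the tile header.
uint32_t WriteLoadLevelToHeader(uint32_t tileHeader, uint32_t loadLevel) {
    return (tileHeader & ~kReservedBitsMask) | EncodeLoadLevel(loadLevel);
}

// Back end: read the reserved bits back and decode the load level.
uint32_t ReadLoadLevelFromHeader(uint32_t tileHeader) {
    return (tileHeader & kReservedBitsMask) + 1;
}
```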
In the embodiment of the application, the load grade is encoded, so that the transmission cost can be reduced as much as possible in the process of transferring the load grade to the back-end module, the transmission efficiency is improved, and the rendering efficiency is further improved.
FIG. 6 is a flowchart six of an alternative tile distribution method provided by an embodiment of the present application, which may be performed by a processor of a computer device. Based on any of the above embodiments, taking fig. 2 as an example, S203 in fig. 2 may be updated to S601 to S602, and the steps shown in fig. 6 will be described.
Step S601, traversing each status indicator according to the arrangement sequence of each status indicator corresponding to the tile through the back end module of the TBR architecture.
For example, referring to the arrangement order of the status indicators corresponding to the plurality of status indicator groups shown in table 1, in the case that the tile belongs to the load level 2, the processor cores corresponding to the 4 processor cores may be traversed sequentially according to the order of the processor core 2, the processor core 1, the processor core 4, and the processor core 3.
Wherein each status indicator may be configured to a first value that characterizes a processor core to which the status indicator corresponds as being in an idle state (distributable state); each status indicator may be configured to a second value that characterizes a busy state (non-distributable state) of the processor core to which the status indicator corresponds. In some embodiments, the initial state of each status indicator is configured to a first value.
In some embodiments, the first value may be set to 0 and the second value may be set to 1. The present application is not limited in this regard.
Step S602, taking the processor core corresponding to the first status indicator with the first value as the target processor core.
In some embodiments, the method further comprises steps S603 to S604.
Step S603, distributing the rendering task corresponding to the tile to the target processor core.
Step S604, in response to allocating the rendering task corresponding to the tile to the target processor core, updates a status indicator corresponding to the target processor core in the set of status indicators corresponding to the tile to the second value.
For example, referring to table 3, a state table of a set of state indicators is shown, the state table corresponding to table 1.
TABLE 3
(table image: current values of the status indicators of the four processor cores under each load level)
In the case that the current tile belongs to the load level 2, the processor cores corresponding to the 4 processor cores may be traversed sequentially in the order of the processor core 2, the processor core 1, the processor core 4, and the processor core 3. At this time, the first state indicator with the first value is the state indicator corresponding to the processor core 1, and then the rendering task corresponding to the current tile is allocated to the processor core 1. In response to assigning the rendering task corresponding to the current tile to the processor core 1, the status indicator corresponding to the processor core 1 corresponding to the tile is updated to a second value.
In some embodiments, the method further comprises step S605.
Step S605, in response to each of the status indicators in the set of status indicators corresponding to the tile being the second value, resets each of the status indicators in the set of status indicators corresponding to the tile to the first value.
In the case that the current tile belongs to the load level 3, the status indicators corresponding to the 4 processor cores may be traversed sequentially in the order of the processor core 3, the processor core 2, the processor core 1, and the processor core 4. At this time, the first state indicator with the first value is the state indicator corresponding to the processor core 4, and then the rendering task corresponding to the current tile is allocated to the processor core 4. In response to assigning the rendering task corresponding to the current tile to the processor core 4, the status indicator corresponding to the processor core 4 corresponding to the tile is updated to a second value.
At this time, a state table of a set of state indicators as shown in table 4 can be obtained.
TABLE 4
(table image: status indicator values after the indicator of processor core 4 at load level 3 is updated to the second value)
Since the status indicators corresponding to the 4 processor cores corresponding to the load level 3 are all the second value "1", the status indicators corresponding to the 4 processor cores corresponding to the load level 3 are reset to the first value "0", so as to obtain a status table of a status indicator set as shown in table 5.
TABLE 5
(table image: status indicator values after all indicators at load level 3 have been reset to the first value)
In the embodiment of the present application, this method of updating each state indicator in the state indicator set avoids the load imbalance that would be caused by continuously distributing tiles to the same processor core.
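Putting steps S601 to S605 together, the following C++ sketch shows one possible way to traverse a status indicator set, select the target core, mark its indicator, and reset the set once every indicator has been used; it reuses the simplified indicator representation from the earlier state machine sketch and is not the patent's actual hardware logic.

```cpp
#include <cstdint>
#include <vector>

struct StatusIndicatorSet {
    std::vector<uint32_t> coreOrder;  // traversal order of processor core ids
    std::vector<bool> busy;           // indexed by core id; false = first value
};

// Select the target core for one tile of the load level that owns this set.
uint32_t SelectTargetCore(StatusIndicatorSet& set) {
    for (uint32_t core : set.coreOrder) {
        if (!set.busy[core]) {          // S601/S602: first indicator with the first value
            set.busy[core] = true;      // S604: update to the second value after assignment
            bool allBusy = true;        // S605: reset once every indicator is the second value
            for (bool b : set.busy) allBusy = allBusy && b;
            if (allBusy) set.busy.assign(set.busy.size(), false);
            return core;                // S603: the rendering task of the tile goes here
        }
    }
    // Unreachable when the reset rule above is applied; return the first core defensively.
    return set.coreOrder.empty() ? 0 : set.coreOrder.front();
}
```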
Considering that, when the front-end module stores the tile information of each tile into the system memory, there are two reserved bits in the tile header information of each tile, please refer to fig. 7. Fig. 7 is a seventh optional schematic flowchart of the tile distribution method provided in the embodiment of the present application, and the method may be executed by a processor of the computer device. The steps shown in fig. 7 will be described.
Step S301, for each tile, determining, by a front-end module of the TBR architecture, the number of primitives falling into the tile range based on the positions of the primitives and the tile range of the tile.
Step S701, acquiring a first preset level and a corresponding first number of intervals, a second preset level and a corresponding second number of intervals, a third preset level and a corresponding third number of intervals, a fourth preset level and a corresponding fourth number of intervals.
Step S702, for each of the tiles, determining, based on the number of primitives corresponding to the tile, a target preset level from the first preset level, the second preset level, the third preset level, and the fourth preset level as a load level corresponding to the tile.
The target preset level is a preset level corresponding to a number interval in which the number of the primitives corresponding to the image block falls.
Step S703, encoding the load level corresponding to each block by a front-end module of the TBR architecture to obtain a two-bit encoded value.
Step S704, writing the two-bit coded value corresponding to each tile into the reserved bit in the tile header information of the tile information of each tile.
Step S705, for each block, reading, by a back-end module of the TBR architecture, block header information of block information of the block from the system memory, and decoding encoded values of two bits of reserved bits in the block header information to obtain a load level corresponding to the block.
Step S203, for each block, determining, by a back-end module of the TBR architecture, a target processor core corresponding to the block from the at least two processor cores based on a status indicator corresponding to each of the processor cores in the set of status indicators corresponding to the block.
Here, the above step S301 and step S203 correspond respectively to step S301 in the foregoing embodiment of fig. 3 and step S203 in the foregoing embodiment of fig. 2, and reference may be made to the specific implementations in the foregoing embodiments.
In this embodiment of the present application, since two reserved bits exist in the tile header information of the tile information of each tile in the process that the front end module stores the tile information of each tile into the system memory, the number of load levels is set to 4, and the load levels are encoded to obtain a 2-bit encoded value, and thus the reserved bits can be effectively utilized, and compared with the existing TBR architecture, the embodiment of the present application does not affect the read-write process of the system memory.
The following describes an application of the tile distribution method provided in the embodiments of the present application in a practical scenario, mainly involving a graphics processor including four processor cores. Of course, the embodiments of the present application do not limit the number of processor cores of the graphics processor, and the following embodiments are merely for clarity of description of the implementation of the present application.
As shown in fig. 8, under a conventional TBR architecture, a front end 810 generates primitive rendering data and writes it out to a memory 840, and a back end 820 splits the screen into tiles and distributes them to different graphics processor cores (GPU cores), each of which reads the primitive data of its tiles back from the memory. Whether the loads of the different GPU cores are balanced is closely related to the tile distribution policy: the tile distributor 830 should not only distribute the tiles to the different GPU cores as evenly as possible, but should also control the working time of each GPU core through tile distribution, so as to avoid GPU performance degradation caused by some GPU cores working for too long while others work for too short a time.
For tile distribution, existing designs generally divide the screen into tiles and then allocate a fixed area of the screen (containing a plurality of tiles) to each GPU core for processing; this in fact establishes a mapping relationship between the areas of the screen and the GPU cores and uses that mapping as the basis for tile distribution.
Referring to fig. 9, fig. 9 is a schematic diagram of a tile distribution process in the related art. First, the screen is split into tiles; the whole screen is split into 16 tiles t0 to t15. The tiles are then grouped as shown in fig. 9, with every 4 tiles forming a group, and the tiles in the same group are distributed to the same GPU core. For the distribution process shown in fig. 9, the result of the tile division is shown in table 6:
TABLE 6
GPU0: t0, t1, t2, t3
GPU1: t4, t5, t6, t7
GPU2: t8, t9, t10, t11
GPU3: t12, t13, t14, t15
The above tile distribution result ensures that each GPU core processes the same number of tiles, which balances the workload of the GPU cores in terms of tile count. However, this approach has a significant limitation: the distribution algorithm considers only spatial averaging and ignores time (or load) influencing factors.
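As a sketch of the fixed screen-partition policy of the related art, the mapping of table 6 can be expressed as a simple index calculation. The row-major tile numbering and the grouping of four consecutive tiles per core are assumptions consistent with fig. 9 and table 6 above.

// Related-art mapping: every four consecutive tiles are bound to one GPU core,
// independent of how many primitives each tile actually contains.
int fixed_core_for_tile(int tile_index) {
    // t0..t3 -> GPU0, t4..t7 -> GPU1, t8..t11 -> GPU2, t12..t15 -> GPU3
    return tile_index / 4;
}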
Taking the actual rendering scenario shown in fig. 10 as an example, fig. 10 shows the triangles that each tile needs to render, and the loads of different tiles differ: t0 to t3 carry larger loads, while the loads of t4 to t15 are relatively sparse. According to the above distribution algorithm, t0 to t3 are all fed into the same GPU core (i.e. GPU0), so the overall load of GPU0 is much larger than that of the other GPU cores, resulting in extremely unbalanced execution times among the GPU cores and a prominent performance problem.
Please refer to fig. 11, which illustrates the execution time of each processor core in an actual rendering scene. The execution time of GPU0 far exceeds that of GPU1, GPU2 and GPU3.
The embodiment of the present application improves the screen-based tile distribution algorithm of the related art and provides a distribution algorithm based on tile load. The algorithm introduces the calculation of a tile load factor and uses it as an influencing factor in the tile distribution processing to adjust the tile distribution policy, thereby improving the load balance among the plurality of GPU cores on the TBR architecture and improving the overall performance.
According to the embodiment of the present application, based on the TBR architecture, a load statistics module is added at the TBR front end and the existing tile header transmission mechanism is reused, so that load information is sent from the TBR front end to the TBR back end; the load information is then used in the tile distribution stage to distribute the tiles reasonably to each GPU core for execution.
In some embodiments, the front end counts the number of times each tile is covered by a primitive while performing the tiling. As shown in fig. 12, the three edge equations of each triangle are used to calculate which tiles the primitive covers, so that the loads of T0 to T8 can be counted: T0 has a load of 1 (covered by P3), T1 has a load of 2 (P0 and P3), T2 has a load of 2 (P0 and P3), T3 has a load of 2 (P0 and P3), T4 has a load of 4 (P0, P1, P2 and P3), T5 has a load of 2 (P1 and P2), T6 has a load of 1 (P0), T7 has a load of 3 (P0, P1 and P2), and T8 has a load of 2 (P1 and P2).
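For illustration only, a minimal sketch of such per-tile coverage counting follows. It uses a conservative edge-function test against the tile corners and iterates over the whole tile grid; real binning hardware would typically use exact fixed-point edge equations and restrict the search to the primitive's bounding box. All types and names are assumptions of this sketch.

#include <algorithm>
#include <cstdint>
#include <vector>

struct Vec2 { float x, y; };
struct Triangle { Vec2 v[3]; };

// Signed-area form of an edge function: > 0 means p lies to the left of a->b.
static float edge(const Vec2& a, const Vec2& b, const Vec2& p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Conservative test (assumes counter-clockwise winding): the tile is kept only
// if, for every edge, at least one tile corner lies on the interior side.
// False positives are possible; the count is therefore an upper bound.
static bool tile_may_cover(const Triangle& t, float x0, float y0, float x1, float y1) {
    const Vec2 corners[4] = {{x0, y0}, {x1, y0}, {x0, y1}, {x1, y1}};
    for (int e = 0; e < 3; ++e) {
        const Vec2& a = t.v[e];
        const Vec2& b = t.v[(e + 1) % 3];
        float best = -1e30f;
        for (const Vec2& c : corners) best = std::max(best, edge(a, b, c));
        if (best < 0.0f) return false;  // whole tile is outside this edge
    }
    return true;
}

// Accumulate per-tile primitive counts for a tiles_x * tiles_y grid.
std::vector<uint32_t> count_loads(const std::vector<Triangle>& prims,
                                  int tiles_x, int tiles_y, float tile_size) {
    std::vector<uint32_t> load(tiles_x * tiles_y, 0);
    for (const Triangle& tri : prims)
        for (int ty = 0; ty < tiles_y; ++ty)
            for (int tx = 0; tx < tiles_x; ++tx)
                if (tile_may_cover(tri, tx * tile_size, ty * tile_size,
                                   (tx + 1) * tile_size, (ty + 1) * tile_size))
                    ++load[ty * tiles_x + tx];
    return load;
}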
In some embodiments, it is not advisable to simply record the raw tile load value and transmit it to the back end: if the load value is large it occupies more bits, which not only increases the required storage space but also requires the memory read/write bandwidth to be expanded accordingly. To reduce unnecessary hardware overhead, the tile load is encoded after the last primitive of the tile has been counted.
By performing statistical tests on a large number of benchmarks, 4 load intervals (corresponding to the number intervals in the above embodiments) are extracted; referring to fig. 13, fig. 13 is a schematic diagram of the load interval division. The two cases where the load is smaller than or equal to threshold 1 and where the load is larger than threshold 3 are rare, and the loads of most tiles fall into the middle two intervals. Accordingly, the codes corresponding to the 4 load intervals are shown in table 7.
TABLE 7
Load <= threshold 1: code 00
Threshold 1 < load <= threshold 2: code 01
Threshold 2 < load <= threshold 3: code 10
Load > threshold 3: code 11
As shown in table 7, the encoded tile load occupies only 2 bits, so it can easily be inserted into the tile header information and written into the memory.
In some embodiments, the TBR backend reads each tile header from memory, decodes the workload of that tile, and performs allocation to different GPU cores based thereon.
Based on the above implementation scenario, a system of 4 GPU cores is still used as the test platform. A state machine is built in the tile distributor based on the tile work load (abbreviated as TWL). The state machine contains 4 state indicator groups of 4 bits each, where every bit takes the value 0 or 1. The 4 groups correspond to the 4 TWL code values (state indicator group 0: TWL_00, state indicator group 1: TWL_01, state indicator group 2: TWL_10, state indicator group 3: TWL_11), and the 4 bits correspond to the 4 cores in the system. The assignment of cores to bit positions is rearranged (swizzled) between the state indicator groups to ensure that the number of tiles sent to each core is as equal as possible. The structure of this state machine is shown in fig. 14.
After reading the header of a tile and parsing the tile's work load code value, the distributor finds the corresponding group according to the code value and traverses the indicator bit of each core from left to right; if the indicator of a core is 0, the tile is distributed to that core and the indicator is set to 1. When the indicators of all cores in a group have been set to 1, they are all reset to 0, ready for the next round of distribution.
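For illustration only, the distributor state machine can be modelled as follows. The per-group core orderings (the swizzle) are assumed to be simple rotations, which is consistent with the distribution walkthrough below; the embodiment only requires that each core appears equally often at each bit position across the groups.

#include <array>
#include <cstdint>

class TileDistributor {
public:
    // twl_code is the decoded 2-bit value from the tile header (0b00..0b11).
    // Returns the GPU core index chosen for this tile.
    int dispatch(uint8_t twl_code) {
        uint8_t& mask = mask_[twl_code];
        for (int pos = 0; pos < 4; ++pos) {
            if ((mask & (0b1000 >> pos)) == 0) {   // leftmost indicator still at 0
                mask |= (0b1000 >> pos);           // mark this position as used
                int core = order_[twl_code][pos];  // core assigned to this position
                if (mask == 0b1111) mask = 0;      // all used: reset for next round
                return core;
            }
        }
        return -1;  // unreachable: the mask is reset whenever it becomes full
    }

private:
    // Rotated core order per group: group0 = 0,1,2,3; group1 = 1,2,3,0; ...
    static constexpr int order_[4][4] = {
        {0, 1, 2, 3}, {1, 2, 3, 0}, {2, 3, 0, 1}, {3, 0, 1, 2}};
    std::array<uint8_t, 4> mask_{};                // all indicators start at 0
};

In use, the back end would decode the 2-bit TWL code from each tile header and call dispatch to obtain the target core, as in the t0 to t15 walkthrough that follows.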
Please refer to fig. 15, which illustrates a tile distribution process. The screen in fig. 15 includes 16 tiles (t0 to t15). If tile0 to tile3 are simply sent to core0, tile4 to tile7 to core1, tile8 to tile11 to core2, and tile12 to tile15 to core3, then the loads of core0 and core3 are too light and the loads of core1 and core2 are too heavy, resulting in a very uneven distribution of rendering tasks. With the tile distribution method provided by the above embodiment, the distribution process for t0 to t15 may be as follows:
T0: twl=twl_00, checking the indicator code (group0_core_mask) of the status indicator set 0, selecting core0 as the distribution object, setting group0_core_mask=1000b;
t1: twl=twl_01, checking the indicator code (group1_core_mask) of the status indicator set 1, selecting core1 as the distribution object, setting group1_core_mask=1000b;
t2: twl=twl_01, checking the indicator code (group1_core_mask) of the status indicator set 1, selecting core2 as the distribution object, setting group1_core_mask=1100b;
t3: twl=twl_01, checking the indicator code (group1_core_mask) of the status indicator set 1, selecting core3 as the distribution object, setting group1_core_mask=1110b;
t4: twl=twl_10, checking the indicator code (group2_core_mask) of the status indicator set 2, selecting core2 as the distribution object, setting group2_core_mask=1000b;
t5: twl=twl_11, checking the indicator code (group3_core_mask) of the status indicator set 3, selecting core3 as a distribution object, and setting group3_core_mask=1000b;
t6: twl=twl_11, checking the indicator code (group3_core_mask) of the status indicator set 3, selecting core0 as the distribution object, setting group3_core_mask=1100b;
t7: twl=twl_10, checking the indicator code (group2_core_mask) of the status indicator set 2, selecting core3 as the distribution object, setting group2_core_mask=1100b;
T8: twl=twl_10, checking the indicator code (group2_core_mask) of the status indicator set 2, selecting core0 as the distribution object, setting group2_core_mask=1110 b;
t9: twl=twl_10, checking the indicator code (group2_core_mask) of the status indicator set 2, selecting core1 as the distribution object, setting group2_core_mask=1111 b, resetting group2_core_mask=0000 b;
t10: twl=twl_11, checking the indicator code (group3_core_mask) of the status indicator set 3, selecting core1 as the distribution object, setting group3_core_mask=1110b;
t11: twl=twl_11, checking the indicator code (group3_core_mask) of the status indicator set 3, selecting core2 as the distribution object, setting group3_core_mask=1111b, resetting group3_core_mask=0000b;
t12: twl=twl_01, checking the indicator code (group1_core_mask) of the status indicator set 1, selecting core0 as the distribution object, setting group1_core_mask=1111 b, resetting group1_core_mask=0000 b;
t13: twl=twl_00, checking the indicator code (group0_core_mask) of the status indicator set 0, selecting core1 as the distribution object, setting group0_core_mask=1100b;
t14: twl=twl_01, checking the indicator code (group1_core_mask) of the status indicator set 1, selecting core1 as the distribution object, setting group1_core_mask=1000b;
T15: twl=twl_01, checking the indicator code (group1_core_mask) of the status indicator set 1, selecting core2 as the distribution object, and setting group1_core_mask=1100b.
In the end, the number of tiles assigned to each of the 4 cores is essentially the same, and comparison shows that the load difference among the 4 cores is small.
Practical application tests show that the larger the render target and/or the larger the number of tiles, the finer the tile load division can be, and increasing the number of TWL codes distributes the tiles more uniformly. However, more codes also bring a certain memory overhead and changes to the distributor scheduling; actual hardware can weigh these factors against memory read/write bandwidth, the number of cores, and the like.
The embodiment of the present application takes balancing the running time of all GPU cores as its core idea: a load counting function is added in the tiler stage to count and encode the TWL load of each tile in the frame, the encoded loads are sent to the back end, and the back end performs tile distribution based on this load information. The algorithm avoids the defect of the traditional algorithm that uses only the tile count as the distribution basis, and enables the tile distributor to identify the load of each tile at extremely low cost and distribute tiles in a targeted manner, so that the load of each GPU core is basically balanced and the rendering performance of the GPU is enhanced as a whole.
Compared with the related technical solutions, the embodiment of the present application achieves statistics and transmission of each tile's load at extremely low cost; at the same time, the load is introduced as an influencing factor in tile distribution, so that no GPU core works for too long or too short a time, which improves the utilization efficiency of the hardware; in addition, a new group-core distribution arrangement is provided, which further avoids the load imbalance caused by continuously distributing tiles to a certain core.
Based on the foregoing embodiments, the embodiments of the present application provide a tile distribution apparatus. The units included in the apparatus and the modules included in the units may be implemented by a processor in a computer device, and may of course also be implemented by specific logic circuits.
Fig. 16 is a schematic structural diagram of a tile distributing apparatus according to an embodiment of the present application, and as shown in fig. 16, a tile distributing apparatus 1600 includes: front end module 1610, back end module 1620, wherein:
a front-end module 1610, configured to determine a load level corresponding to each of a plurality of tiles; the load level is used to characterize the number of primitives present in the tile;
the front end module 1610 is configured to transmit a load level corresponding to each of the tiles to a back end module of the TBR architecture;
A back-end module 1620 configured to determine, for each of the tiles, a target processor core corresponding to the tile from the at least two processor cores based on a status indicator corresponding to each of the processor cores in a set of status indicators corresponding to the tile; the arrangement sequence of the state indicators of the state indicator group corresponding to the image block is related to the load level of the image block.
In some embodiments, the order of the status indicators includes a bit order of a corresponding status indicator for each of the processor cores; for each bit order, the number of the processor cores in the processor core set corresponding to the bit order is the same, and the processor core set corresponding to the bit order includes the processor cores corresponding to the bit order in the state indicator set corresponding to each load level.
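For illustration only, the following small check demonstrates the property stated above under the assumption that the four group orderings are rotations of (core0, core1, core2, core3): for every bit position, the cores found at that position across the four groups include each core exactly once.

#include <cassert>

int main() {
    // Assumed swizzled core order of each state indicator group (one per load level).
    const int order[4][4] = {{0, 1, 2, 3}, {1, 2, 3, 0}, {2, 3, 0, 1}, {3, 0, 1, 2}};
    for (int pos = 0; pos < 4; ++pos) {          // bit position within the mask
        int seen[4] = {0, 0, 0, 0};
        for (int group = 0; group < 4; ++group)  // one group per load level
            ++seen[order[group][pos]];
        for (int core = 0; core < 4; ++core)
            assert(seen[core] == 1);             // each core appears exactly once
    }
    return 0;
}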
In some embodiments, the front end module 1610 is further configured to: for each of the tiles, determining a number of primitives that fall within a tile range of the tile based on a location of each primitive and the tile range of the tile; and determining the load level corresponding to each block based on the number of the primitives corresponding to each block.
In some embodiments, the front end module 1610 is further configured to: acquire a plurality of preset levels and a number interval corresponding to each preset level; and, for each tile, take the preset level corresponding to the number interval into which the number of primitives corresponding to the tile falls as the load level corresponding to the tile.
In some embodiments, the front end module 1610 is further configured to: acquiring rendering condition parameters of a current rendering environment; the rendering condition parameters include at least one of: a hardware parameter for characterizing a hardware performance of the graphics processor and a rendering target parameter for characterizing a computation amount of a rendering object; determining the number of the plurality of preset levels based on the rendering condition parameters; and acquiring the preset grades and a number interval corresponding to each preset grade based on the number of the preset grades.
In some embodiments, the hardware parameters include at least one of: the number of processor cores and the read-write speed of the memory; the rendering target parameters include at least one of: the size of the rendered object and the number of tiles.
In some embodiments, the front end module 1610 is further configured to: writing the load level corresponding to each block into the block head information of the corresponding block information in the process of writing the block information of each block into the system memory; the back-end module 1620 is further configured to: for each block, in response to a rendering event for the block, block header information of block information corresponding to the block is read from the system memory, and a load level corresponding to the block is obtained from the block header information.
In some embodiments, the front end module 1610 is further configured to: encoding the load level corresponding to each image block to obtain an encoding value of at least one bit; writing the coded value of at least one bit corresponding to each block into block header information of corresponding block information; the back-end module 1620 is further configured to: and reading the block header information of the block information corresponding to the block from the system memory, and decoding the coded value of at least one bit in the block header information to obtain the load level corresponding to the block.
In some embodiments, the back-end module 1620 is further configured to: traversing each state indicator according to the arrangement sequence of each state indicator corresponding to the image block; and taking the processor core corresponding to the state indicator with the first value as the target processor core.
In some embodiments, the back-end module 1620 is further configured to: distributing rendering tasks corresponding to the tiles to the target processor cores; and in response to assigning the rendering task corresponding to the tile to the target processor core, updating a status indicator corresponding to the target processor core in the set of status indicators corresponding to the tile to the second value.
In some embodiments, the back-end module 1620 is further configured to: and resetting each state indicator in the state indicator set corresponding to the block to the first value in response to each state indicator in the state indicator set corresponding to the block being the second value.
In some embodiments, the back-end module 1620 is further configured to: acquiring a state machine based on the load level corresponding to each image block; the state machine comprises a state indicator set corresponding to each load level.
In some embodiments, the front end module 1610 is further configured to: acquire a first preset level and a corresponding first number interval, a second preset level and a corresponding second number interval, a third preset level and a corresponding third number interval, and a fourth preset level and a corresponding fourth number interval; and, for each tile, determine, from the first preset level, the second preset level, the third preset level and the fourth preset level, a target preset level as the load level corresponding to the tile based on the number of primitives corresponding to the tile; the target preset level is the preset level corresponding to the number interval into which the number of primitives corresponding to the tile falls.
In some embodiments, the front end module 1610 is further configured to: coding the load level corresponding to each image block to obtain a coding value of two bits; and writing the coded values of two bits corresponding to each block into reserved bits in block header information of block information of each block. The back-end module 1620 is further configured to read, for each of the tiles, tile header information of tile information of the tile from the system memory, and decode encoded values of two bits of reserved bits in the tile header information to obtain a load level corresponding to the tile.
The description of the apparatus embodiments above is similar to that of the method embodiments above, with similar advantageous effects as the method embodiments. In some embodiments, functions or modules included in the apparatus provided in the embodiments of the present application may be used to perform the methods described in the embodiments of the methods, and for technical details that are not disclosed in the embodiments of the apparatus of the present application, please refer to the description of the embodiments of the methods of the present application for understanding.
It should be noted that, in the embodiments of the present application, if the above tile distribution method is implemented in the form of a software functional module and sold or used as a separate product, it may also be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, in essence or in the part contributing to the related art, may be embodied in the form of a software product; the software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a U-disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, an optical disk, or other various media capable of storing program code. Thus, embodiments of the present application are not limited to any specific hardware, software, or firmware, or to any combination of hardware, software, and firmware.
The embodiment of the application provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor executes the program to realize part or all of the steps of the method.
Embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs some or all of the steps of the above-described method. The computer readable storage medium may be transitory or non-transitory.
Embodiments of the present application provide a computer program comprising computer readable code which, when run in a computer device, performs some or all of the steps for implementing the above method.
Embodiments of the present application provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program which, when read and executed by a computer, performs some or all of the steps of the above-described method. The computer program product may be realized in particular by means of hardware, software or a combination thereof. In some embodiments, the computer program product is embodied as a computer storage medium, in other embodiments the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
It should be noted here that: the above description of various embodiments is intended to emphasize the differences between the various embodiments, the same or similar features being referred to each other. The above description of apparatus, storage medium, computer program and computer program product embodiments is similar to that of method embodiments described above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the apparatus, storage medium, computer program and computer program product of the present application, please refer to the description of the method embodiments of the present application.
Fig. 17 is a schematic diagram of a hardware entity of a computer device according to an embodiment of the present application, as shown in fig. 17, a hardware entity of the computer device 1700 includes: a processor 1701 and a memory 1702, wherein the memory 1702 stores a computer program executable on the processor 1701, the processor 1701 implementing steps in a method of any of the embodiments described above when the program is executed.
The memory 1702 stores computer programs executable on the processor, the memory 1702 being configured to store instructions and applications executable by the processor 1701, and also to cache data (e.g., image data, audio data, voice communication data, and video communication data) to be processed or already processed by the respective modules in the processor 1701 and the computer device 1700, and may be implemented by a FLASH memory (FLASH) or a random access memory (Random Access Memory, RAM).
The processor 1701, when executing a program, performs the steps of the tile distribution method of any of the above. The processor 1701 generally controls the overall operation of the computer device 1700.
The present embodiments provide a computer storage medium storing one or more programs executable by one or more processors to implement the steps of the tile distribution method of any of the embodiments above.
It should be noted here that: the description of the storage medium and apparatus embodiments above is similar to that of the method embodiments described above, with similar benefits as the method embodiments. For technical details not disclosed in the embodiments of the storage medium and the apparatus of the present application, please refer to the description of the method embodiments of the present application for understanding.
The processor may be at least one of an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a digital signal processor (Digital Signal Processor, DSP), a digital signal processing device (Digital Signal Processing Device, DSPD), a programmable logic device (Programmable Logic Device, PLD), a field programmable gate array (Field Programmable Gate Array, FPGA), a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, and a microprocessor. It will be appreciated that the electronic device implementing the above-mentioned processor function may also be another device, and the embodiments of the present application are not specifically limited in this regard.
The computer storage medium/Memory may be a Read Only Memory (ROM), a programmable Read Only Memory (Programmable Read-Only Memory, PROM), an erasable programmable Read Only Memory (Erasable Programmable Read-Only Memory, EPROM), an electrically erasable programmable Read Only Memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), a magnetic random access Memory (Ferromagnetic Random Access Memory, FRAM), a Flash Memory (Flash Memory), a magnetic surface Memory, an optical disk, or a Read Only optical disk (Compact Disc Read-Only Memory, CD-ROM); but may also be various terminals such as mobile phones, computers, tablet devices, personal digital assistants, etc., that include one or any combination of the above-mentioned memories.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present application, the sequence numbers of the steps/processes described above do not imply an order of execution; the execution order of the steps/processes should be determined by their functions and internal logic, and the sequence numbers should not constitute any limitation on the implementation process of the embodiments of the present application. The foregoing embodiment numbers of the present application are merely for description and do not represent advantages or disadvantages of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are only illustrative; for example, the division of the units is only a logical function division, and there may be other divisions in actual implementation, such as: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units; can be located in one place or distributed to a plurality of network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units. Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware related to program instructions, and the foregoing program may be stored in a computer readable storage medium, where the program, when executed, performs steps including the above method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read Only Memory (ROM), a magnetic disk or an optical disk, or the like, which can store program codes.
Alternatively, the integrated units described above may be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the related art in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a removable storage device, a ROM, a magnetic disk, or an optical disk.
The foregoing is merely an embodiment of the present application, but the protection scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered in the protection scope of the present application.

Claims (17)

1. A tile distribution method, applied to a graphics processor comprising at least two processor cores, the graphics processor performing a tile distribution process based on a tile rendering TBR architecture, the method comprising:
Determining a load level corresponding to each of a plurality of image blocks through a front end module of the TBR architecture; the load level is used to characterize the number of primitives present in the tile;
transmitting the load level corresponding to each block to a back-end module of the TBR architecture;
for each of the tiles, determining, by a back-end module of the TBR architecture, a target processor core corresponding to the tile from the at least two processor cores based on a status indicator corresponding to each of the processor cores in a set of status indicators corresponding to the tile;
the arrangement sequence of the state indicators of the state indicator group corresponding to the image block is related to the load level of the image block.
2. The method of claim 1, wherein the order of the status indicators comprises a bit order of the status indicators corresponding to each of the processor cores; for each bit order, the number of the processor cores in the processor core set corresponding to the bit order is the same, and the processor core set corresponding to the bit order includes the processor cores corresponding to the bit order in the state indicator set corresponding to each load level.
3. The method of claim 1, wherein the determining, by the front-end module of the TBR architecture, a load level for each of a plurality of tiles comprises:
for each block, determining, by a front-end module of the TBR architecture, the number of primitives that fall within a block range of the block based on the locations of the primitives and the block range of the block;
and determining the load level corresponding to each block based on the number of the primitives corresponding to each block.
4. A method according to claim 3, wherein said determining a load level for each of said tiles based on a number of primitives for each of said tiles comprises:
acquiring a plurality of preset levels and a number interval corresponding to each preset level;
and for each of the tiles, taking a preset level corresponding to a number interval into which the number of primitives corresponding to the tile falls as the load level corresponding to the tile.
5. The method of claim 4, wherein the obtaining a plurality of preset levels and a number interval corresponding to each preset level comprises:
acquiring rendering condition parameters of a current rendering environment; the rendering condition parameters include at least one of: a hardware parameter for characterizing a hardware performance of the graphics processor and a rendering target parameter for characterizing a computation amount of a rendering object;
Determining the number of the plurality of preset levels based on the rendering condition parameters;
and acquiring the preset grades and a number interval corresponding to each preset grade based on the number of the preset grades.
6. The method of claim 5, wherein the hardware parameters include at least one of: the number of processor cores and the read-write speed of the memory; the rendering target parameters include at least one of: the size of the rendered object and the number of tiles.
7. The method of any one of claims 1 to 6, wherein said transferring the load level corresponding to each of the tiles to the backend module of the TBR architecture comprises:
writing the load level corresponding to each block into the block header information of the corresponding block information in the process that the front end module of the TBR architecture writes the block information of each block into the system memory;
for each block, in response to a rendering event for the block, block header information of block information corresponding to the block is read from the system memory through a back-end module of the TBR architecture, and a load level corresponding to the block is obtained from the block header information.
8. The method of claim 7, wherein the method further comprises: encoding the load level corresponding to each block by a front-end module of the TBR architecture to obtain an encoded value of at least one bit;
writing the load level corresponding to each block into the block header information of the corresponding block information, including: writing the coded value of at least one bit corresponding to each block into block header information of corresponding block information;
the reading, by the back-end module of the TBR architecture, tile header information of tile information corresponding to the tile from the system memory, and obtaining a load level corresponding to the tile from the tile header information, includes: and reading the block header information of the block information corresponding to the block from the system memory through a back-end module of the TBR architecture, and decoding the coding value of at least one bit in the block header information to obtain the load level corresponding to the block.
9. The method of any of claims 1 to 6, wherein the determining, by a backend module of the TBR architecture, a target processor core for the tile from the at least two processor cores based on a status indicator for each of the set of status indicators for the tile, comprises:
Traversing each state indicator through a back-end module of the TBR architecture according to the arrangement sequence of each state indicator corresponding to the block;
and taking the processor core corresponding to the state indicator with the first value as the target processor core.
10. The method according to claim 9, wherein the method further comprises:
distributing rendering tasks corresponding to the tiles to the target processor cores;
in response to assigning the rendering task corresponding to the tile to the target processor core, a status indicator corresponding to the target processor core in the set of status indicators corresponding to the tile is updated to a second value.
11. The method according to claim 10, wherein the method further comprises:
and resetting each state indicator in the state indicator set corresponding to the block to the first value in response to each state indicator in the state indicator set corresponding to the block being the second value.
12. The method according to any one of claims 1 to 6, further comprising:
acquiring a state machine based on the load level corresponding to each block by a back-end module of the TBR architecture; the state machine comprises a state indicator set corresponding to each load level.
13. A method according to claim 3, wherein said determining a load level for each of said tiles based on a number of primitives for each of said tiles comprises:
acquiring a first preset level and a corresponding first number interval, a second preset level and a corresponding second number interval, a third preset level and a corresponding third number interval, and a fourth preset level and a corresponding fourth number interval;
for each of the tiles, determining, from the first preset level, the second preset level, the third preset level and the fourth preset level, a target preset level as the load level corresponding to the tile based on the number of primitives corresponding to the tile; the target preset level is the preset level corresponding to the number interval into which the number of primitives corresponding to the tile falls.
14. The method of claim 13, wherein the transferring the load level corresponding to each tile to the back-end module of the TBR architecture comprises:
encoding, by a front-end module of the TBR architecture, the load level corresponding to each of the tiles to obtain a two-bit encoded value;
writing the two-bit encoded value corresponding to each of the tiles into reserved bits in the tile header information of the tile information of each of the tiles;
and for each of the tiles, reading, by a back-end module of the TBR architecture, the tile header information of the tile from a system memory, and decoding the two-bit encoded value in the reserved bits of the tile header information to obtain the load level corresponding to the tile.
15. A tile distribution apparatus for use with a graphics processor comprising at least two processor cores, the graphics processor performing a tile distribution process based on a tile rendering TBR architecture, the apparatus comprising:
the front-end module is used for determining the load level corresponding to each of the plurality of image blocks; the load level is used to characterize the number of primitives present in the tile;
the front-end module is used for transmitting the load level corresponding to each block to the back-end module of the TBR framework;
a back-end module, configured to determine, for each of the tiles, a target processor core corresponding to the tile from the at least two processor cores based on a status indicator corresponding to each of the processor cores in a set of status indicators corresponding to the tile;
the arrangement sequence of the state indicators of the state indicator group corresponding to the image block is related to the load level of the image block.
16. A computer device comprising a memory and a processor, the memory storing a computer program executable on the processor, characterized in that the processor implements the steps of the method of any of claims 1 to 14 when the program is executed.
17. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 14.
CN202310457192.8A 2023-04-25 2023-04-25 Method, device, equipment and storage medium for distributing image blocks Active CN116188244B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310457192.8A CN116188244B (en) 2023-04-25 2023-04-25 Method, device, equipment and storage medium for distributing image blocks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310457192.8A CN116188244B (en) 2023-04-25 2023-04-25 Method, device, equipment and storage medium for distributing image blocks

Publications (2)

Publication Number Publication Date
CN116188244A true CN116188244A (en) 2023-05-30
CN116188244B CN116188244B (en) 2023-07-25

Family

ID=86452553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310457192.8A Active CN116188244B (en) 2023-04-25 2023-04-25 Method, device, equipment and storage medium for distributing image blocks

Country Status (1)

Country Link
CN (1) CN116188244B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740248A (en) * 2023-08-08 2023-09-12 摩尔线程智能科技(北京)有限责任公司 Control method, chip and device, controller, equipment and medium for distributing image blocks

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180137677A1 (en) * 2016-11-17 2018-05-17 Samsung Electronics Co., Ltd. Tile-based rendering method and apparatus
US20180197271A1 (en) * 2017-01-12 2018-07-12 Imagination Technologies Limited Graphics processing units and methods using cost indications for sets of tiles of a rendering space
CN111062858A (en) * 2019-12-27 2020-04-24 西安芯瞳半导体技术有限公司 Efficient rendering-ahead method, device and computer storage medium
CN114463160A (en) * 2022-01-30 2022-05-10 摩尔线程智能科技(北京)有限责任公司 Parallel processing method and device for graphics pipeline and readable storage medium
CN115100022A (en) * 2022-08-23 2022-09-23 芯动微电子科技(珠海)有限公司 Graphic processing method and system

Also Published As

Publication number Publication date
CN116188244B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
US10223122B2 (en) Managing event count reports in a tile-based architecture
US10991152B2 (en) Adaptive shading in a graphics processing pipeline
US20230038653A1 (en) Allocation of primitives to primitive blocks
US10733794B2 (en) Adaptive shading in a graphics processing pipeline
EP2380138B1 (en) Multi level display control list in tile based 3d computer graphics system
US7928990B2 (en) Graphics processing unit with unified vertex cache and shader register file
US8072460B2 (en) System, method, and computer program product for generating a ray tracing data structure utilizing a parallel processor architecture
US9478002B2 (en) Vertex parameter data compression
EP3108452B1 (en) Shader pipeline with shared data channels
US20180373809A1 (en) Performing traversal stack compression
CN116188244B (en) Method, device, equipment and storage medium for distributing image blocks
CN111062858A (en) Efficient rendering-ahead method, device and computer storage medium
US9558573B2 (en) Optimizing triangle topology for path rendering
CN112801855B (en) Method and device for scheduling rendering task based on graphics primitive and storage medium
US11694367B2 (en) Compressing texture data on a per-channel basis
CN116740248A (en) Control method, chip and device, controller, equipment and medium for distributing image blocks
CN115049531A (en) Image rendering method and device, graphic processing equipment and storage medium
US20150228113A1 (en) Graphic data rendering method and apparatus, and recording medium
CN117252751B (en) Geometric processing method, device, equipment and storage medium
US11625225B2 (en) Applications of and techniques for quickly computing a modulo operation by a Mersenne or a Fermat number
US11875444B2 (en) Accelerated processing via a physically based rendering engine
Kim et al. A memory-efficient unified early z-test
CN116438577A (en) Method and apparatus for selecting rendering mode
CN116957899A (en) Graphics processor, system, apparatus, device, and method
CN116049032A (en) Data scheduling method, device and equipment based on ray tracing and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant