CN115168058B - Thread load balancing method, device, equipment and storage medium - Google Patents

Thread load balancing method, device, equipment and storage medium Download PDF

Info

Publication number
CN115168058B
CN115168058B CN202211082045.9A
Authority
CN
China
Prior art keywords
scheduler
load
schedulers
load data
task
Prior art date
Legal status
Active
Application number
CN202211082045.9A
Other languages
Chinese (zh)
Other versions
CN115168058A (en)
Inventor
陈玉龙
张坚
Current Assignee
Shenliu Micro Intelligent Technology Shenzhen Co ltd
Original Assignee
Shenliu Micro Intelligent Technology Shenzhen Co ltd
Priority date
Filing date
Publication date
Application filed by Shenliu Micro Intelligent Technology Shenzhen Co ltd filed Critical Shenliu Micro Intelligent Technology Shenzhen Co ltd
Priority to CN202211082045.9A priority Critical patent/CN115168058B/en
Publication of CN115168058A publication Critical patent/CN115168058A/en
Application granted granted Critical
Publication of CN115168058B publication Critical patent/CN115168058B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/60Memory management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00Indexing scheme for image data processing or generation, in general
    • G06T2200/28Indexing scheme for image data processing or generation, in general involving image processing hardware

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Multi Processors (AREA)
  • Image Generation (AREA)

Abstract

The invention relates to the field of computer technology and discloses a thread load balancing method, apparatus, device, and storage medium for balancing load distribution and improving the utilization of computing resources. The method comprises the following steps: constructing a global scheduler based on a preset graphics processor and a preset load balancing strategy; receiving drawing instructions through the global scheduler to generate first drawing tasks, calculating first load data of a plurality of first schedulers and a plurality of second schedulers, and performing step-by-step average distribution of the first drawing tasks and updating of the load data according to the first load data to obtain second load data; configuring the shader threads to be run by the computing units according to the second load data, and monitoring whether the computing units generate second drawing tasks; and if so, performing load distribution on the second drawing tasks through the first schedulers and the second schedulers according to a task distribution strategy to obtain a load distribution result.

Description

Thread load balancing method, device, equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for thread load balancing.
Background
A graphics processing unit (GPU) contains multiple physical computing units. When many vertices require the same shader program logic, the vertex data can be divided among threads that execute simultaneously on the computing units, achieving parallel computation and accelerating data processing. Modern GPU shaders are divided into vertex shaders, tessellation shaders, geometry shaders, compute shaders, fragment shaders, and so on. Their program logic is user-defined, so their time and space complexity is not fixed, and even for the same shader each computing unit may incur different execution times and cache occupancy depending on its input data. In addition, in a modern graphics pipeline a single primitive may generate a large amount of new primitive data in the tessellation, geometry, or rasterization stage, which poses a great challenge to the efficient utilization of computing resources.
Existing schemes have no mechanism for effectively describing and monitoring the actual load of each computing unit and therefore cannot take each unit's actual load into account, which easily leads to unbalanced load distribution and prevents computing resources from being used efficiently.
Disclosure of Invention
The invention provides a thread load balancing method, a thread load balancing device and a thread load balancing storage medium, which are used for realizing load distribution balancing and improving the utilization rate of computing resources.
The first aspect of the present invention provides a thread load balancing method, comprising: constructing a global scheduler based on a preset graphics processor and a preset load balancing policy, wherein the global scheduler comprises a plurality of first schedulers, each first scheduler comprises a plurality of second schedulers, and each second scheduler comprises a plurality of computing units; receiving a drawing instruction through the global scheduler, generating a first drawing task, and sending the first drawing task to the plurality of first schedulers and the plurality of second schedulers; calculating first load data of the first schedulers and the second schedulers, and performing step-by-step average distribution of the first drawing task and updating of the load data according to the first load data to obtain second load data; configuring the shader threads to be run by the computing units according to the second load data, running the shader threads on the computing units, and monitoring whether the computing units generate a second drawing task; and if so, performing load distribution on the second drawing task through the first schedulers and the second schedulers according to a preset task distribution policy to obtain a load distribution result.
Optionally, in a first implementation manner of the first aspect of the present invention, the calculating first load data of the multiple first schedulers and the multiple second schedulers, and performing step-by-step average allocation and load data update on the first drawing task according to the first load data to obtain second load data includes: calculating first load data of the plurality of first schedulers and the plurality of second schedulers, wherein the first load data is used for indicating original load data in the plurality of first schedulers and the plurality of second schedulers; obtaining a first scheduler number of the plurality of first schedulers, and obtaining a second scheduler number in each first scheduler; calculating load data of each first scheduler according to the first drawing task, the first load data and the number of the first schedulers, and calculating load data of each second scheduler according to the load data of each first scheduler and the number of the second schedulers in each first scheduler; second load data is generated from the load data of each first scheduler and the load data of each second scheduler.
Optionally, in a second implementation manner of the first aspect of the present invention, the performing load distribution on the second drawing task through the multiple first schedulers and the multiple second schedulers according to a preset task distribution policy to obtain a load distribution result includes: acquiring the target computing unit generating the second drawing task, and determining the target second scheduler to which the target computing unit belongs; acquiring computing-unit total load data of all computing units in the target second scheduler, and judging, according to the computing-unit total load data, whether the target second scheduler meets a first task allocation condition to obtain a first judgment result; and performing load distribution on the global scheduler according to the first judgment result and the second drawing task to obtain the load distribution result.
Optionally, in a third implementation manner of the first aspect of the present invention, the load distributing the global scheduler according to the first determination result and the second drawing task to obtain a load distribution result includes: when the first judgment result is that the first task allocation condition is not met, determining that the target second scheduler does not meet the first task allocation condition, and feeding the second drawing task back to a target first scheduler to which the target second scheduler belongs; acquiring total load data of second schedulers of all second schedulers in the target first scheduler, and judging whether the target first scheduler meets a second task allocation condition or not according to the total load data of the second schedulers; and if the target first scheduler does not meet a second task allocation condition, feeding the second drawing task back to the global scheduler, and performing task allocation on the second drawing task through the global scheduler to obtain a load allocation result.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the first task allocation condition includes: and judging whether the node load of the target second scheduler exceeds a target multiple corresponding to the node load average value of all other second schedulers in the same layer, and if so, determining that the load state of the target second scheduler is overload.
Optionally, in a fifth implementation manner of the first aspect of the present invention, the thread load balancing method further includes: when the plurality of computing units run the shader threads, calculating clock cycles and data volumes consumed in the running process of the shader threads through the second scheduler; counting the number of each shader thread operated by all computing units in the second scheduler, and respectively computing a clock period average value and a data volume average value corresponding to the clock period and the data volume; load data for each shader thread is generated based on the number of each shader thread, the average number of clock cycles, and the average number of data.
Optionally, in a sixth implementation manner of the first aspect of the present invention, the thread load balancing method further includes: after the second schedulers distribute the load, acquiring computing unit load data corresponding to all computing units in each second scheduler; acquiring scheduler load data of all second schedulers in each first scheduler according to the computing unit load data; load data for each shader thread is calculated from scheduler load data and by the global scheduler.
A second aspect of the present invention provides a thread load balancing apparatus, including: the system comprises a building module, a load balancing module and a control module, wherein the building module is used for building a global scheduler based on a preset graphics processor and a preset load balancing strategy, the global scheduler comprises a plurality of first schedulers, each first scheduler comprises a plurality of second schedulers, and each second scheduler comprises a plurality of computing units; the processing module is used for receiving drawing instructions through the global scheduler, generating first drawing tasks and sending the first drawing tasks to the plurality of first schedulers and the plurality of second schedulers; the updating module is used for calculating first load data of the first schedulers and the second schedulers, and performing step-by-step average distribution and load data updating on the first drawing task according to the first load data to obtain second load data; a configuration module, configured to configure shader threads to be run by the plurality of computing units according to the second load data, run the shader threads through the plurality of computing units, and monitor whether the plurality of computing units generate a second drawing task; and the distribution module is used for carrying out load distribution on the second drawing tasks through the plurality of first schedulers and the plurality of second schedulers according to a preset task distribution strategy to obtain a load distribution result if the load distribution result is positive.
Optionally, in a first implementation manner of the second aspect of the present invention, the update module is specifically configured to: calculating first load data of the plurality of first schedulers and the plurality of second schedulers, wherein the first load data is used for indicating original load data in the plurality of first schedulers and the plurality of second schedulers; obtaining a first scheduler number of the plurality of first schedulers, and obtaining a second scheduler number in each first scheduler; calculating load data of each first scheduler according to the first drawing task, the first load data and the number of the first schedulers, and calculating load data of each second scheduler according to the load data of each first scheduler and the number of the second schedulers in each first scheduler; second load data is generated from the load data of each first scheduler and the load data of each second scheduler.
Optionally, in a second implementation manner of the second aspect of the present invention, the allocating module further includes: a determining unit, configured to, if a second drawing task is generated, acquire the target computing unit generating the second drawing task and determine the target second scheduler to which the target computing unit belongs; a computing unit, configured to acquire computing-unit total load data of all computing units in the target second scheduler, and judge, according to the computing-unit total load data, whether the target second scheduler meets a first task allocation condition to obtain a first judgment result; and a distribution unit, configured to perform load distribution on the global scheduler according to the first judgment result and the second drawing task to obtain a load distribution result.
Optionally, in a third implementation manner of the second aspect of the present invention, the allocating unit is specifically configured to: when the first judgment result is that the first task allocation condition is not met, determining that the target second scheduler does not meet the first task allocation condition, and feeding the second drawing task back to a target first scheduler to which the target second scheduler belongs; acquiring total load data of second schedulers of all second schedulers in the target first scheduler, and judging whether the target first scheduler meets a second task allocation condition or not according to the total load data of the second schedulers; and if the target first scheduler does not meet a second task allocation condition, feeding the second drawing task back to the global scheduler, and performing task allocation on the second drawing task through the global scheduler to obtain a load allocation result.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the first task allocation condition includes: and judging whether the node load of the target second scheduler exceeds a target multiple corresponding to the node load average value of all other second schedulers in the same layer, and if so, determining that the load state of the target second scheduler is overload.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the thread load balancing apparatus further includes: the generating module is used for calculating clock cycles and data volumes consumed in the shader thread running process through the second scheduler when the plurality of computing units run the shader threads; counting the number of each shader thread operated by all computing units in the second scheduler, and respectively computing a clock period average value and a data volume average value corresponding to the clock period and the data volume; load data for each shader thread is generated based on the number of each shader thread, the average number of clock cycles, and the average number of data.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the thread load balancing apparatus further includes: the analysis module is used for acquiring the load data of the computing units corresponding to all the computing units in each second scheduler after the second scheduler performs load distribution; acquiring scheduler load data of all second schedulers in each first scheduler according to the computing unit load data; load data for each shader thread is calculated from scheduler load data and by the global scheduler.
A third aspect of the present invention provides a thread load balancing device, including: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the thread load balancing device to perform the thread load balancing method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to execute the above-mentioned thread load balancing method.
In the technical solution provided by the invention, a global scheduler is constructed based on a preset graphics processor and a preset load balancing strategy. The global scheduler manages thread scheduling tasks hierarchically, with each level of scheduler corresponding to a level of the GPU's multi-level cache structure, which reduces the computational burden on any single scheduler. The first drawing task is distributed evenly step by step and the load data is updated according to the first load data to obtain second load data; the time and space complexity of the various shader programs is taken into account so that the actual load of each computing unit is described as accurately as possible. Whether the computing units generate second drawing tasks is monitored, establishing a thread load redistribution mechanism for new tasks that may be generated during graphics pipeline stages; when tasks are redistributed, assignment to nodes within the same cluster is preferred, and the shared cache is used to reduce data transmission overhead. Applied to a modern graphics processor with a unified shader architecture, the invention balances the load across the processor's computing units, improves the utilization of computing resources, and increases the effective bandwidth for processing data.
Drawings
FIG. 1 is a diagram illustrating an embodiment of a thread load balancing method according to the present invention;
FIG. 2 is a diagram of another embodiment of a thread load balancing method according to the embodiment of the present invention;
FIG. 3 is a diagram of a thread load balancing apparatus according to an embodiment of the present invention;
FIG. 4 is a diagram of another embodiment of a thread load balancing apparatus according to the embodiment of the present invention;
FIG. 5 is a diagram of an embodiment of a thread load balancing device according to the embodiment of the present invention;
FIG. 6 is a diagram illustrating an embodiment of a thread load balancing hardware structure according to the present invention;
fig. 7 is a schematic diagram of an embodiment of task issuing in an embodiment of the present invention;
fig. 8 is a schematic diagram of an embodiment of task reallocation in the embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a thread load balancing method, a thread load balancing device and a thread load balancing storage medium, which are used for realizing load distribution balancing and improving the utilization rate of computing resources. The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Moreover, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of understanding, a detailed flow of an embodiment of the present invention is described below, and referring to fig. 1, an embodiment of a thread load balancing method according to an embodiment of the present invention includes:
101. constructing a global scheduler based on a preset graphics processor and a preset load balancing strategy, wherein the global scheduler comprises a plurality of first schedulers, each first scheduler comprises a plurality of second schedulers, and each second scheduler comprises a plurality of computing units;
it is to be understood that the execution subject of the present invention may be a thread load balancing device, and may also be a terminal or a server, which is not limited herein. The embodiment of the present invention is described by taking a server as an execution subject.
The first scheduler is used for respectively calculating load data corresponding to the plurality of second schedulers; the second scheduler is used for respectively calculating the task allocation of the plurality of computing units and respectively configuring shader threads to be operated by the plurality of computing units; and acquiring input data of a plurality of computing units; recording clock cycles and data volumes corresponding to the process of running the shader threads by the plurality of computing units; the global scheduler is used to compute the load data for each shader thread.
Specifically, each computing unit comprises a computing module and a storage module: the computing module executes shader threads, and the storage module caches instructions and input/output data. The global scheduler manages a number of first schedulers, each first scheduler manages a number of second schedulers, and each second scheduler manages a number of computing units; these counts may be 4, 8, 16, 32, and so on in different embodiments. As shown in fig. 6, a schematic diagram of an embodiment of the thread load balancing hardware structure in this embodiment, the L0 scheduler is the second scheduler and the L1 scheduler is the first scheduler.
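As a rough data-model sketch of this hierarchy (our own illustration, not the patent's implementation), the following C++ mirrors the global scheduler / L1 scheduler / L0 scheduler / computing-unit tree; all type and member names, and the default fan-out of 4, are assumptions chosen for readability:

```cpp
#include <vector>

// Hypothetical data model for the scheduler hierarchy described above.
struct ComputeUnit {
    double load = 0.0;              // accumulated shader-thread load L_CE
};

struct L0Scheduler {                // "second scheduler": manages computing units
    std::vector<ComputeUnit> units;
    double Load() const {
        double sum = 0.0;
        for (const ComputeUnit& cu : units) sum += cu.load;
        return sum;
    }
};

struct L1Scheduler {                // "first scheduler": manages L0 schedulers
    std::vector<L0Scheduler> l0s;
    double Load() const {
        double sum = 0.0;
        for (const L0Scheduler& s : l0s) sum += s.Load();
        return sum;
    }
};

struct GlobalScheduler {            // root: manages L1 schedulers
    std::vector<L1Scheduler> l1s;
};

// Build a uniform tree; 4, 8, 16, or 32 children per node are all embodiments.
GlobalScheduler BuildHierarchy(int fanout = 4) {
    GlobalScheduler g;
    g.l1s.resize(fanout);
    for (L1Scheduler& l1 : g.l1s) {
        l1.l0s.resize(fanout);
        for (L0Scheduler& l0 : l1.l0s) l0.units.resize(fanout);
    }
    return g;
}
```

Because each node aggregates only its direct children, the bookkeeping at any one scheduler stays proportional to its fan-out rather than to the total number of computing units, which is the stated motivation for the hierarchical design.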
102. Receiving a drawing instruction through a global scheduler, generating a first drawing task, and sending the first drawing task to a plurality of first schedulers and a plurality of second schedulers;
It should be noted that the second scheduler is responsible for task allocation to each computing unit: it designates the shader threads to be run by the computing unit and prepares their input data, and each time a shader thread finishes executing, the computing unit notifies the second scheduler and returns the output data to the second scheduler's storage module. For the i-th shader thread execution on a computing unit, the second scheduler records the consumed clock cycles $T_i$ and data amount $M_i$.
103. Calculating first load data of the plurality of first schedulers and the plurality of second schedulers, and performing step-by-step average distribution and load data updating on the first drawing task according to the first load data to obtain second load data;
it should be noted that the load data calculation formula is as follows:
$$T_i = T_{end} - T_{start}$$

$$M_i = M_{input} + M_{output}$$

where $T_{start}$ is the time at which the task is issued, $T_{end}$ is the time at which the second scheduler receives the result, $M_{input}$ is the shader thread's input data volume, and $M_{output}$ is its output data volume. The second scheduler counts, for each shader thread run on the computing units it manages, the number of runs and the averages of the clock cycles and data volume; the first scheduler aggregates the load of all computing units under each second scheduler it manages; and the global scheduler computes a global clock cycle and data volume for each shader thread. A specific implementation algorithm is as follows:
$$\bar{T} = \frac{1}{N}\sum_{i=1}^{N} T_i \qquad \bar{M} = \frac{1}{N}\sum_{i=1}^{N} M_i$$
where $T_i$ is the clock cycles consumed by the i-th run of the shader thread, $M_i$ is the data volume of the i-th run, and $N$ is the number of runs counted; the value of $N$ admits a number of different embodiments. In this embodiment, as shown in fig. 7, a schematic diagram of an embodiment of task issuing, CE denotes a computing unit, the L0 scheduler denotes the second scheduler, and the L1 scheduler denotes the first scheduler.
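To make the statistics concrete, here is a minimal C++ sketch of how an L0 ("second") scheduler might record $T_i$ and $M_i$ per completed shader-thread run and maintain the per-shader averages; the class and member names are hypothetical, and the incremental-mean update is just one way to realize the average formula above:

```cpp
#include <cstdint>
#include <unordered_map>

// Per-shader statistics an L0 ("second") scheduler could keep, assuming
// T_i = T_end - T_start and M_i = M_input + M_output as defined above.
struct ShaderStats {
    std::uint64_t count = 0;   // number of completed runs N
    double avg_cycles = 0.0;   // running mean of T_i
    double avg_bytes  = 0.0;   // running mean of M_i
};

class L0LoadTracker {
public:
    // Called once per completed shader-thread execution.
    void OnThreadFinished(int shader_id,
                          std::uint64_t t_start, std::uint64_t t_end,
                          std::uint64_t m_input, std::uint64_t m_output) {
        const double t_i = static_cast<double>(t_end - t_start);
        const double m_i = static_cast<double>(m_input + m_output);
        ShaderStats& s = stats_[shader_id];
        ++s.count;
        // Incremental mean: avg += (x - avg) / N, equivalent to (1/N) * sum.
        s.avg_cycles += (t_i - s.avg_cycles) / static_cast<double>(s.count);
        s.avg_bytes  += (m_i - s.avg_bytes)  / static_cast<double>(s.count);
    }

    const ShaderStats* Get(int shader_id) const {
        auto it = stats_.find(shader_id);
        return it == stats_.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<int, ShaderStats> stats_;  // keyed by shader program
};
```

The first scheduler and the global scheduler can then aggregate these per-L0 statistics level by level, exactly as the surrounding text describes.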
104. Configuring shader threads to be operated by the plurality of computing units according to the second load data, operating the shader threads through the plurality of computing units, and monitoring whether the plurality of computing units generate second drawing tasks;
and if not, the scheduler node cluster is considered to be overloaded seriously compared with other node clusters with the same quantity. Specifically, when the task is redistributed, the computing unit feeds back a new task to the second scheduler, and the second scheduler judges whether the distribution condition is met according to the total load of the computing unit managed by the second scheduler. In one embodiment, the task allocation condition is that the second scheduler node load does not exceed the average of the other second scheduler nodes load at the same levelβOtherwise, this scheduler node cluster is considered to be overloaded compared to other equal number of node clusters.
105. And if so, carrying out load distribution on the second drawing task through the plurality of first schedulers and the plurality of second schedulers according to a preset task distribution strategy to obtain a load distribution result.
Specifically, when the schedulers at all levels meet the allocation condition, the average allocation principle is followed, so that the thread loads of the nodes managed by the schedulers are consistent, and finally the thread loads recorded by the schedulers are updated. In some embodiments, the compute units or scheduler nodes in the same cluster are provided with a shared cache, reducing transmission overhead.
In the embodiment of the invention, a global scheduler is constructed based on a preset graphics processor and a preset load balancing strategy. The global scheduler manages thread scheduling tasks hierarchically, with each level of scheduler corresponding to a level of the GPU's multi-level cache structure, which reduces the computational burden on any single scheduler. The first drawing task is distributed evenly step by step and the load data is updated according to the first load data to obtain second load data; the time and space complexity of the various shader programs is taken into account so that the actual load assigned to each computing unit is described as accurately as possible. Whether the computing units generate second drawing tasks is monitored, establishing a thread load redistribution mechanism for new tasks that may be generated during graphics pipeline stages; when tasks are redistributed, assignment to nodes within the same cluster is preferred, and the shared cache is used to reduce data transmission overhead. Applied to a modern graphics processor with a unified shader architecture, the invention balances the load across the processor's computing units, improves the utilization of computing resources, and increases the effective bandwidth for processing data. In this embodiment, as shown in fig. 8, a schematic diagram of an embodiment of task reallocation, CE denotes a computing unit, the L0 scheduler denotes the second scheduler, and the L1 scheduler denotes the first scheduler.
Referring to fig. 2, another embodiment of the thread load balancing method according to the embodiment of the present invention includes:
201. constructing a global scheduler based on a preset graphics processor and a preset load balancing strategy, wherein the global scheduler comprises a plurality of first schedulers, each first scheduler comprises a plurality of second schedulers, and each second scheduler comprises a plurality of computing units;
specifically, in this embodiment, the specific implementation of step 201 is similar to that of step 101, and is not described herein again.
202. Receiving a drawing instruction through a global scheduler, generating a first drawing task, and sending the first drawing task to a plurality of first schedulers and a plurality of second schedulers;
specifically, in this embodiment, the specific implementation of step 202 is similar to that of step 102, and is not described here again.
203. Calculating first load data of the plurality of first schedulers and the plurality of second schedulers, and performing step-by-step average distribution and load data updating on the first drawing task according to the first load data to obtain second load data;
specifically, first load data of a plurality of first schedulers and a plurality of second schedulers are calculated, wherein the first load data is used for indicating original load data in the plurality of first schedulers and the plurality of second schedulers; acquiring a first scheduler number of a plurality of first schedulers, and acquiring a second scheduler number in each first scheduler; calculating load data of each first scheduler according to the first drawing tasks, the first load data and the number of the first schedulers, and calculating load data of each second scheduler according to the load data of each first scheduler and the number of the second schedulers in each first scheduler; second load data is generated from the load data of each first scheduler and the load data of each second scheduler.
The second scheduler counts, for each shader thread run on the computing units it manages, the number of runs and the averages of the clock cycles and data volume; the first scheduler aggregates the load of all computing units under each second scheduler it manages; and the global scheduler calculates a global clock cycle T and data volume M for each shader thread. A specific implementation algorithm is as follows:

$$\bar{T} = \frac{1}{N}\sum_{i=1}^{N} T_i$$

$$\bar{M} = \frac{1}{N}\sum_{i=1}^{N} M_i$$

where $T_i$ is the clock cycles consumed by the i-th run of the shader thread, $M_i$ is the data volume of the i-th run, and $N$ is the number of runs; the value of $N$ admits a number of different embodiments.
204. Configuring shader threads to be operated by the plurality of computing units according to the second load data, operating the shader threads through the plurality of computing units, and monitoring whether the plurality of computing units generate second drawing tasks;
205. if yes, acquiring a target calculation unit for generating a second drawing task, and determining a target second scheduler to which the target calculation unit belongs;
specifically, when the server detects whether the plurality of computing units generate the second drawing task, the server performs process identification on the plurality of computing units, determines corresponding event identifiers, and then determines whether the plurality of computing units generate the second drawing task according to the event identifiers, if yes, acquires corresponding target computing units, and determines corresponding second schedulers.
206. Acquiring total load data of computing units of all computing units in the target second scheduler, and judging whether the target second scheduler meets a first task allocation condition according to the total load data of the computing units to obtain a first judgment result;
specifically, the first task allocation condition includes: and judging whether the node load of the target second scheduler exceeds a target multiple corresponding to the node load average value of all other second schedulers in the same layer, and if so, determining that the load state of the target second scheduler is overload.
It should be noted that, when a task is redistributed, the computing unit feeds the new task back to the second scheduler, and the second scheduler judges whether the allocation condition is met according to the total load of the computing units it manages. In one embodiment, the task allocation condition is that the load of the second scheduler node does not exceed β times the average load of the other second scheduler nodes at the same level: it is judged whether the node load of the target second scheduler exceeds the target multiple of the average node load of all other second schedulers in the same layer, and if so, the load state of the target second scheduler is determined to be overloaded.
207. And carrying out load distribution on the global scheduler according to the first judgment result and the second drawing task to obtain a load distribution result.
Specifically, when the first judgment result is that the first task allocation condition is not met, determining that the target second scheduler does not meet the first task allocation condition, and feeding the second drawing task back to the target first scheduler to which the target second scheduler belongs; acquiring total load data of second schedulers of all second schedulers in the target first scheduler, and judging whether the target first scheduler meets a second task allocation condition or not according to the total load data of the second schedulers; and if the target first scheduler does not meet the second task allocation condition, feeding the second drawing task back to the global scheduler, and performing task allocation on the second drawing task through the global scheduler to obtain a load allocation result.
It should be noted that, when the first determination result is that the first task allocation condition is not satisfied, it is determined that the target second scheduler does not satisfy the first task allocation condition, for example, the task allocation condition is:
$$L_k \le \beta \cdot \frac{1}{n-1}\sum_{j \ne k} L_j$$

where $L_k$ is the node load of the target second scheduler, the $L_j$ are the node loads of its sibling second schedulers, n is the total number of second schedulers under the first scheduler node cluster, and β is a scaling factor whose value admits a number of different embodiments. When the allocation condition is not met, the second scheduler feeds the new task back to the first scheduler in the layer above. Similarly, the first scheduler judges whether its own allocation condition is met according to its total load; if not, the task is fed back to the global scheduler, which then allocates the task from the top down.
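A compact expression of this check and escalation, under the same illustrative assumptions as the earlier sketches: MayAcceptLocally implements the β-times-sibling-average condition above, and ChooseReassignmentLevel walks up the hierarchy exactly when the condition fails at a level:

```cpp
#include <cstddef>
#include <vector>

// True if loads[k] does not exceed beta times the average load of its n-1
// siblings, i.e. node k may keep the fed-back task within its own cluster.
bool MayAcceptLocally(const std::vector<double>& loads, std::size_t k,
                      double beta) {
    if (loads.size() < 2) return true;       // no siblings to compare against
    double sibling_sum = 0.0;
    for (std::size_t j = 0; j < loads.size(); ++j)
        if (j != k) sibling_sum += loads[j];
    const double sibling_avg =
        sibling_sum / static_cast<double>(loads.size() - 1);
    return loads[k] <= beta * sibling_avg;
}

// Escalation: try the L0 cluster first, then the L1 cluster, and finally
// fall back to a top-down reassignment by the global scheduler.
enum class Level { L0, L1, Global };

Level ChooseReassignmentLevel(const std::vector<double>& l0_loads, std::size_t l0_k,
                              const std::vector<double>& l1_loads, std::size_t l1_k,
                              double beta) {
    if (MayAcceptLocally(l0_loads, l0_k, beta)) return Level::L0;
    if (MayAcceptLocally(l1_loads, l1_k, beta)) return Level::L1;
    return Level::Global;
}
```

Keeping the task at the lowest level that passes the check is what lets reassignment favor nodes in the same cluster and exploit the shared cache.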
Optionally, when the plurality of computing units run the shader threads, the second scheduler calculates clock cycles and data volumes consumed in the running process of the shader threads; counting the number of each shader thread operated by all the computing units in the second scheduler, and respectively computing a clock period average value and a data volume average value corresponding to the clock period and the data volume; load data for each shader thread is generated based on the number of each shader thread, the clock cycle average, and the data volume average.
Optionally, after the second scheduler performs load distribution, the load data of the computing units corresponding to all the computing units in each second scheduler is obtained; acquiring scheduler load data of all second schedulers in each first scheduler according to the load data of the computing unit; load data for each shader thread is calculated from the scheduler load data and by the global scheduler.
The shader load $L_{shader}$ is obtained by a weighted average of the clock cycles T and the data volume M, defined as follows:

$$L_{shader} = \alpha \cdot T + (1 - \alpha) \cdot M$$

where α is a weighting factor whose value admits a number of different embodiments. The second scheduler records the load of each computing unit it manages and updates it when allocating thread tasks. In one embodiment, the shader thread load $L_{shader}$ assigned to a computing unit's task queue is accumulated onto the unit's original load value:

$$L_{CE} = L_{CE}' + L_{shader}$$

After the second schedulers have distributed the thread tasks, the load situation is fed back and updated layer by layer: each second scheduler counts the load $L_{CE}$ of the computing units it manages, each first scheduler counts the load of the second schedulers it manages, and the global scheduler calculates the load of each shader thread from the first schedulers.
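The weighted load and its accumulation reduce to two small helpers. This sketch reuses the hypothetical ComputeUnit from the earlier sketches; the default α = 0.5 is an arbitrary choice, since the patent leaves the weighting factor open:

```cpp
// L_shader = alpha * T + (1 - alpha) * M, where T and M are the per-shader
// average clock cycles and data volume computed above.
double ShaderLoad(double avg_cycles, double avg_bytes, double alpha) {
    return alpha * avg_cycles + (1.0 - alpha) * avg_bytes;
}

// Queuing a thread on a compute unit accumulates its load: L_CE = L_CE' + L_shader.
void AssignThread(ComputeUnit& cu, double avg_cycles, double avg_bytes,
                  double alpha = 0.5) {   // alpha = 0.5 is an assumed default
    cu.load += ShaderLoad(avg_cycles, avg_bytes, alpha);
}
```

In practice T and M live on very different scales (cycles vs. bytes), so an implementation would likely normalize them before weighting; the patent does not specify this step.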
In the embodiment of the invention, a global scheduler is constructed based on a preset graphics processor and a preset load balancing strategy. The global scheduler manages thread scheduling tasks hierarchically, with each level of scheduler corresponding to a level of the GPU's multi-level cache structure, which reduces the computational burden on any single scheduler. The first drawing task is distributed evenly step by step and the load data is updated according to the first load data to obtain second load data; the time and space complexity of the various shader programs is taken into account so that the actual load assigned to each computing unit is described as accurately as possible. Whether the computing units generate second drawing tasks is monitored, establishing a thread load redistribution mechanism for new tasks that may be generated during graphics pipeline stages; when tasks are redistributed, assignment to nodes within the same cluster is preferred, and the shared cache is used to reduce data transmission overhead. Applied to a modern graphics processor with a unified shader architecture, the invention balances the load across the processor's computing units, improves the utilization of computing resources, and increases the effective bandwidth for processing data.
With reference to fig. 3, the thread load balancing apparatus in the embodiment of the present invention is described above, and an embodiment of the thread load balancing apparatus in the embodiment of the present invention includes:
a building module 301, configured to build a global scheduler based on a preset graphics processor and a preset load balancing policy, wherein the global scheduler includes a plurality of first schedulers, each first scheduler includes a plurality of second schedulers, and each second scheduler includes a plurality of computing units;
a processing module 302, configured to receive a drawing instruction through the global scheduler, generate a first drawing task, and send the first drawing task to the plurality of first schedulers and the plurality of second schedulers;
an updating module 303, configured to calculate first load data of the multiple first schedulers and the multiple second schedulers, and perform step-by-step average allocation and load data update on the first drawing task according to the first load data to obtain second load data;
a configuration module 304, configured to configure shader threads to be executed by the multiple compute units according to the second load data, execute the shader threads through the multiple compute units, and monitor whether the multiple compute units generate a second drawing task;
and if so, performing load distribution on the second drawing task through the plurality of first schedulers and the plurality of second schedulers according to a preset task distribution strategy to obtain a load distribution result.
In the embodiment of the invention, a global scheduler is constructed based on a preset graphics processor and a preset load balancing strategy. The global scheduler manages thread scheduling tasks hierarchically, with each level of scheduler corresponding to a level of the GPU's multi-level cache structure, which reduces the computational burden on any single scheduler. The first drawing task is distributed evenly step by step and the load data is updated according to the first load data to obtain second load data; the time and space complexity of the various shader programs is taken into account so that the actual load assigned to each computing unit is described as accurately as possible. Whether the computing units generate second drawing tasks is monitored, establishing a thread load redistribution mechanism for new tasks that may be generated during graphics pipeline stages; when tasks are redistributed, assignment to nodes within the same cluster is preferred, and the shared cache is used to reduce data transmission overhead. Applied to a modern graphics processor with a unified shader architecture, the invention balances the load across the processor's computing units, improves the utilization of computing resources, and increases the effective bandwidth for processing data.
Referring to fig. 4, another embodiment of the thread load balancing apparatus according to the present invention includes:
a building module 301, configured to build a global scheduler based on a preset graphics processor and a preset load balancing policy, where the global scheduler includes a plurality of first schedulers, each first scheduler includes a plurality of second schedulers, and each second scheduler includes a plurality of computing units;
a processing module 302, configured to receive a drawing instruction through the global scheduler, generate a first drawing task, and send the first drawing task to the plurality of first schedulers and the plurality of second schedulers;
an updating module 303, configured to calculate first load data of the multiple first schedulers and the multiple second schedulers, and perform step-by-step average allocation and load data update on the first drawing task according to the first load data to obtain second load data;
a configuration module 304, configured to configure shader threads to be executed by the multiple compute units according to the second load data, execute the shader threads through the multiple compute units, and monitor whether the multiple compute units generate a second drawing task;
and if so, performing load distribution on the second drawing task through the plurality of first schedulers and the plurality of second schedulers according to a preset task distribution strategy to obtain a load distribution result.
Optionally, the updating module 303 is specifically configured to:
calculating first load data of the plurality of first schedulers and the plurality of second schedulers, wherein the first load data is indicative of raw load data in the plurality of first schedulers and the plurality of second schedulers; acquiring a first scheduler number of the plurality of first schedulers, and acquiring a second scheduler number in each first scheduler; calculating load data of each first scheduler according to the first drawing task, the first load data and the number of the first schedulers, and calculating load data of each second scheduler according to the load data of each first scheduler and the number of the second schedulers in each first scheduler; second load data is generated from the load data of each first scheduler and the load data of each second scheduler.
Optionally, the allocating module 305 further includes:
a determining unit, configured to, if a second drawing task is generated, acquire the target computing unit generating the second drawing task and determine the target second scheduler to which the target computing unit belongs;
the computing unit is used for acquiring the total load data of the computing units of all the computing units in the target second scheduler, and judging whether the target second scheduler meets a first task allocation condition according to the total load data of the computing units to obtain a first judgment result;
and the distribution unit is used for carrying out load distribution on the global scheduler according to the first judgment result and the second drawing task to obtain a load distribution result.
Optionally, the allocation unit is specifically configured to:
when the first judgment result is that the first task allocation condition is not met, determining that the target second scheduler does not meet the first task allocation condition, and feeding the second drawing task back to a target first scheduler to which the target second scheduler belongs; acquiring total load data of second schedulers of all second schedulers in the target first scheduler, and judging whether the target first scheduler meets a second task allocation condition or not according to the total load data of the second schedulers; and if the target first scheduler does not meet a second task allocation condition, feeding the second drawing task back to the global scheduler, and performing task allocation on the second drawing task through the global scheduler to obtain a load allocation result.
Optionally, the first task allocation condition includes: and judging whether the node load of the target second scheduler exceeds a target multiple corresponding to the node load average value of all other second schedulers in the same layer, and if so, determining that the load state of the target second scheduler is overload.
Optionally, the thread load balancing apparatus further includes:
a generating module 306, configured to calculate, by the second scheduler, clock cycles and data amounts consumed during running of the shader threads when the plurality of computing units run the shader threads; counting the number of each shader thread operated by all computing units in the second scheduler, and respectively computing a clock period average value and a data volume average value corresponding to the clock period and the data volume; load data for each shader thread is generated based on the number of each shader thread, the clock cycle average, and the data volume average.
Optionally, the thread load balancing device further includes:
an analysis module 307, configured to obtain computing unit load data corresponding to all computing units in each second scheduler after the second scheduler performs load distribution; acquiring scheduler load data of all second schedulers in each first scheduler according to the computing unit load data; load data for each shader thread is calculated from scheduler load data and by the global scheduler.
In the embodiment of the invention, a global scheduler is constructed based on a preset graphics processor and a preset load balancing strategy. The global scheduler manages thread scheduling tasks hierarchically, with each level of scheduler corresponding to a level of the GPU's multi-level cache structure, which reduces the computational burden on any single scheduler. The first drawing task is distributed evenly step by step and the load data is updated according to the first load data to obtain second load data; the time and space complexity of the various shader programs is taken into account so that the actual load assigned to each computing unit is described as accurately as possible. Whether the computing units generate second drawing tasks is monitored, establishing a thread load redistribution mechanism for new tasks that may be generated during graphics pipeline stages; when tasks are redistributed, assignment to nodes within the same cluster is preferred, and the shared cache is used to reduce data transmission overhead. Applied to a modern graphics processor with a unified shader architecture, the invention balances the load across the processor's computing units, improves the utilization of computing resources, and increases the effective bandwidth for processing data.
Fig. 3 and fig. 4 describe the thread load balancing apparatus in the embodiment of the present invention in detail from the perspective of the modular functional entity, and the thread load balancing device in the embodiment of the present invention is described in detail from the perspective of hardware processing.
Fig. 5 is a schematic structural diagram of a thread load balancing apparatus according to an embodiment of the present invention, where the thread load balancing apparatus 500 may have a relatively large difference due to different configurations or performances, and may include one or more processors (CPUs) 510 (e.g., one or more processors) and a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) for storing application programs 533 or data 532. Memory 520 and storage media 530 may be, among other things, transient or persistent storage. The program stored on the storage medium 530 may include one or more modules (not shown), each of which may include a sequence of instructions operating on the thread load balancing apparatus 500. Further, processor 510 may be configured to communicate with storage medium 530 to execute a series of instruction operations in storage medium 530 on thread load balancing device 500.
The thread load balancing apparatus 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input-output interfaces 560, and/or one or more operating systems 531, such as Windows Server, mac OS X, unix, linux, freeBSD, and the like. Those skilled in the art will appreciate that the thread load balancing apparatus configuration shown in fig. 5 does not constitute a limitation of the thread load balancing apparatus, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The present invention further provides a thread load balancing device, where the thread load balancing device includes a memory and a processor, where the memory stores computer readable instructions, and when the computer readable instructions are executed by the processor, the processor executes the steps of the thread load balancing method in the foregoing embodiments.
The present invention also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium, and which may also be a volatile computer readable storage medium, having stored therein instructions, which, when executed on a computer, cause the computer to perform the steps of the thread load balancing method.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
If the integrated unit is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A thread load balancing method, comprising the following steps:
constructing a global scheduler based on a preset graphics processor and a preset load balancing strategy, wherein the global scheduler comprises a plurality of first schedulers, each first scheduler comprises a plurality of second schedulers, and each second scheduler comprises a plurality of computing units;
receiving a drawing instruction through the global scheduler, generating a first drawing task, and sending the first drawing task to the plurality of first schedulers and the plurality of second schedulers;
calculating first load data of the plurality of first schedulers and the plurality of second schedulers, and performing level-by-level average distribution and load data updating on the first drawing task according to the first load data to obtain second load data; wherein calculating the first load data of the plurality of first schedulers and the plurality of second schedulers, and performing the level-by-level average distribution and the load data updating on the first drawing task according to the first load data to obtain the second load data includes: calculating the first load data of the plurality of first schedulers and the plurality of second schedulers, wherein the first load data indicates original load data of the plurality of first schedulers and the plurality of second schedulers; obtaining the number of first schedulers among the plurality of first schedulers, and obtaining the number of second schedulers in each first scheduler; calculating load data of each first scheduler according to the first drawing task, the first load data, and the number of first schedulers, and calculating load data of each second scheduler according to the load data of each first scheduler and the number of second schedulers in that first scheduler; and generating the second load data according to the load data of each first scheduler and the load data of each second scheduler;
configuring, according to the second load data, shader threads to be run by the plurality of computing units, running the shader threads through the plurality of computing units, and monitoring whether the plurality of computing units generate a second drawing task;
if so, performing load distribution on the second drawing task through the plurality of first schedulers and the plurality of second schedulers according to a preset task distribution strategy to obtain a load distribution result; wherein performing the load distribution on the second drawing task through the plurality of first schedulers and the plurality of second schedulers according to the preset task distribution strategy to obtain the load distribution result includes: acquiring a target computing unit that generates the second drawing task, and determining a target second scheduler to which the target computing unit belongs; acquiring total computing unit load data of all computing units in the target second scheduler, and judging, according to the total computing unit load data, whether the target second scheduler meets a first task allocation condition to obtain a first judgment result; and performing load distribution on the global scheduler according to the first judgment result and the second drawing task to obtain the load distribution result.
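Purely as an illustration (not part of the claimed method), the hierarchical construction and level-by-level average distribution of claim 1 can be sketched in Python as follows; every name here (GlobalScheduler, FirstScheduler, SecondScheduler, distribute) is hypothetical and stands in for the global scheduler, first schedulers, second schedulers, and computing units of the claim:

    # Hypothetical sketch: a three-level scheduler hierarchy that splits a
    # drawing task's load evenly, level by level, down to the compute units.
    class SecondScheduler:
        def __init__(self, num_compute_units):
            self.unit_loads = [0.0] * num_compute_units  # per-compute-unit load

        def distribute(self, load):
            share = load / len(self.unit_loads)  # even split across compute units
            self.unit_loads = [l + share for l in self.unit_loads]

    class FirstScheduler:
        def __init__(self, second_schedulers):
            self.children = second_schedulers

        def distribute(self, load):
            share = load / len(self.children)  # even split across second schedulers
            for child in self.children:
                child.distribute(share)

    class GlobalScheduler:
        def __init__(self, first_schedulers):
            self.children = first_schedulers

        def distribute(self, task_load):
            share = task_load / len(self.children)  # even split across first schedulers
            for child in self.children:
                child.distribute(share)

    # 2 first schedulers x 2 second schedulers x 4 compute units:
    gs = GlobalScheduler([FirstScheduler([SecondScheduler(4) for _ in range(2)])
                          for _ in range(2)])
    gs.distribute(64.0)  # each compute unit ends up carrying 64 / 16 = 4.0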
2. The thread load balancing method according to claim 1, wherein performing load distribution on the global scheduler according to the first judgment result and the second drawing task to obtain the load distribution result includes:
when the first judgment result indicates that the first task allocation condition is not met, determining that the target second scheduler does not meet the first task allocation condition, and feeding the second drawing task back to a target first scheduler to which the target second scheduler belongs;
acquiring total second scheduler load data of all second schedulers in the target first scheduler, and judging, according to the total second scheduler load data, whether the target first scheduler meets a second task allocation condition;
and if the target first scheduler does not meet the second task allocation condition, feeding the second drawing task back to the global scheduler, and performing task allocation on the second drawing task through the global scheduler to obtain the load distribution result.
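A hypothetical sketch of this upward feedback path, reusing the scheduler classes from the sketch after claim 1; the two predicates are assumptions standing in for the first and second task allocation conditions, which the claims leave to the implementation:

    # Hypothetical sketch: a second drawing task is offered to the second
    # scheduler owning the originating compute unit, and escalated level by
    # level whenever the allocation condition is not met.
    def allocate_with_feedback(task_load, second, first, global_sched,
                               second_accepts, first_accepts):
        if second_accepts(second):
            second.distribute(task_load)        # absorbed at the lowest level
        elif first_accepts(first):
            first.distribute(task_load)         # fed back to the first scheduler
        else:
            global_sched.distribute(task_load)  # fed back to the global scheduler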
3. The thread load balancing method according to claim 1, wherein the first task allocation condition comprises: judging whether the node load of the target second scheduler exceeds a target multiple of the average node load of all other second schedulers at the same level, and if so, determining that the load state of the target second scheduler is overloaded.
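The overload test of claim 3 might look as follows; the target multiple (1.5 here) is an assumed tuning parameter, since the claim does not fix its value:

    # Hypothetical sketch: a scheduler is overloaded when its node load
    # exceeds a target multiple of the average node load of its same-level
    # sibling schedulers.
    def is_overloaded(node_load, sibling_loads, target_multiple=1.5):
        if not sibling_loads:
            return False
        average = sum(sibling_loads) / len(sibling_loads)
        return node_load > target_multiple * average

    print(is_overloaded(9.0, [4.0, 5.0, 6.0]))  # True: 9.0 > 1.5 * 5.0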
4. The thread load balancing method according to claim 1, further comprising:
when the plurality of computing units run the shader threads, calculating, through the second scheduler, the clock cycles and data volumes consumed while the shader threads run;
counting the number of each shader thread run by all computing units in the second scheduler, and calculating a clock cycle average value and a data volume average value corresponding to the clock cycles and the data volumes, respectively;
and generating load data of each shader thread based on the number of each shader thread, the clock cycle average value, and the data volume average value.
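A minimal sketch of these per-shader-thread statistics; how the thread count, clock cycle average, and data volume average combine into one load figure is an assumption, since the claim does not fix the formula:

    # Hypothetical sketch: group measured runs by shader, average the clock
    # cycles and data volumes, and combine them with the run count into a
    # per-shader load figure (the combination below is illustrative only).
    from collections import defaultdict

    def shader_load_data(samples):
        # samples: iterable of (shader_id, clock_cycles, data_volume)
        by_shader = defaultdict(list)
        for shader_id, cycles, volume in samples:
            by_shader[shader_id].append((cycles, volume))
        loads = {}
        for shader_id, runs in by_shader.items():
            count = len(runs)
            avg_cycles = sum(c for c, _ in runs) / count
            avg_volume = sum(v for _, v in runs) / count
            loads[shader_id] = count * (avg_cycles + avg_volume)
        return loads

    print(shader_load_data([("vs", 100, 10), ("vs", 120, 14), ("fs", 300, 50)]))
    # {'vs': 244.0, 'fs': 350.0}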
5. The thread load balancing method according to any one of claims 1 to 4, further comprising:
after the second schedulers distribute the load, acquiring computing unit load data corresponding to all computing units in each second scheduler;
acquiring scheduler load data of all second schedulers in each first scheduler according to the computing unit load data;
and calculating, through the global scheduler, the load data of each shader thread according to the scheduler load data.
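A sketch of this bottom-up aggregation under the assumption that each second scheduler exposes its compute-unit loads as a plain list; the final per-shader-thread calculation performed by the global scheduler is left abstract, as the claim does not specify it:

    # Hypothetical sketch: roll compute-unit loads up into second-scheduler
    # totals, then into first-scheduler totals, which the global scheduler
    # would use to derive per-shader-thread load data.
    def aggregate_loads(first_schedulers):
        # first_schedulers: list of first schedulers, each a list of second
        # schedulers, each a list of compute-unit loads
        first_totals = []
        for second_schedulers in first_schedulers:
            second_totals = [sum(unit_loads) for unit_loads in second_schedulers]
            first_totals.append(sum(second_totals))
        return first_totals

    # 2 first schedulers, each with 2 second schedulers of 2 compute units:
    print(aggregate_loads([[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [1.0, 1.0]]]))
    # [10.0, 3.0]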
6. A thread load balancing apparatus, comprising:
the system comprises a building module, a load balancing module and a control module, wherein the building module is used for building a global scheduler based on a preset graphics processor and a preset load balancing strategy, the global scheduler comprises a plurality of first schedulers, each first scheduler comprises a plurality of second schedulers, and each second scheduler comprises a plurality of computing units;
the processing module is used for receiving drawing instructions through the global scheduler, generating first drawing tasks and sending the first drawing tasks to the plurality of first schedulers and the plurality of second schedulers;
the updating module is used for calculating first load data of the first schedulers and the second schedulers, and performing step-by-step average distribution and load data updating on the first drawing task according to the first load data to obtain second load data; wherein the calculating first load data of the plurality of first schedulers and the plurality of second schedulers, and performing level-by-level average allocation and load data update on the first drawing task according to the first load data to obtain second load data includes: calculating first load data of the plurality of first schedulers and the plurality of second schedulers, wherein the first load data is indicative of raw load data in the plurality of first schedulers and the plurality of second schedulers; obtaining a first scheduler number of the plurality of first schedulers, and obtaining a second scheduler number in each first scheduler; calculating load data of each first scheduler according to the first drawing task, the first load data and the number of the first schedulers, and calculating load data of each second scheduler according to the load data of each first scheduler and the number of the second schedulers in each first scheduler; generating second load data according to the load data of each first scheduler and the load data of each second scheduler;
the configuration module is used for configuring shader threads to be operated by the plurality of computing units according to the second load data, operating the shader threads through the plurality of computing units, and monitoring whether the plurality of computing units generate second drawing tasks or not;
the distribution module is used for carrying out load distribution on the second drawing tasks through the plurality of first schedulers and the plurality of second schedulers according to a preset task distribution strategy to obtain a load distribution result if the distribution module is used for carrying out load distribution on the second drawing tasks according to a preset task distribution strategy; if so, performing load distribution on the second drawing task through the plurality of first schedulers and the plurality of second schedulers according to a preset task distribution strategy to obtain a load distribution result, including: if so, acquiring a target computing unit generating the second drawing task, and determining a target second scheduler to which the target computing unit belongs; acquiring total load data of computing units of all computing units in the target second scheduler, and judging whether the target second scheduler meets a first task allocation condition according to the total load data of the computing units to obtain a first judgment result; and carrying out load distribution on the global scheduler according to the first judgment result and the second drawing task to obtain a load distribution result.
7. A thread load balancing device, comprising: a memory and at least one processor, the memory having instructions stored therein;
wherein the at least one processor invokes the instructions in the memory to cause the thread load balancing device to perform the thread load balancing method of any one of claims 1-5.
8. A computer-readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, implement the thread load balancing method of any one of claims 1-5.
CN202211082045.9A 2022-09-06 2022-09-06 Thread load balancing method, device, equipment and storage medium Active CN115168058B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211082045.9A CN115168058B (en) 2022-09-06 2022-09-06 Thread load balancing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115168058A CN115168058A (en) 2022-10-11
CN115168058B true CN115168058B (en) 2022-11-25

Family

ID=83482228

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211082045.9A Active CN115168058B (en) 2022-09-06 2022-09-06 Thread load balancing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115168058B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116578425B (en) * 2023-07-11 2023-09-22 沐曦集成电路(上海)有限公司 Load balancing method and system based on rasterization

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6105053A (en) * 1995-06-23 2000-08-15 Emc Corporation Operating system for a non-uniform memory access multiprocessor system
CN101256515A (en) * 2008-03-11 2008-09-03 浙江大学 Method for implementing load equalization of multicore processor operating system
CN101951411A (en) * 2010-10-13 2011-01-19 戴元顺 Cloud scheduling system and method and multistage cloud scheduling system
CN104506452A (en) * 2014-12-16 2015-04-08 福建星网锐捷网络有限公司 Message processing method and message processing device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10496448B2 (en) * 2017-04-01 2019-12-03 Intel Corporation De-centralized load-balancing at processors
CN111078394B (en) * 2019-11-08 2022-12-06 苏州浪潮智能科技有限公司 GPU thread load balancing method and device
CN114579299A (en) * 2022-01-29 2022-06-03 中国科学院计算技术研究所 GPU thread load balancing method, device, chip and electronic equipment

Similar Documents

Publication Publication Date Title
JP5343523B2 (en) Job management apparatus, job management method, and job management program
US8707314B2 (en) Scheduling compute kernel workgroups to heterogeneous processors based on historical processor execution times and utilizations
JP4621087B2 (en) System and method for operating load balancer for multiple instance applications
CN108845874B (en) Dynamic resource allocation method and server
US8261281B2 (en) Optimizing allocation of resources on partitions of a data processing system
US11876731B2 (en) System and methods for sharing memory subsystem resources among datacenter applications
EP2255286B1 (en) Routing workloads and method thereof
US11496413B2 (en) Allocating cloud computing resources in a cloud computing environment based on user predictability
Verner et al. Scheduling processing of real-time data streams on heterogeneous multi-GPU systems
KR20110075295A (en) Job allocation method on multi-core system and apparatus thereof
CN107346264A (en) A kind of method, apparatus and server apparatus of virtual machine load balance scheduling
CN103473120A (en) Acceleration-factor-based multi-core real-time system task partitioning method
CN115168058B (en) Thread load balancing method, device, equipment and storage medium
CN116467082A (en) Big data-based resource allocation method and system
CN107203256B (en) Energy-saving distribution method and device under network function virtualization scene
Banicescu et al. Addressing the stochastic nature of scientific computations via dynamic loop scheduling
Kumar et al. Load balancing algorithm to minimize the makespan time in cloud environment
CN112015533A (en) Task scheduling method and device suitable for distributed rendering
CN107423134A (en) A kind of dynamic resource scheduling method of large-scale calculations cluster
Blanche et al. Terrible twins: A simple scheme to avoid bad co-schedules
JP6732693B2 (en) Resource allocation control system, resource allocation control method, and program
Banicescu et al. Towards the robustness of dynamic loop scheduling on large-scale heterogeneous distributed systems
Wang et al. Improving utilization through dynamic VM resource allocation in hybrid cloud environment
CN110413393A (en) Cluster resource management method, device, computer cluster and readable storage medium storing program for executing
CN114579284A (en) Task scheduling method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant