CN111078394B - GPU thread load balancing method and device - Google Patents
- Publication number
- CN111078394B (application CN201911086251.5A)
- Authority
- CN
- China
- Prior art keywords
- state information
- load
- computing unit
- storage unit
- instructions
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/505—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
Abstract
The invention discloses a GPU thread load balancing method and apparatus, comprising the following steps: polling the compute unit and load store unit of each thread to acquire the register state information of each thread and form a register state table; analyzing the register state information of each thread to determine the type of instruction each thread is executing, and updating a history store load table; and assigning new work tasks to the threads according to the register state table and the history store load table. The invention balances the load across the threads of a GPU, improves working efficiency and stability, and prolongs the service life of the hardware.
Description
Technical Field
The present invention relates to the field of load balancing, and more particularly, to a method and an apparatus for load balancing of GPU threads.
Background
Massively parallel processors such as GPUs (graphics processing units) stack large numbers of compute units spatially and improve computing performance by increasing parallelism. However, because of fundamental architectural differences between the GPU and the CPU (central processing unit), the GPU must continually refine its hardware architecture and improve its scheduling mechanisms and strategies in order to fully utilize computing resources while avoiding excessive hardware overhead. In prior-art GPU structures, the load of thread computing tasks is unbalanced, so that some compute units remain busy or idle for long periods, which reduces computing efficiency, or deadlocks waste large amounts of computing resources; moreover, the load store units lack corresponding access optimization, which easily causes uneven hardware wear.
No effective solution has yet been proposed for this problem of uneven GPU hardware resource load in the prior art.
Disclosure of Invention
In view of this, an object of the embodiments of the present invention is to provide a GPU thread load balancing method and apparatus, which can balance the load across the threads of a GPU, improve working efficiency and stability, and prolong the service life of the hardware.
Based on the above objects, a first aspect of the embodiments of the present invention provides a GPU thread load balancing method, including the following steps:
polling the compute unit and load store unit of each thread to acquire the register state information of each thread and form a register state table;
analyzing the register state information of each thread to determine the type of instruction each thread is executing, and updating a history store load table;
and assigning new work tasks to the threads according to the register state table and the history store load table.
In some embodiments, polling the compute units and load store units of the threads comprises:
sending a first access request to each compute unit and each load store unit in turn, the first access request requiring the unit to feed back a first data packet within a preset time;
and sending a second access request to every compute unit and load store unit that did not feed back the first data packet within the preset time, the second access request requiring the unit to feed back a second data packet within the preset time, where the first data packet may be the same as or different from the second data packet.
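The two-round polling described above can be sketched in software as follows. This is a minimal model, not the patent's hardware implementation; the `poll_units` name and the idea of modeling a feedback timeout as `None` are illustrative assumptions.

```python
from typing import Callable, Dict, Optional

def poll_units(units: Dict[str, Callable[[], Optional[bytes]]]) -> Dict[str, Optional[bytes]]:
    """Send a first access request to every unit in turn; units that fail to
    feed back a packet within the preset time (modeled here as returning
    None) receive a second access request."""
    results: Dict[str, Optional[bytes]] = {}
    missed = []
    for name, request_state in units.items():   # first access request, in turn
        packet = request_state()
        results[name] = packet
        if packet is None:
            missed.append(name)
    for name in missed:                         # second access request to non-responders
        results[name] = units[name]()
    return results

# CU1 times out on the first request and answers the second one.
_responses = iter([None, b"\x02"])
out = poll_units({"CU0": lambda: b"\x01", "CU1": lambda: next(_responses)})
```

Note that the second data packet may differ from the first, as the claim allows: here CU1's eventual answer is simply whatever it feeds back on the retry.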
In some embodiments, acquiring the register state information of each thread and forming the register state table comprises:
acquiring busy state information of the compute unit and load store unit of each thread, the busy state information being one of: not busy; half busy, indicating that the instruction has run halfway; fully busy;
acquiring wait state information of the compute unit and load store unit of each thread, the wait state information being one of: waiting for an instruction to be issued; waiting for load/store data; waiting for the result of the previous step;
acquiring deadlock state information of the compute unit and load store unit of each thread, the deadlock state information being one of: unresolved deadlock; invalid load data acquired; result of the previous step lost;
acquiring ready state information of the compute unit and load store unit of each thread, the ready state information being one of: ready to accept an instruction; ready to receive data; ready for a jump instruction;
and writing the busy state information, wait state information, deadlock state information, and ready state information of each thread into the register state table.
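A minimal sketch of writing the four kinds of state information into a register state table (RST). The string values and the dict-of-dicts layout are illustrative assumptions, not the patent's actual RAM format:

```python
# Allowed values mirror the four state lists above (illustrative strings).
BUSY_STATES  = {"not busy", "half busy", "fully busy"}
WAIT_STATES  = {"waiting for instruction issue", "waiting for load/store data",
                "waiting for previous result"}
DEAD_STATES  = {"unresolved deadlock", "invalid load data", "previous result lost"}
READY_STATES = {"ready to accept instruction", "ready to receive data",
                "ready for jump instruction"}

def rst_entry(busy, wait, dead, ready):
    """Validate one unit's four state fields and return its RST row."""
    assert busy in BUSY_STATES and wait in WAIT_STATES
    assert dead in DEAD_STATES and ready in READY_STATES
    return {"busy": busy, "wait": wait, "dead": dead, "ready": ready}

rst = {
    "CU0": rst_entry("not busy", "waiting for instruction issue",
                     "unresolved deadlock", "ready to accept instruction"),
    "LSU0": rst_entry("fully busy", "waiting for load/store data",
                      "previous result lost", "ready to receive data"),
}
```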
In some embodiments, determining the type of instruction each thread is executing and updating the history store load table comprises:
determining, according to the busy, wait, deadlock, and ready state information of the compute unit and load store unit, whether the instruction being executed is a long instruction occupying more clock cycles than a threshold or a short instruction occupying fewer clock cycles than the threshold;
in response to determining that a compute unit or load store unit is executing a long instruction, incrementing by one the recorded long-instruction access count of the corresponding register in the history store load table;
and in response to determining that a compute unit or load store unit is executing a short instruction, incrementing by one the recorded short-instruction access count of the corresponding register in the history store load table.
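The long/short classification and counter update can be sketched as follows. The 8-cycle threshold is an assumed value for illustration; the patent does not fix a concrete threshold:

```python
from collections import defaultdict

LONG_THRESHOLD = 8  # clock cycles; assumed, not specified by the patent

def update_hslt(hslt, unit, cycles):
    """Classify an executing instruction by its clock-cycle count and
    increment the corresponding long/short access counter for the unit's
    register in the history store load table (HSLT)."""
    kind = "long" if cycles > LONG_THRESHOLD else "short"
    hslt[unit][kind] += 1
    return kind

hslt = defaultdict(lambda: {"long": 0, "short": 0})
update_hslt(hslt, "CU3", 20)  # a 20-cycle instruction counts as long
update_hslt(hslt, "CU3", 2)   # a 2-cycle instruction counts as short
```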
In some embodiments, assigning new work tasks to the threads according to the register state table and the history store load table comprises:
resolving the deadlock of any compute unit whose deadlock state is unresolved deadlock, so as to restore its availability;
assigning new work tasks to the threads of compute units whose wait state is waiting for an instruction;
and assigning new work tasks to the threads of compute units whose ready state is ready to accept an instruction.
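The three assignment conditions above can be sketched as a partition of the register state table. The field names mirror the earlier sketches and are assumptions for illustration:

```python
def pick_targets(rst):
    """Partition units by the three conditions above, in priority order:
    deadlocked units to recover first, then units waiting for an
    instruction, then units ready to accept one."""
    deadlocked = [u for u, s in rst.items() if s["dead"] == "unresolved deadlock"]
    waiting    = [u for u, s in rst.items() if s["wait"] == "waiting for instruction issue"]
    ready      = [u for u, s in rst.items() if s["ready"] == "ready to accept instruction"]
    return deadlocked, waiting, ready

rst = {
    "CU0": {"dead": "unresolved deadlock", "wait": "waiting for load/store data",
            "ready": "ready to receive data"},
    "CU1": {"dead": "none", "wait": "waiting for instruction issue",
            "ready": "ready to accept instruction"},
}
deadlocked, waiting, ready = pick_targets(rst)
```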
A second aspect of the embodiments of the present invention provides a GPU thread load balancing apparatus, comprising:
a processor; and
a memory storing program code executable by the processor, the program code when executed performing the following steps:
polling the compute unit and load store unit of each thread to acquire the register state information of each thread and form a register state table;
analyzing the register state information of each thread to determine the type of instruction each thread is executing, and updating a history store load table;
and assigning new work tasks to the threads according to the register state table and the history store load table.
In some embodiments, polling the compute units and load store units of the threads comprises:
sending a first access request to each compute unit and each load store unit in turn, the first access request requiring the unit to feed back a first data packet within a preset time;
and sending a second access request to every compute unit and load store unit that did not feed back the first data packet within the preset time, the second access request requiring the unit to feed back a second data packet within the preset time, where the first data packet may be the same as or different from the second data packet.
In some embodiments, acquiring the register state information of each thread and forming the register state table comprises:
acquiring busy state information of the compute unit and load store unit of each thread, the busy state information being one of: not busy; half busy, indicating that the instruction has run halfway; fully busy;
acquiring wait state information of the compute unit and load store unit of each thread, the wait state information being one of: waiting for an instruction to be issued; waiting for load/store data; waiting for the result of the previous step;
acquiring deadlock state information of the compute unit and load store unit of each thread, the deadlock state information being one of: unresolved deadlock; invalid load data acquired; result of the previous step lost;
acquiring ready state information of the compute unit and load store unit of each thread, the ready state information being one of: ready to accept an instruction; ready to receive data; ready for a jump instruction;
and writing the busy state information, wait state information, deadlock state information, and ready state information of each thread into the register state table.
In some embodiments, determining the type of instruction each thread is executing and updating the history store load table comprises:
determining, according to the busy, wait, deadlock, and ready state information of the compute unit and load store unit, whether the instruction being executed is a long instruction occupying more clock cycles than a threshold or a short instruction occupying fewer clock cycles than the threshold;
in response to determining that a compute unit or load store unit is executing a long instruction, incrementing by one the recorded long-instruction access count of the corresponding register in the history store load table;
and in response to determining that a compute unit or load store unit is executing a short instruction, incrementing by one the recorded short-instruction access count of the corresponding register in the history store load table.
In some embodiments, assigning new work tasks to the threads according to the register state table and the history store load table comprises:
resolving the deadlock of any compute unit whose deadlock state is unresolved deadlock, so as to restore its availability;
assigning new work tasks to the threads of compute units whose wait state is waiting for an instruction;
and assigning new work tasks to the threads of compute units whose ready state is ready to accept an instruction.
The invention has the following beneficial technical effects: the GPU thread load balancing method and apparatus provided by the embodiments of the present invention poll the compute unit and load store unit of each thread to acquire register state information and form a register state table; analyze the register state information to determine the type of instruction each thread is executing and update a history store load table; and assign new work tasks to the threads according to the register state table and the history store load table. This technical solution balances the load across the threads of the GPU, improves working efficiency and stability, and prolongs the service life of the hardware.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in their description are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a GPU thread load balancing method according to the present invention;
FIG. 2 is a hardware schematic diagram of a GPU thread load balancing method according to the present invention;
FIG. 3 is an example of a register state table of the GPU thread load balancing method provided in the present invention;
FIG. 4 is an example of a historical storage load table of the GPU thread load balancing method provided by the present invention;
fig. 5 is a flowchart of a GPU thread load balancing method provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that the expressions "first" and "second" in the embodiments of the present invention are used only to distinguish two entities with the same name or two non-identical parameters. "First" and "second" are merely for convenience of description and should not be construed as limiting the embodiments of the present invention; the following embodiments do not repeat this point.
In view of the above objects, a first aspect of the embodiments of the present invention provides an embodiment of a method for balancing the load across the threads of a GPU. Fig. 1 is a schematic flowchart of the GPU thread load balancing method provided by the present invention.
As shown in fig. 1, the GPU thread load balancing method includes the following steps:
step S101: polling the compute unit and load store unit of each thread to acquire the register state information of each thread and form a register state table;
step S103: analyzing the register state information of each thread to determine the type of instruction each thread is executing, and updating a history store load table;
step S105: assigning new work tasks to the threads according to the register state table and the history store load table.
The embodiments of the present invention implement an FPGA-based GPU load balancing optimization design that balances computing tasks and memory access tasks, eliminates deadlock states in a timely manner, and reduces the wear on storage units during memory access.
Those skilled in the art will understand that all or part of the processes of the above method embodiments may be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM). Embodiments of the computer program may achieve effects the same as or similar to those of any corresponding method embodiment.
In some embodiments, polling the compute unit and load store unit of each thread comprises:
sending a first access request to each compute unit and each load store unit in turn, the first access request requiring the unit to feed back a first data packet within a preset time;
and sending a second access request to every compute unit and load store unit that did not feed back the first data packet within the preset time, the second access request requiring the unit to feed back a second data packet within the preset time, where the first data packet may be the same as or different from the second data packet.
In some embodiments, acquiring the register state information of each thread and forming the register state table comprises:
acquiring busy state information of the compute unit and load store unit of each thread, the busy state information being one of: not busy; half busy, indicating that the instruction has run halfway; fully busy;
acquiring wait state information of the compute unit and load store unit of each thread, the wait state information being one of: waiting for an instruction to be issued; waiting for load/store data; waiting for the result of the previous step;
acquiring deadlock state information of the compute unit and load store unit of each thread, the deadlock state information being one of: unresolved deadlock; invalid load data acquired; result of the previous step lost;
acquiring ready state information of the compute unit and load store unit of each thread, the ready state information being one of: ready to accept an instruction; ready to receive data; ready for a jump instruction;
and writing the busy state information, wait state information, deadlock state information, and ready state information of each thread into the register state table.
In some embodiments, determining the type of instruction each thread is executing and updating the history store load table comprises:
determining, according to the busy, wait, deadlock, and ready state information of the compute unit and load store unit, whether the instruction being executed is a long instruction occupying more clock cycles than a threshold or a short instruction occupying fewer clock cycles than the threshold;
in response to determining that a compute unit or load store unit is executing a long instruction, incrementing by one the recorded long-instruction access count of the corresponding register in the history store load table;
and in response to determining that a compute unit or load store unit is executing a short instruction, incrementing by one the recorded short-instruction access count of the corresponding register in the history store load table.
In some embodiments, assigning new work tasks to the threads according to the register state table and the history store load table comprises:
resolving the deadlock of any compute unit whose deadlock state is unresolved deadlock, so as to restore its availability;
assigning new work tasks to the threads of compute units whose wait state is waiting for an instruction;
and assigning new work tasks to the threads of compute units whose ready state is ready to accept an instruction.
The method disclosed in the embodiments of the present invention may also be implemented as a computer program executed by a CPU and stored in a computer-readable storage medium. When executed by the CPU, the computer program performs the functions defined in the method disclosed above. The above method steps and system elements may also be implemented with a controller and a computer-readable storage medium storing a computer program that causes the controller to implement the functions of those steps or elements.
An embodiment of the present invention is further illustrated below with reference to fig. 5.
Fig. 2 shows a hardware schematic of an embodiment of the invention. CU0-CU15 are 16 compute units and LSU0-LSU7 are 8 load store units. The RST (register state table) and HSLT (history store load table) are obtained by polling access, state acquisition, and instruction analysis of the 16 compute units and 8 load store units. Polling access means accessing the status registers of CU0-CU15 in turn, skipping any register that cannot be accessed and accessing the skipped registers again at the end of the round. State acquisition means reading the data of the status registers, forming 32-bit packets, and writing them into the RAM. Instruction analysis means dividing instructions into long instructions and short instructions according to the clock cycles they require.
As shown in fig. 2, the RST is stored in the RAM; fig. 3 shows an example RST holding the register state information of the 16 compute units and 8 load store units. The register state information format of each register is as follows:
Bit   | 7:6  | 5:4  | 3:2  | 1:0
Field | BUSY | PREP | DEAD | READY
BUSY state (2 bits): 00 - not busy; 01 - half busy (instruction has run halfway); 10 - fully busy; 11 - reserved;
PREP state (2 bits): 00 - waiting for instruction issue; 01 - waiting for load/store data; 10 - waiting for the result of the previous step; 11 - reserved;
DEAD state (2 bits): 00 - unresolved deadlock; 01 - invalid load data acquired; 10 - result of the previous step lost; 11 - reserved;
READY state (2 bits): 00 - ready to accept an instruction; 01 - ready to receive data; 10 - ready for a jump instruction; 11 - reserved.
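The 8-bit layout above can be encoded and decoded directly. The following is a transcription of the bit table; the function names are chosen for illustration:

```python
def pack_status(busy, prep, dead, ready):
    """Pack the four 2-bit fields into one status byte, matching the layout
    above: BUSY in bits 7:6, PREP in 5:4, DEAD in 3:2, READY in 1:0."""
    for field in (busy, prep, dead, ready):
        assert 0 <= field <= 3          # each field is exactly 2 bits
    return (busy << 6) | (prep << 4) | (dead << 2) | ready

def unpack_status(byte):
    """Recover (busy, prep, dead, ready) from a packed status byte."""
    return (byte >> 6) & 3, (byte >> 4) & 3, (byte >> 2) & 3, byte & 3

# half busy (01), waiting for instruction issue (00), reserved (11), jump (10)
status = pack_status(0b01, 0b00, 0b11, 0b10)
```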
An example HSLT is shown in fig. 4. It records the instruction execution history of each register; consulting this history data allows new work tasks to be allocated better. The assignment priorities are: first, compute units in the deadlock state - read the status register and clear the deadlock in time to restore the working capacity of the CU; second, compute units whose registers enter the wait state - insert computing tasks for them in time; third, compute units whose registers are in the ready state - schedule tasks across priorities in time so that idle CUs are allocated more task queues.
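One plausible way to consult the HSLT history when choosing among candidate units, sketched under stated assumptions: prefer the unit with the lightest recorded load so that work, and hence hardware wear, is spread evenly. The 4x weighting of long instructions over short ones is an assumption for illustration, not a value from the patent:

```python
def least_loaded(candidates, hslt):
    """Among candidate units, pick the one with the lightest recorded
    instruction history, weighting long instructions more heavily."""
    def load(unit):
        h = hslt.get(unit, {"long": 0, "short": 0})
        return 4 * h["long"] + h["short"]   # assumed weighting
    return min(candidates, key=load)

hslt = {"CU0": {"long": 5, "short": 0},   # historical load 20
        "CU1": {"long": 0, "short": 3}}   # historical load 3
chosen = least_loaded(["CU0", "CU1"], hslt)
```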
As can be seen from the above embodiments, the GPU thread load balancing method provided by the embodiments of the present invention polls the compute unit and load store unit of each thread to acquire register state information and form a register state table; analyzes the register state information to determine the type of instruction each thread is executing and updates a history store load table; and assigns new work tasks to the threads according to the register state table and the history store load table. This technical solution balances the load across the threads of the GPU, improves working efficiency and stability, and prolongs the service life of the hardware.
It should be particularly noted that the steps of the above embodiments of the GPU thread load balancing method may be interleaved, replaced, added, or deleted with respect to one another. Such reasonable permutations, combinations, and transformations also belong to the scope of the present invention, and the scope of protection should not be limited to the described embodiments.
In view of the foregoing, a second aspect of the embodiments of the present invention provides an embodiment of an apparatus for enabling threads of a GPU to balance load. The GPU thread load balancing device comprises:
a processor; and
a memory storing program code executable by the processor, the program code when executed performing the steps of:
polling and accessing the computing unit and the loading storage unit of each thread to acquire the register state information of each thread and form a register state table;
analyzing the executing instruction type according to the register state information of each thread, determining the executing instruction type of each thread, and updating a historical storage load table;
and distributing new work tasks to the threads according to the register state table and the historical storage load table.
In some embodiments, polling access to compute units and load store units for each thread comprises:
sending a first access request to each computing unit and each loading storage unit in turn, wherein the first access request requires the computing units or the loading storage units to feed back a first data packet within preset time;
and sending a second access request to all the computing units and the load storage units which do not feed back the first data packet within the preset time, wherein the second access request requires the computing units or the load storage units to feed back a second data packet within the preset time, and the first data packet is the same as or different from the second data packet.
In some embodiments, taking the register state information of each thread and forming the register state table comprises:
acquiring busy state information of a computing unit and a load storage unit of each thread, wherein the busy state information comprises one of the following information: not busy, half busy indicating that the instruction runs to half, complete busy;
acquiring waiting state information of a computing unit and a loading storage unit of each thread, wherein the waiting state information comprises one of the following: waiting for the instruction to be sent, waiting for the stored data to be loaded and waiting for the result of the previous step;
acquiring deadlock state information of a computing unit and a loading storage unit of each thread, wherein the deadlock state information comprises one of the following information: no solution deadlock, invalid loading data acquisition and result loss in the previous step;
acquiring preparation state information of a computing unit and a load storage unit of each thread, wherein the preparation state information comprises one of the following: preparing to accept instructions, preparing to receive data and preparing to jump instructions;
the busy state information, wait state information, deadlock state information, and prepare state information for each thread are written to a register state table.
In some embodiments, determining the type of instructions executed by each thread and updating the history store load table comprises:
determining whether the execution instructions of the computing unit and the load storage unit occupy a long instruction with a clock period larger than a threshold value or a short instruction with a clock period smaller than the threshold value according to the busy state information, the waiting state information, the deadlock state information and the preparation state information of the computing unit and the load storage unit;
in response to determining that the computing unit and the load storage unit are executing the long instruction, adding one to the recorded numerical value of the access times of the long instruction of the register corresponding to the computing unit and the load storage unit which are executing the long instruction in the history storage load table;
and in response to determining that the computing unit and the load storage unit are executing the short instruction, adding one to the recorded numerical value of the access times of the register short instruction corresponding to the computing unit and the load storage unit which are executing the short instruction in the history storage load table.
In some embodiments, assigning new work tasks to the threads according to the register state table and the history storage load table comprises:
releasing the deadlock of any computing unit whose deadlock state is unresolvable deadlock, so as to restore the availability of the computing unit;
assigning new work tasks to the threads of computing units whose waiting state is waiting for an instruction to issue;
and assigning new work tasks to the threads of computing units whose preparation state is ready to accept an instruction.
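A minimal sketch of this assignment policy follows. Preferring units with the lightest recorded long-instruction load is an assumption (the patent only states that assignment uses both tables), and all string encodings and identifiers are hypothetical:

```python
def assign_tasks(register_table, history_table, tasks):
    """Release unresolvable deadlocks, then hand new tasks to units that are
    waiting for, or ready to accept, instructions."""
    assignments = {}
    eligible = []
    for tid, state in register_table.items():
        if state["deadlock"] == "unresolvable":
            state["deadlock"] = "none"   # deadlock released; unit available again
        if state["wait"] == "instruction_issue" or state["ready"] == "accept_instruction":
            eligible.append(tid)
    # Assumption: units with fewer recorded long instructions are less loaded
    # and receive work first.
    eligible.sort(key=lambda t: history_table.get(t, {}).get("long", 0))
    for tid, task in zip(eligible, tasks):
        assignments[tid] = task
    return assignments
```

A unit just released from deadlock would become eligible on the next polling round, once its refreshed state lands in the register state table.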
As can be seen from the foregoing embodiments, the GPU thread load balancing apparatus provided by the embodiments of the present invention accesses the computing unit and the load storage unit of each thread by polling to obtain the register state information of each thread and form a register state table; analyzes that register state information to determine the type of instruction each thread is executing and updates a history storage load table; and assigns new work tasks to each thread according to the register state table and the history storage load table. This technical scheme balances the load across the GPU's threads, improves working efficiency and stability, and prolongs the service life of the hardware.
It should be particularly noted that the above embodiment of the GPU thread load balancing apparatus uses the embodiments of the GPU thread load balancing method to describe the working process of each module, and those skilled in the art can readily apply these modules to other embodiments of the method. Of course, since the steps of the method embodiments may be interleaved, replaced, added, or deleted, GPU thread load balancing apparatuses based on such reasonable permutations and combinations also fall within the protection scope of the present invention, and the protection scope should not be limited to the described embodiments.
The foregoing are exemplary embodiments of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items. The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is merely exemplary, and is not intended to imply that the scope of the disclosure of the embodiments of the present invention, including the claims, is limited to these examples; within the concept of the embodiments of the invention, technical features of the above embodiments or of different embodiments may also be combined, and many other variations of the different aspects of the embodiments exist as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like made within the spirit and principles of the embodiments of the present invention are intended to be included within the protection scope of the embodiments of the present invention.
Claims (6)
1. A GPU thread load balancing method is characterized by comprising the following steps:
accessing a computing unit and a load storage unit of a GPU (graphics processing unit) based on a polling mode to acquire busy state information, waiting state information, deadlock state information, and preparation state information of the computing unit and the load storage unit, and writing the busy state information, waiting state information, deadlock state information, and preparation state information of the computing unit and the load storage unit into a register state table, wherein the busy state information comprises one of the following: not busy; half busy, indicating that an instruction has run to half completion; and fully busy; the waiting state information comprises one of: waiting for an instruction to issue, waiting for stored data to be loaded, and waiting for the result of a previous step; the deadlock state information comprises one of: unresolvable deadlock, invalid load data acquisition, and loss of the previous step's result; and the preparation state information comprises one of: ready to accept an instruction, ready to receive data, and ready for a jump instruction;
analyzing the type of instruction being executed by the computing unit and the load storage unit according to the busy state information, waiting state information, deadlock state information, and preparation state information of the computing unit and the load storage unit, and determining whether the instruction being executed by the computing unit and the load storage unit is a long instruction occupying more clock cycles than a threshold or a short instruction occupying fewer clock cycles than the threshold;
in response to determining that the computing unit and the load storage unit are executing long instructions, incrementing by one the recorded long-instruction access count of the registers corresponding to the computing unit and the load storage unit executing the long instructions in a history storage load table;
in response to determining that the computing unit and the load storage unit are executing short instructions, incrementing by one the recorded short-instruction access count of the registers corresponding to the computing unit and the load storage unit executing the short instructions in the history storage load table;
and assigning new work tasks to each thread according to the register state table and the history storage load table.
2. The method of claim 1, wherein accessing the computing unit and the load storage unit of the GPU based on the polling mode comprises:
sending a first access request to each computing unit and each load storage unit in turn, wherein the first access request requires the computing units or the load storage units to feed back first data packets within a preset time;
sending a second access request to all the computing units and the load storage units which do not feed back the first data packet within the preset time, wherein the second access request requires the computing units or the load storage units to feed back a second data packet within the preset time, and the first data packet is the same as or different from the second data packet.
3. The method of claim 1, wherein assigning new work tasks to the threads according to the register state table and the history storage load table comprises:
releasing the deadlock of the computing unit whose deadlock state is unresolvable deadlock, so as to restore the availability of the computing unit;
assigning a new work task to the thread of the computing unit whose waiting state is waiting for an instruction to issue;
and assigning a new work task to the thread of the computing unit whose preparation state is ready to accept an instruction.
4. A GPU thread load balancing apparatus, comprising:
a processor; and
a memory storing program code executable by the processor, the program code when executed performing the steps of:
accessing a computing unit and a load storage unit of a GPU (graphics processing unit) based on a polling mode to acquire busy state information, waiting state information, deadlock state information, and preparation state information of the computing unit and the load storage unit, and writing the busy state information, waiting state information, deadlock state information, and preparation state information of the computing unit and the load storage unit into a register state table, wherein the busy state information comprises one of the following: not busy; half busy, indicating that an instruction has run to half completion; and fully busy; the waiting state information comprises one of: waiting for an instruction to issue, waiting for stored data to be loaded, and waiting for the result of a previous step; the deadlock state information comprises one of: unresolvable deadlock, invalid load data acquisition, and loss of the previous step's result; and the preparation state information comprises one of: ready to accept an instruction, ready to receive data, and ready for a jump instruction;
analyzing the type of instruction being executed by the computing unit and the load storage unit according to the busy state information, waiting state information, deadlock state information, and preparation state information of the computing unit and the load storage unit, and determining whether the instruction being executed by the computing unit and the load storage unit is a long instruction occupying more clock cycles than a threshold or a short instruction occupying fewer clock cycles than the threshold;
in response to determining that the computing unit and the load storage unit are executing long instructions, incrementing by one the recorded long-instruction access count of the registers corresponding to the computing unit and the load storage unit executing the long instructions in a history storage load table;
in response to determining that the computing unit and the load storage unit are executing short instructions, incrementing by one the recorded short-instruction access count of the registers corresponding to the computing unit and the load storage unit executing the short instructions in the history storage load table;
and assigning new work tasks to each thread according to the register state table and the history storage load table.
5. The apparatus of claim 4, wherein accessing the computing unit and the load storage unit of the GPU based on the polling mode comprises:
sending a first access request to each computing unit and each load storage unit in turn, wherein the first access request requires the computing units or the load storage units to feed back first data packets within a preset time;
sending a second access request to all the computing units and the load storage units which do not feed back the first data packet within the preset time, wherein the second access request requires the computing units or the load storage units to feed back a second data packet within the preset time, and the first data packet is the same as or different from the second data packet.
6. The apparatus of claim 4, wherein assigning new work tasks to the threads according to the register state table and the history storage load table comprises:
releasing the deadlock of the computing unit whose deadlock state is unresolvable deadlock, so as to restore the availability of the computing unit;
assigning a new work task to the thread of the computing unit whose waiting state is waiting for an instruction to issue;
and assigning a new work task to the thread of the computing unit whose preparation state is ready to accept an instruction.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911086251.5A CN111078394B (en) | 2019-11-08 | 2019-11-08 | GPU thread load balancing method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911086251.5A CN111078394B (en) | 2019-11-08 | 2019-11-08 | GPU thread load balancing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111078394A CN111078394A (en) | 2020-04-28 |
CN111078394B true CN111078394B (en) | 2022-12-06 |
Family
ID=70310724
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911086251.5A Active CN111078394B (en) | 2019-11-08 | 2019-11-08 | GPU thread load balancing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111078394B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112256435B (en) * | 2020-11-03 | 2023-05-05 | 成都海光微电子技术有限公司 | Method for assigning work groups for graphics processor and graphics processor |
US20220413911A1 (en) * | 2021-06-29 | 2022-12-29 | International Business Machines Corporation | Routing instructions in a microprocessor |
CN115168058B (en) * | 2022-09-06 | 2022-11-25 | 深流微智能科技(深圳)有限公司 | Thread load balancing method, device, equipment and storage medium |
CN116820786B (en) * | 2023-08-31 | 2023-12-19 | 本原数据(北京)信息技术有限公司 | Data access method and device of database, electronic equipment and storage medium |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7185338B2 (en) * | 2002-10-15 | 2007-02-27 | Sun Microsystems, Inc. | Processor with speculative multithreading and hardware to support multithreading software |
KR20150019349A (en) * | 2013-08-13 | 2015-02-25 | 삼성전자주식회사 | Multiple threads execution processor and its operating method |
CN103871032A (en) * | 2014-03-07 | 2014-06-18 | 福建工程学院 | Image enhancement method for Wallis filter based on GPU (Graphics Processing Unit) |
US10102031B2 (en) * | 2015-05-29 | 2018-10-16 | Qualcomm Incorporated | Bandwidth/resource management for multithreaded processors |
CN109032793B (en) * | 2018-07-11 | 2021-03-16 | Oppo广东移动通信有限公司 | Resource allocation method, device, terminal and storage medium |
CN109947569B (en) * | 2019-03-15 | 2021-04-06 | Oppo广东移动通信有限公司 | Method, device, terminal and storage medium for binding core |
- 2019-11-08 CN CN201911086251.5A patent/CN111078394B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN111078394A (en) | 2020-04-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111078394B (en) | GPU thread load balancing method and device | |
CN109582455B (en) | Multithreading task processing method and device and storage medium | |
CN106502791B (en) | A kind of method for allocating tasks and device | |
US8219993B2 (en) | Frequency scaling of processing unit based on aggregate thread CPI metric | |
JP5040773B2 (en) | Memory buffer allocation device and program | |
US8914805B2 (en) | Rescheduling workload in a hybrid computing environment | |
US8739171B2 (en) | High-throughput-computing in a hybrid computing environment | |
JP4292198B2 (en) | Method for grouping execution threads | |
US11163677B2 (en) | Dynamically allocated thread-local storage | |
US20110307903A1 (en) | Soft partitions and load balancing | |
CN110308982B (en) | Shared memory multiplexing method and device | |
CN103197916A (en) | Methods and apparatus for source operand collector caching | |
US20110161965A1 (en) | Job allocation method and apparatus for a multi-core processor | |
JP2010079622A (en) | Multi-core processor system and task control method thereof | |
US20130097382A1 (en) | Multi-core processor system, computer product, and control method | |
CN109840149B (en) | Task scheduling method, device, equipment and storage medium | |
US10545890B2 (en) | Information processing device, information processing method, and program | |
KR100883655B1 (en) | System and method for switching context in reconfigurable processor | |
CN103543989A (en) | Adaptive parallel processing method aiming at variable length characteristic extraction for big data | |
CN112114967B (en) | GPU resource reservation method based on service priority | |
US20060100986A1 (en) | Task switching | |
EP3495960A1 (en) | Program, apparatus, and method for communicating data between parallel processor cores | |
US20120158651A1 (en) | Configuration of asynchronous message processing in dataflow networks | |
CN113835852B (en) | Task data scheduling method and device | |
CN116414541B (en) | Task execution method and device compatible with multiple task working modes |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||