CN111078394B - GPU thread load balancing method and device - Google Patents


Info

Publication number
CN111078394B
Authority
CN
China
Prior art keywords
state information
load
computing unit
storage unit
instructions
Prior art date
Legal status
Active
Application number
CN201911086251.5A
Other languages
Chinese (zh)
Other versions
CN111078394A (en)
Inventor
王凯 (Wang Kai)
周玉龙 (Zhou Yulong)
Current Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN201911086251.5A
Publication of CN111078394A
Application granted
Publication of CN111078394B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining

Abstract

The invention discloses a GPU thread load balancing method and device, comprising the following steps: polling the computing unit and load storage unit of each thread to acquire the register state information of each thread and form a register state table; analyzing the register state information of each thread to determine the type of instruction each thread is executing, and updating a history storage load table; and distributing new work tasks to the threads according to the register state table and the history storage load table. The invention can balance the load across the threads of the GPU, improve working efficiency and stability, and prolong the service life of the hardware.

Description

GPU thread load balancing method and device
Technical Field
The present invention relates to the field of load balancing, and more particularly, to a method and an apparatus for load balancing of GPU threads.
Background
Massively parallel processors such as GPUs (graphics processing units) stack a large number of computing units spatially and improve computing performance by increasing parallelism. However, because of fundamental architectural differences between the GPU and the CPU (central processing unit), the GPU must continually refine its hardware architecture design and improve its scheduling mechanisms and strategies so that computing resources are fully utilized and excessive hardware overhead is avoided. In prior-art GPU structures, the load of thread computing tasks is unbalanced, so some computing units stay busy or idle for long periods, which reduces computing efficiency, or a large amount of computing resources is wasted due to deadlock; moreover, the load storage units lack corresponding access optimization, which easily causes unbalanced hardware wear.
No effective solution is currently available for the problem of uneven hardware resource load in prior-art GPUs.
Disclosure of Invention
In view of this, an object of the embodiments of the present invention is to provide a GPU thread load balancing method and apparatus that can balance the load across the threads of a GPU, improve working efficiency and stability, and prolong the service life of the hardware.
Based on the above objectives, a first aspect of the embodiments of the present invention provides a GPU thread load balancing method, including the following steps:
polling the computing unit and load storage unit of each thread to acquire the register state information of each thread and form a register state table;
analyzing the register state information of each thread to determine the type of instruction each thread is executing, and updating a history storage load table;
and distributing new work tasks to the threads according to the register state table and the history storage load table.
In some embodiments, polling the computing unit and load storage unit of each thread comprises:
sending a first access request to each computing unit and each load storage unit in turn, wherein the first access request requires the computing unit or load storage unit to feed back a first data packet within a preset time;
and sending a second access request to all computing units and load storage units that did not feed back the first data packet within the preset time, wherein the second access request requires the computing unit or load storage unit to feed back a second data packet within the preset time, and the first data packet is the same as or different from the second data packet.
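The two-request polling scheme above can be sketched in software. The following Python model is illustrative only: the `Unit` class, the `request_status` method, and the modeling of a missed deadline as `None` are assumptions for the sketch, not part of the claimed hardware.

```python
from dataclasses import dataclass

@dataclass
class Unit:
    """A compute unit (CU) or load-store unit (LSU) with a status register."""
    name: str
    responsive: bool = True  # whether it replies within the preset time

    def request_status(self):
        # Returns a status packet, or None to model a missed deadline.
        return {"unit": self.name, "status": "ok"} if self.responsive else None

def poll(units):
    """First pass sends an access request to every unit in turn; a second
    pass re-requests only the units that failed to reply in time."""
    packets = {}
    missed = []
    for u in units:                      # first access request
        pkt = u.request_status()
        if pkt is None:
            missed.append(u)
        else:
            packets[u.name] = pkt
    for u in missed:                     # second access request
        packets[u.name] = u.request_status()  # may still be None
    return packets

# 16 CUs and 8 LSUs, matching the embodiment described later.
units = [Unit(f"CU{i}") for i in range(16)] + [Unit(f"LSU{i}") for i in range(8)]
units[3].responsive = False              # simulate one slow unit
result = poll(units)
```

A unit that misses both deadlines simply ends up with an empty entry; the hardware would presumably treat it as inaccessible and revisit it on the next polling round.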
In some embodiments, acquiring the register state information of each thread and forming the register state table comprises:
acquiring busy state information of the computing unit and load storage unit of each thread, wherein the busy state information comprises one of the following: not busy; half busy, indicating that the instruction has run halfway; and fully busy;
acquiring waiting state information of the computing unit and load storage unit of each thread, wherein the waiting state information comprises one of the following: waiting for an instruction to be sent, waiting for load storage data, and waiting for the result of the previous step;
acquiring deadlock state information of the computing unit and load storage unit of each thread, wherein the deadlock state information comprises one of the following: unresolved deadlock, acquisition of invalid load data, and loss of the result of the previous step;
acquiring preparation state information of the computing unit and load storage unit of each thread, wherein the preparation state information comprises one of the following: ready to accept an instruction, ready to receive data, and ready for a jump instruction;
and writing the busy state information, waiting state information, deadlock state information, and preparation state information of each thread into the register state table.
In some embodiments, determining the type of instruction each thread is executing and updating the history storage load table comprises:
determining, according to the busy state information, waiting state information, deadlock state information, and preparation state information of the computing unit and load storage unit, whether the instruction being executed is a long instruction occupying more clock cycles than a threshold or a short instruction occupying fewer clock cycles than the threshold;
in response to determining that a computing unit or load storage unit is executing a long instruction, incrementing by one the long-instruction access count recorded in the history storage load table for the register corresponding to that unit;
and in response to determining that a computing unit or load storage unit is executing a short instruction, incrementing by one the short-instruction access count recorded in the history storage load table for the register corresponding to that unit.
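The long/short classification and counter update described above can be modeled in a few lines. The threshold of 8 clock cycles and the per-register dictionary layout are illustrative assumptions; the patent does not fix a threshold value.

```python
# Hypothetical clock-cycle threshold separating long from short instructions.
LONG_INSTR_THRESHOLD = 8

def update_hslt(hslt, register_id, instr_cycles):
    """Increment the long- or short-instruction access count recorded in the
    history storage load table (HSLT) for the given register."""
    entry = hslt.setdefault(register_id, {"long": 0, "short": 0})
    if instr_cycles > LONG_INSTR_THRESHOLD:
        entry["long"] += 1
    else:
        entry["short"] += 1

hslt = {}
update_hslt(hslt, "CU0", 12)   # long instruction observed on CU0
update_hslt(hslt, "CU0", 2)    # short instruction observed on CU0
update_hslt(hslt, "LSU1", 20)  # long instruction observed on LSU1
```

Over many polling rounds these counters accumulate the execution history that the scheduler later consults when distributing new work tasks.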
In some embodiments, distributing new work tasks to the threads according to the register state table and the history storage load table comprises:
resolving the deadlock of any computing unit whose deadlock state is an unresolved deadlock, so as to restore the availability of the computing unit;
distributing new work tasks to the threads of computing units whose waiting state is waiting for an instruction to be sent;
and distributing new work tasks to the threads of computing units whose preparation state is ready to accept an instruction.
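The priority order above (resolve deadlocks first, then feed waiting units, then ready units) can be sketched as a simple selection loop. The state names and the flat dictionary used for the register state table are illustrative assumptions for the sketch.

```python
def pick_thread(register_state_table):
    """Choose where to act next, in the priority order described above:
    unresolved deadlocks first, then units waiting for an instruction,
    then units that are ready to accept an instruction."""
    priorities = [
        ("deadlock_unresolved", "resolve_deadlock"),
        ("waiting_for_instruction", "assign_task"),
        ("ready_to_accept", "assign_task"),
    ]
    for state, action in priorities:
        for thread, info in register_state_table.items():
            if info.get("state") == state:
                return thread, action
    return None, None  # nothing actionable this round

rst = {"CU0": {"state": "ready_to_accept"},
       "CU1": {"state": "deadlock_unresolved"},
       "CU2": {"state": "waiting_for_instruction"}}
thread, action = pick_thread(rst)
```

Even though CU0 and CU2 could accept work immediately, the deadlocked CU1 is handled first, matching the allocation priority described in the detailed embodiment.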
A second aspect of the present invention provides a GPU thread load balancing apparatus, including:
a processor; and
a memory storing program code executable by the processor, the program code when executed performing the steps of:
polling the computing unit and load storage unit of each thread to acquire the register state information of each thread and form a register state table;
analyzing the register state information of each thread to determine the type of instruction each thread is executing, and updating a history storage load table;
and distributing new work tasks to the threads according to the register state table and the history storage load table.
In some embodiments, polling the computing unit and load storage unit of each thread comprises:
sending a first access request to each computing unit and each load storage unit in turn, wherein the first access request requires the computing unit or load storage unit to feed back a first data packet within a preset time;
and sending a second access request to all computing units and load storage units that did not feed back the first data packet within the preset time, wherein the second access request requires the computing unit or load storage unit to feed back a second data packet within the preset time, and the first data packet is the same as or different from the second data packet.
In some embodiments, acquiring the register state information of each thread and forming the register state table comprises:
acquiring busy state information of the computing unit and load storage unit of each thread, wherein the busy state information comprises one of the following: not busy; half busy, indicating that the instruction has run halfway; and fully busy;
acquiring waiting state information of the computing unit and load storage unit of each thread, wherein the waiting state information comprises one of the following: waiting for an instruction to be sent, waiting for load storage data, and waiting for the result of the previous step;
acquiring deadlock state information of the computing unit and load storage unit of each thread, wherein the deadlock state information comprises one of the following: unresolved deadlock, acquisition of invalid load data, and loss of the result of the previous step;
acquiring preparation state information of the computing unit and load storage unit of each thread, wherein the preparation state information comprises one of the following: ready to accept an instruction, ready to receive data, and ready for a jump instruction;
and writing the busy state information, waiting state information, deadlock state information, and preparation state information of each thread into the register state table.
In some embodiments, determining the type of instruction each thread is executing and updating the history storage load table comprises:
determining, according to the busy state information, waiting state information, deadlock state information, and preparation state information of the computing unit and load storage unit, whether the instruction being executed is a long instruction occupying more clock cycles than a threshold or a short instruction occupying fewer clock cycles than the threshold;
in response to determining that a computing unit or load storage unit is executing a long instruction, incrementing by one the long-instruction access count recorded in the history storage load table for the register corresponding to that unit;
and in response to determining that a computing unit or load storage unit is executing a short instruction, incrementing by one the short-instruction access count recorded in the history storage load table for the register corresponding to that unit.
In some embodiments, distributing new work tasks to the threads according to the register state table and the history storage load table comprises:
resolving the deadlock of any computing unit whose deadlock state is an unresolved deadlock, so as to restore the availability of the computing unit;
distributing new work tasks to the threads of computing units whose waiting state is waiting for an instruction to be sent;
and distributing new work tasks to the threads of computing units whose preparation state is ready to accept an instruction.
The invention has the following beneficial technical effects: the GPU thread load balancing method and device provided by the embodiments of the invention poll the computing unit and load storage unit of each thread to acquire the register state information of each thread and form a register state table; analyze the register state information of each thread to determine the type of instruction each thread is executing, and update a history storage load table; and distribute new work tasks to the threads according to the register state table and the history storage load table. This technical scheme balances the load across the threads of the GPU, improves working efficiency and stability, and prolongs the service life of the hardware.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a GPU thread load balancing method according to the present invention;
FIG. 2 is a hardware schematic diagram of a GPU thread load balancing method according to the present invention;
FIG. 3 is an example of a register state table of the GPU thread load balancing method provided in the present invention;
FIG. 4 is an example of a historical storage load table of the GPU thread load balancing method provided by the present invention;
fig. 5 is a flowchart of a GPU thread load balancing method provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two entities or parameters with the same name. "First" and "second" are used merely for convenience of description, should not be construed as limiting the embodiments of the present invention, and subsequent embodiments do not repeat this note.
In view of the foregoing, a first aspect of the embodiments of the present invention provides an embodiment of a method for enabling threads of a GPU to balance load. Fig. 1 is a schematic flowchart illustrating a GPU thread load balancing method according to the present invention.
The GPU thread load balancing method, as shown in fig. 1, includes the following steps:
step S101: polling and accessing the computing unit and the loading storage unit of each thread to acquire the register state information of each thread and form a register state table;
step S103: analyzing the executing instruction type according to the register state information of each thread, determining the executing instruction type of each thread, and updating a historical storage load table;
step S105: and distributing new work tasks to the threads according to the register state table and the historical storage load table.
The embodiment of the invention realizes an FPGA-based GPU load-balancing optimization design method, achieving load balance of computing tasks and memory access tasks, clearing deadlock states in time, and reducing wear on the storage units during memory access.
It will be understood by those skilled in the art that all or part of the processes in the methods of the embodiments described above may be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM). Embodiments of the computer program may achieve effects that are the same as or similar to those of any corresponding method embodiment described above.
In some embodiments, polling the computing unit and load storage unit of each thread comprises:
sending a first access request to each computing unit and each load storage unit in turn, wherein the first access request requires the computing unit or load storage unit to feed back a first data packet within a preset time;
and sending a second access request to all computing units and load storage units that did not feed back the first data packet within the preset time, wherein the second access request requires the computing unit or load storage unit to feed back a second data packet within the preset time, and the first data packet is the same as or different from the second data packet.
In some embodiments, acquiring the register state information of each thread and forming the register state table comprises:
acquiring busy state information of the computing unit and load storage unit of each thread, wherein the busy state information comprises one of the following: not busy; half busy, indicating that the instruction has run halfway; and fully busy;
acquiring waiting state information of the computing unit and load storage unit of each thread, wherein the waiting state information comprises one of the following: waiting for an instruction to be sent, waiting for load storage data, and waiting for the result of the previous step;
acquiring deadlock state information of the computing unit and load storage unit of each thread, wherein the deadlock state information comprises one of the following: unresolved deadlock, acquisition of invalid load data, and loss of the result of the previous step;
acquiring preparation state information of the computing unit and load storage unit of each thread, wherein the preparation state information comprises one of the following: ready to accept an instruction, ready to receive data, and ready for a jump instruction;
and writing the busy state information, waiting state information, deadlock state information, and preparation state information of each thread into the register state table.
In some embodiments, determining the type of instruction each thread is executing and updating the history storage load table comprises:
determining, according to the busy state information, waiting state information, deadlock state information, and preparation state information of the computing unit and load storage unit, whether the instruction being executed is a long instruction occupying more clock cycles than a threshold or a short instruction occupying fewer clock cycles than the threshold;
in response to determining that a computing unit or load storage unit is executing a long instruction, incrementing by one the long-instruction access count recorded in the history storage load table for the register corresponding to that unit;
and in response to determining that a computing unit or load storage unit is executing a short instruction, incrementing by one the short-instruction access count recorded in the history storage load table for the register corresponding to that unit.
In some embodiments, distributing new work tasks to the threads according to the register state table and the history storage load table comprises:
resolving the deadlock of any computing unit whose deadlock state is an unresolved deadlock, so as to restore the availability of the computing unit;
distributing new work tasks to the threads of computing units whose waiting state is waiting for an instruction to be sent;
and distributing new work tasks to the threads of computing units whose preparation state is ready to accept an instruction.
The method disclosed according to an embodiment of the present invention may also be implemented as a computer program executed by a CPU, which may be stored in a computer-readable storage medium. The computer program, when executed by the CPU, performs the above-described functions defined in the method disclosed in the embodiments of the present invention. The above-described method steps and system elements may also be implemented using a controller and a computer-readable storage medium for storing a computer program for causing the controller to implement the functions of the above-described steps or elements.
The following further illustrates an embodiment of the present invention with reference to the flowchart shown in fig. 5.
Fig. 2 shows a hardware schematic of an embodiment of the invention, where CU0-CU15 are 16 computing units and LSU0-LSU7 are 8 load storage units. The RST (register state table) and HSLT (history storage load table) are obtained through polling access, state acquisition, and instruction analysis of the 16 computing units and 8 load storage units. Polling access means accessing the status registers of CU0-CU15 in turn, skipping any register that cannot be accessed, and finally accessing the skipped registers again. State acquisition means reading the data of the status registers, forming 32-bit packets, and writing them into the RAM. Instruction analysis means dividing instructions into long instructions and short instructions according to the number of clock cycles each instruction requires.
The RST is stored in RAM as shown in fig. 2, and fig. 3 shows an example of an RST stored in RAM, holding the register state information of the 16 computing units and 8 load storage units. The register state information format of each register is as follows:

Bit:    7:6   5:4   3:2   1:0
Field:  BUSY  PREP  DEAD  READY
Status BUSY (2 bits): 00-not busy; 01-half busy (instruction has run halfway); 10-fully busy; 11-reserved;
Status PREP (2 bits): 00-waiting for an instruction to be sent; 01-waiting for load storage data; 10-waiting for the result of the previous step; 11-reserved;
Status DEAD (2 bits): 00-unresolved deadlock; 01-acquisition of invalid load data; 10-loss of the result of the previous step; 11-reserved;
Status READY (2 bits): 00-ready to accept an instruction; 01-ready to receive data; 10-ready for a jump instruction; 11-reserved.
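The bit layout above can be exercised with a small encode/decode sketch. The function names are illustrative assumptions, but the field positions follow the table given in the description (bits 7:6 BUSY, 5:4 PREP, 3:2 DEAD, 1:0 READY).

```python
# Field layout from the table above: bit offset of each 2-bit field.
FIELDS = {"busy": 6, "prep": 4, "dead": 2, "ready": 0}

def encode_status(busy, prep, dead, ready):
    """Pack the four 2-bit state fields into one status byte."""
    for v in (busy, prep, dead, ready):
        assert 0 <= v <= 3, "each field is 2 bits wide"
    return (busy << 6) | (prep << 4) | (dead << 2) | ready

def decode_status(byte):
    """Unpack a status byte back into its four 2-bit fields."""
    return {name: (byte >> shift) & 0b11 for name, shift in FIELDS.items()}

# A unit that is half busy (BUSY=01) and waiting for the result of the
# previous step (PREP=10), with DEAD and READY both 00.
word = encode_status(busy=0b01, prep=0b10, dead=0b00, ready=0b00)
fields = decode_status(word)
```

In the described hardware, four such bytes would presumably be packed into each 32-bit packet written to RAM during state acquisition; that packing is not specified here.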
An example of the HSLT is shown in fig. 4. The HSLT records the instruction execution history of each register; consulting this history data allows new work tasks to be allocated more effectively. The allocation priority is as follows: first, handle computing units in the deadlock state by reading the state register and clearing the deadlock in time, restoring the working capacity of the CU; second, for registers entering the waiting state, insert computing tasks in time; third, when a register is in the ready state, schedule tasks across priorities in time to allocate more task queues to idle CUs.
As can be seen from the foregoing embodiments, the GPU thread load balancing method provided in the embodiments of the present invention polls the computing unit and load storage unit of each thread to acquire the register state information of each thread and form a register state table; analyzes the register state information of each thread to determine the type of instruction each thread is executing, and updates a history storage load table; and distributes new work tasks to the threads according to the register state table and the history storage load table. This technical scheme balances the load across the threads of the GPU, improves working efficiency and stability, and prolongs the service life of the hardware.
It should be particularly noted that the steps in the embodiments of the GPU thread load balancing method described above may be interchanged, replaced, added, or deleted; such reasonable permutations and transformations also belong to the scope of the present invention, and the scope of protection should not be limited to the described embodiments.
In view of the foregoing, a second aspect of the embodiments of the present invention provides an embodiment of an apparatus for enabling threads of a GPU to balance load. The GPU thread load balancing device comprises:
a processor; and
a memory storing program code executable by the processor, the program code when executed performing the steps of:
polling the computing unit and load storage unit of each thread to acquire the register state information of each thread and form a register state table;
analyzing the register state information of each thread to determine the type of instruction each thread is executing, and updating a history storage load table;
and distributing new work tasks to the threads according to the register state table and the history storage load table.
In some embodiments, polling the computing unit and load storage unit of each thread comprises:
sending a first access request to each computing unit and each load storage unit in turn, wherein the first access request requires the computing unit or load storage unit to feed back a first data packet within a preset time;
and sending a second access request to all computing units and load storage units that did not feed back the first data packet within the preset time, wherein the second access request requires the computing unit or load storage unit to feed back a second data packet within the preset time, and the first data packet is the same as or different from the second data packet.
In some embodiments, acquiring the register state information of each thread and forming the register state table comprises:
acquiring busy state information of the computing unit and load storage unit of each thread, wherein the busy state information comprises one of the following: not busy; half busy, indicating that the instruction has run halfway; and fully busy;
acquiring waiting state information of the computing unit and load storage unit of each thread, wherein the waiting state information comprises one of the following: waiting for an instruction to be sent, waiting for load storage data, and waiting for the result of the previous step;
acquiring deadlock state information of the computing unit and load storage unit of each thread, wherein the deadlock state information comprises one of the following: unresolved deadlock, acquisition of invalid load data, and loss of the result of the previous step;
acquiring preparation state information of the computing unit and load storage unit of each thread, wherein the preparation state information comprises one of the following: ready to accept an instruction, ready to receive data, and ready for a jump instruction;
and writing the busy state information, waiting state information, deadlock state information, and preparation state information of each thread into the register state table.
In some embodiments, determining the type of instruction each thread is executing and updating the history storage load table comprises:
determining, according to the busy state information, waiting state information, deadlock state information, and preparation state information of the computing unit and load storage unit, whether the instruction being executed is a long instruction occupying more clock cycles than a threshold or a short instruction occupying fewer clock cycles than the threshold;
in response to determining that a computing unit or load storage unit is executing a long instruction, incrementing by one the long-instruction access count recorded in the history storage load table for the register corresponding to that unit;
and in response to determining that a computing unit or load storage unit is executing a short instruction, incrementing by one the short-instruction access count recorded in the history storage load table for the register corresponding to that unit.
In some embodiments, assigning new work tasks to the threads according to the register state table and the history storage load table comprises:
releasing the deadlock of any computing unit whose deadlock state is unsolvable deadlock, so as to restore the availability of the computing unit;
distributing new work tasks to the threads of the computing units whose waiting state is waiting for an instruction to be issued;
and assigning new work tasks to the threads of the computing units whose preparation state is ready to accept an instruction.
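The three assignment rules above can be sketched together. The string-valued states and the `assign_new_tasks` helper are assumptions for this example, and deadlock release is modelled simply as clearing a flag rather than as a hardware operation.

```python
def assign_new_tasks(state_table, tasks):
    """Dispatch queued tasks to threads able to take work: those waiting
    for an instruction to be issued, or ready to accept an instruction.
    A thread stuck in an unsolvable deadlock is first released (modelled
    here as clearing the deadlock flag) so it becomes available again."""
    assignments = {}
    for tid, st in state_table.items():
        if st["deadlock"] == "unsolvable":
            st["deadlock"] = "none"  # release deadlock to restore availability
        can_take = (st["waiting"] == "instruction_issue"
                    or st["ready"] == "accept_instruction")
        if can_take and tasks:
            assignments[tid] = tasks.pop(0)
    return assignments
```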
As can be seen from the foregoing embodiments, the GPU thread load balancing apparatus provided in the embodiments of the present invention acquires the register state information of each thread by polling the computing unit and the load storage unit of each thread, and forms a register state table; analyzes the type of the instruction being executed according to the register state information of each thread, and updates the history storage load table; and distributes new work tasks to each thread according to the register state table and the history storage load table. This technical solution enables each GPU thread to balance its load, improving work efficiency and stability and prolonging the service life of the hardware.
It should be particularly noted that the above embodiment of the GPU thread load balancing apparatus uses the embodiment of the GPU thread load balancing method to describe the working process of each module in detail, and those skilled in the art can readily conceive of applying these modules to other embodiments of the GPU thread load balancing method. Of course, since the steps in the GPU thread load balancing method embodiment may be intersected, replaced, added, or deleted, GPU thread load balancing apparatuses derived from such reasonable permutations and combinations also fall within the protection scope of the present invention, and the protection scope should not be limited to the embodiments described.
The foregoing are exemplary embodiments of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items. The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is meant to be exemplary only, and is not intended to imply that the scope of the disclosure of the embodiments of the invention, including the claims, is limited to these examples; within the spirit of the embodiments of the invention, technical features in the above embodiments or in different embodiments may also be combined, and many other variations of the different aspects of the embodiments of the invention exist as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (6)

1. A GPU thread load balancing method is characterized by comprising the following steps:
accessing a computing unit and a load storage unit of a GPU (Graphics Processing Unit) in a polling manner to acquire busy state information, waiting state information, deadlock state information, and preparation state information of the computing unit and the load storage unit, and writing the busy state information, the waiting state information, the deadlock state information, and the preparation state information of the computing unit and the load storage unit into a register state table, wherein the busy state information comprises one of the following: not busy, half busy indicating that the instruction has run halfway, and fully busy; the waiting state information comprises one of the following: waiting for an instruction to be issued, waiting for stored data to be loaded, and waiting for the result of a previous step; the deadlock state information comprises one of the following: unsolvable deadlock, acquisition of invalid load data, and loss of the result of the previous step; and the preparation state information comprises one of the following: ready to accept an instruction, ready to receive data, and ready for a jump instruction;
analyzing the type of the instruction being executed by the computing unit and the load storage unit according to the busy state information, the waiting state information, the deadlock state information, and the preparation state information of the computing unit and the load storage unit, and determining whether the instruction being executed by the computing unit and the load storage unit is a long instruction occupying more clock cycles than a threshold or a short instruction occupying fewer clock cycles than the threshold;
in response to determining that the computing unit and the load storage unit are executing a long instruction, incrementing by one the long-instruction access count recorded in a history storage load table for the register corresponding to the computing unit and the load storage unit executing the long instruction;
in response to determining that the computing unit and the load storage unit are executing a short instruction, incrementing by one the short-instruction access count recorded in the history storage load table for the register corresponding to the computing unit and the load storage unit executing the short instruction;
and distributing new work tasks to each thread according to the register state table and the history storage load table.
2. The method of claim 1, wherein accessing the computing unit and the load storage unit of the GPU in a polling manner comprises:
sending a first access request to each computing unit and each load storage unit in turn, wherein the first access request requires the computing units or the load storage units to feed back first data packets within a preset time;
sending a second access request to all the computing units and the load storage units which do not feed back the first data packet within the preset time, wherein the second access request requires the computing units or the load storage units to feed back a second data packet within the preset time, and the first data packet is the same as or different from the second data packet.
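The two-phase polling of claim 2 can be sketched as follows; `poll_units`, the `request_fn` callback, and returning `None` on timeout are assumptions introduced for illustration, not part of the claimed hardware interface.

```python
def poll_units(unit_ids, request_fn, timeout):
    """Two-phase poll: send a first access request to every computing unit
    and load storage unit in turn; any unit that does not feed back a
    packet within the timeout receives one second access request."""
    responses = {}
    pending = []
    for uid in unit_ids:
        pkt = request_fn(uid, attempt=1, timeout=timeout)
        if pkt is not None:
            responses[uid] = pkt
        else:
            pending.append(uid)
    # Second access request, only for units that failed to respond in time.
    for uid in pending:
        pkt = request_fn(uid, attempt=2, timeout=timeout)
        if pkt is not None:
            responses[uid] = pkt
    return responses
```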
3. The method of claim 1, wherein assigning new work tasks to the threads according to the register state table and the history storage load table comprises:
releasing the deadlock of a computing unit whose deadlock state is unsolvable deadlock, to restore the availability of the computing unit;
distributing a new work task to a computing unit whose waiting state is waiting for an instruction to be issued;
and assigning new work tasks to the computing units whose preparation state is ready to accept an instruction.
4. A GPU thread load balancing apparatus, comprising:
a processor; and
a memory storing program code executable by the processor, the program code when executed performing the steps of:
accessing a computing unit and a load storage unit of a GPU (Graphics Processing Unit) in a polling manner to acquire busy state information, waiting state information, deadlock state information, and preparation state information of the computing unit and the load storage unit, and writing the busy state information, the waiting state information, the deadlock state information, and the preparation state information of the computing unit and the load storage unit into a register state table, wherein the busy state information comprises one of the following: not busy, half busy indicating that the instruction has run halfway, and fully busy; the waiting state information comprises one of the following: waiting for an instruction to be issued, waiting for stored data to be loaded, and waiting for the result of a previous step; the deadlock state information comprises one of the following: unsolvable deadlock, acquisition of invalid load data, and loss of the result of the previous step; and the preparation state information comprises one of the following: ready to accept an instruction, ready to receive data, and ready for a jump instruction;
analyzing the type of the instruction being executed by the computing unit and the load storage unit according to the busy state information, the waiting state information, the deadlock state information, and the preparation state information of the computing unit and the load storage unit, and determining whether the instruction being executed by the computing unit and the load storage unit is a long instruction occupying more clock cycles than a threshold or a short instruction occupying fewer clock cycles than the threshold;
in response to determining that the computing unit and the load storage unit are executing a long instruction, incrementing by one the long-instruction access count recorded in a history storage load table for the register corresponding to the computing unit and the load storage unit executing the long instruction;
in response to determining that the computing unit and the load storage unit are executing a short instruction, incrementing by one the short-instruction access count recorded in the history storage load table for the register corresponding to the computing unit and the load storage unit executing the short instruction;
and distributing new work tasks to each thread according to the register state table and the history storage load table.
5. The apparatus of claim 4, wherein accessing the computing unit and the load storage unit of the GPU in a polling manner comprises:
sending a first access request to each computing unit and each load storage unit in turn, wherein the first access request requires the computing units or the load storage units to feed back first data packets within a preset time;
sending a second access request to all the computing units and the load storage units which do not feed back the first data packet within the preset time, wherein the second access request requires the computing units or the load storage units to feed back a second data packet within the preset time, and the first data packet is the same as or different from the second data packet.
6. The apparatus of claim 4, wherein assigning new work tasks to the threads according to the register state table and the history storage load table comprises:
releasing the deadlock of a computing unit whose deadlock state is unsolvable deadlock, to restore the availability of the computing unit;
distributing a new work task to a computing unit whose waiting state is waiting for an instruction to be issued;
and assigning new work tasks to the computing units whose preparation state is ready to accept an instruction.
CN201911086251.5A 2019-11-08 2019-11-08 GPU thread load balancing method and device Active CN111078394B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911086251.5A CN111078394B (en) 2019-11-08 2019-11-08 GPU thread load balancing method and device


Publications (2)

Publication Number Publication Date
CN111078394A CN111078394A (en) 2020-04-28
CN111078394B true CN111078394B (en) 2022-12-06

Family

ID=70310724

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911086251.5A Active CN111078394B (en) 2019-11-08 2019-11-08 GPU thread load balancing method and device

Country Status (1)

Country Link
CN (1) CN111078394B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112256435B (en) * 2020-11-03 2023-05-05 成都海光微电子技术有限公司 Method for assigning work groups for graphics processor and graphics processor
US20220413911A1 (en) * 2021-06-29 2022-12-29 International Business Machines Corporation Routing instructions in a microprocessor
CN115168058B (en) * 2022-09-06 2022-11-25 深流微智能科技(深圳)有限公司 Thread load balancing method, device, equipment and storage medium
CN116820786B (en) * 2023-08-31 2023-12-19 本原数据(北京)信息技术有限公司 Data access method and device of database, electronic equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7185338B2 (en) * 2002-10-15 2007-02-27 Sun Microsystems, Inc. Processor with speculative multithreading and hardware to support multithreading software
KR20150019349A (en) * 2013-08-13 2015-02-25 삼성전자주식회사 Multiple threads execution processor and its operating method
CN103871032A (en) * 2014-03-07 2014-06-18 福建工程学院 Image enhancement method for Wallis filter based on GPU (Graphics Processing Unit)
US10102031B2 (en) * 2015-05-29 2018-10-16 Qualcomm Incorporated Bandwidth/resource management for multithreaded processors
CN109032793B (en) * 2018-07-11 2021-03-16 Oppo广东移动通信有限公司 Resource allocation method, device, terminal and storage medium
CN109947569B (en) * 2019-03-15 2021-04-06 Oppo广东移动通信有限公司 Method, device, terminal and storage medium for binding core



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant