CN112131008B - Method for scheduling thread bundle warp, processor and computer storage medium - Google Patents


Info

Publication number
CN112131008B
Authority
CN
China
Legal status: Active
Application number
CN202011045489.6A
Other languages
Chinese (zh)
Other versions
CN112131008A
Inventor
黄虎才
姚冠宇
李洋
Current Assignee
Xi'an Xintong Semiconductor Technology Co ltd
Original Assignee
Xi'an Xintong Semiconductor Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Xi'an Xintong Semiconductor Technology Co ltd filed Critical Xi'an Xintong Semiconductor Technology Co ltd
Priority to CN202011045489.6A
Publication of CN112131008A
Application granted
Publication of CN112131008B


Classifications

    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06T1/20 Processor architectures; processor configuration, e.g. pipelining
    • G06F2209/5018 Indexing scheme relating to G06F9/50: thread allocation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Advance Control (AREA)

Abstract

Embodiments of the invention relate to GPU technology and disclose a method for scheduling thread bundles (warps), a processor, and a computer storage medium. The method may include: a monitoring step of monitoring the number of threads in an idle state in a first warp of a currently executed task; a determining step of, upon detecting threads currently in the idle state in the first warp, determining a second warp from among the other warps according to the number of idle-state threads and the number of active-state threads in the warps other than the first warp; and a scheduling step of scheduling the threads of the second warp onto the cores corresponding to the idle-state threads in the first warp, so as to execute the instructions to be executed by the second warp. In this way, idle execution resources can be fully utilized, the waste of computing resources is reduced, and GPU performance is improved.

Description

Method for scheduling thread bundle warp, processor and computer storage medium
Technical Field
Embodiments of the present invention relate to graphics processing unit (GPU) technology, and more particularly, to a method, a processor, and a computer storage medium for scheduling thread bundles (warps).
Background
Single-Instruction-Multiple-Thread (SIMT) is the parallel execution mode conventionally adopted by some current GPUs, and a thread group or thread bundle (warp) is the basic scheduling unit in a GPU. The number of threads corresponding to each warp is usually fixed, which keeps the parallel architecture simple and easy to maintain. However, not all threads contained in a warp are active when executing certain applications or in certain scenarios. For example, when a warp in a processor processes a conditional-branch block (e.g., if-else), some of the threads in the warp (say M threads) follow the "if" path and are therefore in an active state, while the remaining threads (K - M threads, where K is the total number of threads in a warp) follow the "else" path and are temporarily disabled (waiting), and are therefore in an idle state. The computing resources corresponding to the K - M idle-state threads go unused until those threads change from the idle state back to the active state to execute the "else" path. This phenomenon wastes computing resources.
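The branch-divergence scenario described above can be sketched in Python. This is an illustrative model only, not code from the patent; the mask representation and function name are assumptions:

```python
# Hypothetical sketch: under SIMT, an if-else splits a warp of K threads into
# an active set (the "if" path) and an idle set (threads parked until "else").
K = 8  # threads per warp (fixed width in a conventional design)

def divergence_masks(predicates):
    """Given each thread's branch predicate, return the 'if' and 'else' masks."""
    if_mask = [bool(p) for p in predicates]   # threads following the "if" path
    else_mask = [not m for m in if_mask]      # threads waiting for the "else" path
    return if_mask, else_mask

# Example: threads 0..4 take the "if" path (M = 5), threads 5..7 idle (K - M = 3).
preds = [1, 1, 1, 1, 1, 0, 0, 0]
if_mask, else_mask = divergence_masks(preds)
idle_lanes = sum(else_mask)  # cores left unused while the "if" path executes
```

Here `idle_lanes` counts exactly the K - M cores that the patent identifies as wasted in a fixed-width design.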
Disclosure of Invention
In view of the foregoing, embodiments of the present invention aim to provide a method, a processor, and a computer storage medium for scheduling thread bundles (warps), so that idle execution resources can be fully utilized, the waste of computing resources is reduced, and GPU performance is improved.
The technical scheme of the embodiment of the invention is realized as follows:
In a first aspect, an embodiment of the present invention provides a processor, including: a pipeline controller and a plurality of cores organized into a plurality of thread groups (warps), wherein each warp comprises a plurality of cores and each core corresponds to a thread;
the pipeline controller is configured to perform the following steps:
a monitoring step of monitoring the number of threads in an idle state in a first warp of a currently executed task;
a determining step of, upon detecting threads currently in the idle state in the first warp, determining a second warp from among the other warps according to the number of idle-state threads and the number of active-state threads in the warps other than the first warp; and
a scheduling step of scheduling the threads of the second warp onto the cores corresponding to the idle-state threads in the first warp, so as to execute the instructions to be executed by the second warp.
In a second aspect, an embodiment of the present invention provides a method for scheduling thread bundles warp, where the method includes:
a monitoring step of monitoring the number of threads in an idle state in a first warp of a currently executed task;
a determining step of, upon detecting threads currently in the idle state in the first warp, determining a second warp from among the other warps according to the number of idle-state threads and the number of active-state threads in the warps other than the first warp; and
a scheduling step of scheduling the threads of the second warp onto the cores corresponding to the idle-state threads in the first warp, so as to execute the instructions to be executed by the second warp.
In a third aspect, an embodiment of the present invention provides a computer storage medium storing a program for scheduling thread bundles (warps), where the program, when executed by at least one processor, implements the steps of the method for scheduling thread bundles according to the second aspect.
Embodiments of the invention provide a method for scheduling thread bundles (warps), a processor, and a computer storage medium. When idle-state threads appear in a first warp of a currently executed task, active-state threads of a second warp are scheduled onto the cores of those idle-state threads, so that the width of the first warp is flexibly adjusted according to the specific execution scenario and the idle computing resources corresponding to the idle-state threads are filled, improving the utilization of computing resources and execution performance.
Drawings
Fig. 1 is a schematic diagram of a processor according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of dynamic variation of warp width according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a warp scheduling based on time sequence according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of an instruction pipeline based on two warp according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a warp width based on time sequence according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of an instruction pipeline based on four warp according to an embodiment of the present invention.
Fig. 7 is a flowchart of a method for scheduling thread bundles warp according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
As the basic scheduling unit for parallel execution on a GPU, the warp width has a great influence on parallel performance. The warp width generally denotes the number of threads contained in one warp, and in current GPUs it is a fixed value, so a fixed-width warp cannot achieve the best performance across the differing requirements of multiple applications. For example, a warp width of 8 may yield the best performance in application scenario A, while a width of 16 may yield the best performance in application scenario B. In current GPU designs the warp width is fixed: if a GPU's warp width is fixed at 8, it can obtain the best performance in scenario A but not in scenario B; likewise, if it is fixed at 16, it can obtain the best performance in scenario B but not in scenario A. In other words, the warp width of current GPUs cannot change with the application scenario, and no matter what fixed value is chosen, it is difficult to obtain the best performance across different scenarios.
To avoid this situation and improve the parallel execution performance of the GPU, embodiments of the present invention aim to adjust the warp width in real time during parallel execution, so that the computing resources corresponding to the threads in a warp can be fully utilized, the waste of computing resources is reduced, and parallel execution performance is improved.
Referring to FIG. 1, there is shown a schematic diagram of a processor 100 suitable for the SIMT execution mode that can implement aspects of embodiments of the present invention. In some examples, the processor 100 may be implemented as one general-purpose processing cluster in a processor-cluster array used for highly parallel computing, such as a GPU, to execute a large number of threads in parallel, where each thread is an instance of a program. In other examples, the processor 100 may be implemented as a streaming multiprocessor (SM) in a GPU. The processor 100 may include multiple thread processors, or cores, each of which may correspond to one thread, organized into warps. In some examples, where the processor 100 is implemented as an SM, a core may be implemented as a streaming processor (SP), also referred to as a Compute Unified Device Architecture core (CUDA core). The processor 100 may contain J warps 104-1 to 104-J, each warp having K cores 106-1 to 106-K. In some examples, the warps 104-1 to 104-J may be further organized into one or more thread blocks 102. In some examples, each warp may have 32 cores; in other examples, each warp may have 4 cores, 8 cores, 16 cores, or more. As shown in FIG. 1, the embodiments of the present invention are described taking the setting that each warp has 8 cores (i.e., K = 8); it is to be understood that this setting is used only to describe the technical solution and does not limit its protection scope, and those skilled in the art can easily adapt the technical solution described here to other settings, which is not repeated. In some alternative examples, the processor 100 may organize the cores only into warps, omitting the thread-block level of organization.
Further, the processor 100 may also include a pipeline controller 108, a shared memory 110, and an array of local memories 112-1 to 112-J associated with the warps 104-1 to 104-J. The pipeline controller 108 distributes tasks to the warps 104-1 to 104-J via the data bus 114, and creates, manages, schedules, and executes the warps 104-1 to 104-J while providing mechanisms to synchronize them. With continued reference to the processor 100 shown in FIG. 1, the cores within a warp execute in parallel with one another. The warps 104-1 to 104-J communicate with the shared memory 110 through the memory bus 116, and with the local memories 112-1 to 112-J, respectively, through the local buses 118-1 to 118-J; for example, as shown in FIG. 1, warp 104-J communicates over local bus 118-J to use local memory 112-J. Some embodiments of the processor 100 allocate a shared portion of the shared memory 110 to each thread block 102 and allow all warps within the thread block 102 to access that shared portion. Some embodiments include warps that use only local memory; many other embodiments include warps that balance the use of local memory and the shared memory 110.
For the processor 100 shown in FIG. 1, the current scheme is to fix the above K value, or to determine it by setting it in advance before parallel processing begins. That is, during actual parallel processing, the width of a single warp (which may also be called the number of threads or cores included in the warp) is fixed and cannot change with the application scenario. Consequently, in some cases, for example when executing a branching task, some threads in the warp are not executed, and the cores corresponding to those unexecuted threads are not enabled, which wastes computing resources. To enable real-time adjustment of the warp width and ensure that the computing resources corresponding to the threads within a warp can be fully utilized, in some examples the pipeline controller 108 of the processor 100 shown in FIG. 1 is configured to perform: a monitoring step of monitoring the number of threads in an idle state in a first warp of a currently executed task; a determining step of, upon detecting threads currently in the idle state in the first warp, determining a second warp from among the other warps according to the number of idle-state threads and the number of active-state threads in the warps other than the first warp; and a scheduling step of scheduling the threads of the second warp onto the cores corresponding to the idle-state threads in the first warp, so as to execute the instructions to be executed by the second warp.
With the above example, when idle-state threads appear in the first warp of the currently executed task, active-state threads of a second warp are scheduled onto the cores of those idle-state threads, so that the width of the first warp is flexibly adjusted according to the specific execution scenario and the idle computing resources corresponding to the idle-state threads are filled, improving the utilization of computing resources and execution performance.
Building on the above examples, in some preferred examples the pipeline controller 108 may count, in real time, the effective execution parameters corresponding to the current task while executing branch and store instructions; obtain the active-state threads in the first warp from those effective execution parameters; and obtain the idle-state threads of the first warp from its active-state threads. For example, the effective execution parameters may be the thread branch activity rate and the memory-access activity rate. Specifically, while instructions execute, the thread branch activity rate may record the number of effective threads in the first warp, i.e., count the effective threads each time. For example, for a warp with n threads, n thread counters are configured correspondingly, and the total count of each counter serves as a reference parameter from which the active-state and idle-state threads in the first warp can be determined. Thus, when performing warp scheduling as in the above example, the largest reference parameter among the n thread counts is selected, from which the active-state threads in the first warp, and hence its idle-state threads, are known.
With the pipeline controller 108 of the above example, the threads of the second warp are scheduled onto the cores corresponding to the idle-state threads of the first warp to execute the instructions that the second warp needs to execute. For example, suppose the width of the first warp is 8 before scheduling, 4 idle threads appear as a result of executing a branch instruction, and the width of the second warp is 4. After the pipeline controller 108 detects that the first warp contains 4 idle-state threads, it can select, from the warps other than the first warp, a second warp whose thread count equals the number of idle-state threads, and then schedule the 4 threads of the second warp onto the cores corresponding to the 4 idle-state threads of the first warp for execution. At this point the first warp corresponds to only 4 threads, and its width has been adjusted from 8 before scheduling to 4 after scheduling, achieving real-time adjustment of the warp width.
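The monitoring, determining, and scheduling steps, together with the 8-to-4 width adjustment just described, can be sketched as follows. The `Warp` class, the function names, and the lane-mapping representation are hypothetical; the matching rule (the second warp's active-thread count equals the first warp's idle-thread count) follows the example above:

```python
# Illustrative sketch of the three steps; not the patent's actual implementation.
class Warp:
    def __init__(self, wid, active_mask):
        self.wid = wid
        self.active_mask = list(active_mask)  # truthy = active thread, falsy = idle

    @property
    def idle_lanes(self):
        return [i for i, a in enumerate(self.active_mask) if not a]

    @property
    def active_count(self):
        return sum(bool(a) for a in self.active_mask)

def determine_second_warp(first, others):
    """Determining step: pick a warp whose active count matches first's idle count."""
    idle = len(first.idle_lanes)
    for w in others:
        if w.active_count == idle:
            return w
    return None

def schedule(first, second):
    """Scheduling step: map second's threads onto first's idle cores."""
    mapping = dict(zip(range(second.active_count), first.idle_lanes))
    for lane in first.idle_lanes:
        first.active_mask[lane] = True  # core now occupied by a second-warp thread
    return mapping

# Monitoring step finds 4 idle threads in the first warp (width 8 before scheduling).
first = Warp(0, [1, 1, 1, 1, 0, 0, 0, 0])
others = [Warp(1, [1, 1, 1, 1, 1, 1, 0, 0]), Warp(2, [1, 1, 1, 1, 0, 0, 0, 0])]
second = determine_second_warp(first, others)   # warp 2: 4 active threads
mapping = schedule(first, second)               # its threads fill lanes 4..7
```

After `schedule`, every core of the first warp's lane group is busy again, which is the "filled idle computing resources" effect the example describes.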
As the above examples show, the warp width can be adjusted in real time for a specific scenario, making it dynamically variable. Again taking 8 threads as an example, and assuming 2 warps are currently in the running state, the widths of the two warps over time under the above technical solution can be as shown in FIG. 2, where warp0 is drawn with unfilled boxes and warp1 with cross-hatched boxes. In FIG. 2, from t0 to t1 the width of warp0 is 5 and the width of warp1 is 3; from t1 to t2 the width of warp0 is 3 and the width of warp1 is 5; from t2 to t3 the widths of warp0 and warp1 are both 4; and from t3 to t4 the width of warp0 is 5 and the width of warp1 is 3. It will be understood that these width adjustments can be implemented by the technical solutions set forth in the foregoing examples, which are not described again in the embodiments of the present invention.
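The FIG. 2 scenario can be written down as data. This is a minimal check, under the stated assumption that the two warps always share the same 8 cores, so their widths sum to 8 in every period; the variable names are illustrative:

```python
# Width timeline from the description of Fig. 2: two warps share 8 cores.
TOTAL_CORES = 8
# (period, warp0 width, warp1 width)
timeline = [
    ("t0-t1", 5, 3),
    ("t1-t2", 3, 5),
    ("t2-t3", 4, 4),
    ("t3-t4", 5, 3),
]
for period, w0, w1 in timeline:
    # Dynamic widths never exceed the shared core budget.
    assert w0 + w1 == TOTAL_CORES, period
```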
For the above example, the pipeline controller 108 is further configured to: and the thread in the idle state in the first warp is called out according to the call of the current execution task to be changed into an active state again, and the thread in the idle state in the first warp is dispatched to a core corresponding to the thread in the idle state in the first warp before the dispatching step so as to execute the current execution task.
Further, the technical solution set forth in the above example is illustrated with reference to FIG. 3. In FIG. 3, the vertical direction is a time sequence represented by time points, adjacent time points differ by one processing cycle, the threads of the first warp are shown as cross-hatched boxes, and the threads of the second warp are shown as dot-filled boxes. As the figure shows, the initial width of the first warp is 8; it contains 8 threads, marked T1, T2, T3, T4, T5, T6, T7, and T8. At cycle-n+0 (which may also be called the initial time), all threads of the first warp executing the instruction are in the active state. At cycle-n+1, the four threads T4 to T7 of the first warp enter the idle state based on task execution, and the pipeline controller 108 can determine a second warp from the other warps; the second warp, whose width is now 4, contains 4 threads, marked T'1, T'2, T'3, and T'4 in sequence. After determining the second warp, the pipeline controller 108 may schedule its threads to execute on the cores corresponding to the four threads T4 to T7 of the first warp. At cycle-n+8, the four threads T4 to T7 of the first warp change from the idle state back to the active state because of subsequent instructions; at this point the 4 threads T'1 to T'4 of the second warp are scheduled out of the cores they had previously entered, the four threads T4 to T7 of the first warp execute again on their corresponding cores, and all 8 threads of the first warp are in the active state again.
For the technical solution set forth in the above example, the embodiments of the present invention are described taking as an example a processor 100 capable of supporting parallel execution of 8, 16, or 32 threads. Specifically:
Taking 8 threads as an example, the processor 100 may support parallel execution of at most 2 warps at a time; that is, the pipeline controller 108 may include 2 instruction-fetch units and 2 decode units, with each warp corresponding to one fetch unit and one decode unit. The total number of threads across the 2 warps is at most 8, and when 2 warps run simultaneously each warp is on average 4 threads wide.
Taking 16 threads as an example, the processor 100 may support parallel execution of at most 4 warps at a time; that is, the pipeline controller 108 may include 4 instruction-fetch units and 4 decode units, with each warp still corresponding to one fetch unit and one decode unit. The total number of threads across the 4 warps is at most 16, and when 4 warps run simultaneously each warp is on average 4 threads wide.
Taking 32 threads as an example, the processor 100 may support parallel execution of at most 4 warps at a time; that is, the pipeline controller 108 may include 4 instruction-fetch units and 4 decode units, with each warp still corresponding to one fetch unit and one decode unit. The total number of threads across the 4 warps is at most 32, and when 4 warps run simultaneously each warp is on average 8 threads wide.
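The three configurations above can be summarized programmatically. This sketch only restates the stated rule of one fetch/decode pair per warp; the function and field names are illustrative, not from the patent:

```python
# Per-configuration pipeline resources, mirroring the three cases described:
# total thread count -> maximum number of concurrently running warps.
configs = {
    8:  {"max_warps": 2},
    16: {"max_warps": 4},
    32: {"max_warps": 4},
}

def pipeline_resources(total_threads):
    n = configs[total_threads]["max_warps"]
    return {
        "fetch_units": n,                        # one instruction-fetch unit per warp
        "decode_units": n,                       # one decode unit per warp
        "avg_warp_width": total_threads // n,    # average width when all warps run
    }
```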
Based on the above examples, it may be appreciated that when the processor 100 supports at most n warps executing in parallel at a time, the pipeline controller 108 may include n fetch units and n decode units, each warp still corresponding to one fetch unit and one decode unit; if a warp is not activated, its fetch unit and decode unit are not activated either. In FIG. 1, all of the fetch units in the pipeline controller 108 may share an instruction cache, which may be part of the shared memory 110 or implemented by it; in some examples, the tag of the instruction cache may consist of the warp-id corresponding to the fetch unit and the corresponding program counter (PC).
For the above example of the processor 100 supporting 8-thread parallel execution, the two warps each correspond to one instruction-fetch unit (I-Fetch) and one decode unit (I-Dec), and together with the execution units (I-Exec) formed by the cores in the processor 100, a warp pipeline structure in which the processor 100 executes 8 threads in parallel can be formed as shown in FIG. 4. The two warps can be identified as warp-m and warp-n, respectively; in FIG. 4, the pipeline components corresponding to warp-m are drawn as cross-hatched boxes and those corresponding to warp-n as dot-filled boxes. Since the processor 100 can support 8 threads executing in parallel, the total number of execution units I-Exec is 8; note that the two warps share the 8 execution units and that each execution unit is available to only one warp at any time. In some examples, as shown in FIG. 5, the width of warp-m is set to 5, so its number of execution units is 5, which may be labeled I-Exec0 through I-Exec4; the width of warp-n is set to 3, so its number of execution units is 3, which may be labeled I-Exec0 through I-Exec2. As can be seen from FIGS. 4 and 5, the 8 threads are divided into two warps, each warp corresponds to an independent instruction pipeline, and each pipeline contains independent fetch, decode, and execution units, where the 8 execution units in the compute core are shared by the two pipelines and each execution unit is controlled by one warp at any time, so the number of execution units in the pipeline corresponding to each warp matches that warp's width (i.e., the number of threads it contains).
For the above example of the processor 100 supporting 16- or 32-thread parallel execution, each of the 4 warps supported by the processor 100 corresponds to an independent instruction pipeline, and each pipeline contains independent fetch, decode, and execution units. For 16 threads, the number of execution units is 16, shared by the 4 warps, and likewise each execution unit is controlled and used by only one warp at any time; for 32 threads, the number of execution units is 32, also shared by the 4 warps, with each execution unit again used and controlled by only one warp at any time. As shown in FIG. 6, each column represents the pipeline corresponding to one warp. If the processor 100 supports 16 threads executing in parallel, the 16 execution units are shared by 4 warps, i.e., M, N, J, and K in FIG. 6 add up to 16 - 4 = 12; if the processor 100 supports 32 threads executing in parallel, the 32 execution units are shared by 4 warps, i.e., M, N, J, and K in FIG. 6 add up to 32 - 4 = 28.
Referring to FIG. 7, an embodiment of the present invention provides a method for scheduling thread bundles (warps), which may be applied to the processor 100 of the foregoing technical solution. As shown in FIG. 7, the method includes:
S701: a monitoring step of monitoring the number of threads in an idle state in a first warp of a currently executed task;
S702: a determining step of, upon detecting threads currently in the idle state in the first warp, determining a second warp from among the other warps according to the number of idle-state threads and the number of active-state threads in the warps other than the first warp;
S703: a scheduling step of scheduling the threads of the second warp onto the cores corresponding to the idle-state threads in the first warp, so as to execute the instructions to be executed by the second warp.
For the solution shown in FIG. 7, in some examples, the monitoring step, that is, monitoring the number of threads in an idle state in the first warp of the currently executed task, includes:
counting, in real time, the effective execution parameters corresponding to the current task while executing branch and store instructions;
obtaining the active-state threads in the first warp from the effective execution parameters; and
obtaining the idle-state threads of the first warp from the active-state threads of the first warp.
In the above example, the effective execution parameters include the thread branch activity rate and the memory-access activity rate.
For the solution shown in fig. 7, in some examples, the width of the first warp is adjusted in real time during the execution of the scheduling step.
For the solution shown in fig. 7, in some examples, the method further includes:
when the idle-state threads in the first warp become active again because the currently executed task invokes them, scheduling those threads back onto the cores that corresponded to them before the scheduling step, so as to execute the current task.
It should be noted that the technical solution shown in FIG. 7 and its examples may be implemented in combination with the description of the processor 100 in the foregoing technical solution, which is not repeated in the embodiments of the present invention.
It will be appreciated that the technical solution shown in fig. 7 and its examples may be implemented in the form of hardware or in the form of software functional modules.
If implemented in the form of software functional modules and sold or used as an independent product, the solution may be stored in a computer-readable storage medium. Based on this understanding, the essence of the technical solution shown in FIG. 7 and its examples, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the method described in this embodiment. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
Accordingly, the present embodiment provides a computer storage medium storing a program for scheduling thread bundles (warps); when executed by at least one processor, the program implements the method steps for scheduling thread bundles warp in the above technical solution.
It should be noted that the technical schemes described in the embodiments of the present invention may be combined arbitrarily without conflict.
The foregoing is merely illustrative of the present invention, and the scope of protection is not limited thereto; any variation or substitution readily conceivable by a person skilled in the art within the technical scope disclosed herein shall fall within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A processor, the processor comprising: a pipeline controller and a plurality of cores organized into a plurality of thread bundles (warps); wherein each warp comprises a plurality of cores, and each core corresponds to one thread;
The pipeline controller is configured to perform the steps of:
a monitoring step of monitoring the number of threads in an idle state in a first warp of a currently executed task;
a determining step of, in response to monitoring that there are threads currently in the idle state in the first warp, determining a second warp from the warps other than the first warp according to the number of the threads in the idle state and the number of threads in the active state in the other warps;
a scheduling step of scheduling the threads of the second warp onto the cores corresponding to the threads in the idle state in the first warp, so as to execute the instructions to be executed by the second warp;
the pipeline controller is configured to adjust the width of the first warp in real time during execution of the scheduling step.
2. The processor of claim 1, wherein the pipeline controller is configured to perform:
counting, in real time, the effective execution parameters corresponding to the current task during execution of branch and memory-access instructions;
acquiring the threads in the active state in the first warp according to the effective execution parameters; and
acquiring the threads in the idle state in the first warp according to the threads in the active state in the first warp.
3. The processor of claim 2, wherein the effective execution parameters include a thread branching activity rate and a memory-access instruction activity rate.
4. The processor of claim 1, wherein the pipeline controller is further configured to perform:
when a thread in the idle state in the first warp is called by the currently executed task and thereby becomes active again, scheduling that thread, prior to the scheduling step, onto the core corresponding to it, so as to execute the currently executed task.
5. The processor of any one of claims 1 to 4, wherein the pipeline controller comprises n fetch units and n decode units, and each core corresponds to an execution unit, where n represents the number of warps organized in the processor; and each warp corresponds to one fetch unit, one decode unit, and the execution units corresponding to the threads included in that warp.
6. The processor of claim 5, further comprising a shared memory coupled, via a memory bus, to the warps organized in the processor; the shared memory implements an instruction cache shared by all the fetch units, and each tag of the instruction cache consists of the warp identifier corresponding to a fetch unit and the corresponding program counter (PC).
7. A method of scheduling thread bundles (warps), the method comprising:
a monitoring step of monitoring the number of threads in an idle state in a first warp of a currently executed task;
a determining step of, in response to monitoring that there are threads currently in the idle state in the first warp, determining a second warp from the warps other than the first warp according to the number of the threads in the idle state and the number of threads in the active state in the other warps;
a scheduling step of scheduling the threads of the second warp onto the cores corresponding to the threads in the idle state in the first warp, so as to execute the instructions to be executed by the second warp; and
adjusting the width of the first warp in real time during execution of the scheduling step.
8. The method of claim 7, wherein the monitoring step comprises:
counting, in real time, the effective execution parameters corresponding to the current task during execution of branch and memory-access instructions;
acquiring the threads in the active state in the first warp according to the effective execution parameters; and
acquiring the threads in the idle state in the first warp according to the threads in the active state in the first warp.
9. The method of claim 7, wherein the method further comprises:
when a thread in the idle state in the first warp is called by the currently executed task and thereby becomes active again, scheduling that thread, prior to the scheduling step, onto the core corresponding to it, so as to execute the currently executed task.
10. A computer storage medium storing a program for scheduling thread bundles (warps), which, when executed by at least one processor, implements the steps of the method for scheduling a thread bundle warp according to any one of claims 7 to 9.
CN202011045489.6A 2020-09-28 2020-09-28 Method for scheduling thread bundle warp, processor and computer storage medium Active CN112131008B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011045489.6A CN112131008B (en) 2020-09-28 2020-09-28 Method for scheduling thread bundle warp, processor and computer storage medium


Publications (2)

Publication Number Publication Date
CN112131008A CN112131008A (en) 2020-12-25
CN112131008B true CN112131008B (en) 2024-04-19

Family

ID=73844559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011045489.6A Active CN112131008B (en) 2020-09-28 2020-09-28 Method for scheduling thread bundle warp, processor and computer storage medium

Country Status (1)

Country Link
CN (1) CN112131008B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106325996A (en) * 2015-06-19 2017-01-11 华为技术有限公司 GPU resource distribution method and system
CN106484519A (en) * 2016-10-11 2017-03-08 东南大学 Asynchronous thread recombination method and the SIMT processor based on the method
US9632823B1 (en) * 2014-09-08 2017-04-25 Amazon Technologies, Inc. Multithreaded application thread schedule selection
CN107544845A (en) * 2017-06-26 2018-01-05 新华三大数据技术有限公司 GPU resource dispatching method and device
CN108984282A (en) * 2017-06-04 2018-12-11 苹果公司 The scheduler of AMP architecture with closed-loop characteristic controller
CN109840877A (en) * 2017-11-24 2019-06-04 华为技术有限公司 A kind of graphics processor and its resource regulating method, device
CN111712793A (en) * 2018-02-14 2020-09-25 华为技术有限公司 Thread processing method and graphics processor

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7487504B2 (en) * 2002-02-06 2009-02-03 International Business Machines Corporation Thread dispatch for multiprocessor computer systems
TWI386814B (en) * 2007-12-31 2013-02-21 Ind Tech Res Inst Multicore interface with dynamic task management capability and task loading/offloading method thereof
US9535700B2 (en) * 2013-06-14 2017-01-03 Arm Limited Data processing systems
US9652027B2 (en) * 2015-04-01 2017-05-16 Microsoft Technology Licensing, Llc Thread scheduling based on performance state and idle state of processing units


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Dynamic Parallelism Scheduling Algorithm Based on Runtime Characteristics on GPGPU; Yu Yulong; Wang Yuxin; Guo He; Journal of Chinese Computer Systems (Issue 12); 176-180 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: Room 301, Building D, Yeda Science and Technology Park, No. 300 Changjiang Road, Yantai Area, Yantai Free Trade Zone, Shandong Province, 265503

Applicant after: Xi'an Xintong Semiconductor Technology Co.,Ltd.

Address before: Room 21101, 11 / F, unit 2, building 1, Wangdu, No. 3, zhangbayi Road, Zhangba Street office, hi tech Zone, Xi'an City, Shaanxi Province

Applicant before: Xi'an Xintong Semiconductor Technology Co.,Ltd.

Country or region before: China

GR01 Patent grant