CN111708622A - Instruction group scheduling method, architecture, equipment and storage medium - Google Patents

Instruction group scheduling method, architecture, equipment and storage medium Download PDF

Info

Publication number
CN111708622A
CN111708622A (application CN202010482280.XA; granted publication CN111708622B)
Authority
CN
China
Prior art keywords
instruction, current, executed, write, groups
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010482280.XA
Other languages
Chinese (zh)
Other versions
CN111708622B (en)
Inventor
王凯
周玉龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Original Assignee
Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd filed Critical Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Priority to CN202010482280.XA priority Critical patent/CN111708622B/en
Publication of CN111708622A publication Critical patent/CN111708622A/en
Application granted granted Critical
Publication of CN111708622B publication Critical patent/CN111708622B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/48Indexing scheme relating to G06F9/48
    • G06F2209/484Precedence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5021Priority

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

The invention discloses an instruction group scheduling method, architecture, device, and storage medium. The method comprises: dividing the threads contained in an input thread group into different instruction groups, and determining each instruction group in turn as the current instruction group according to the arrangement order of the instruction groups; fetching the instruction currently to be executed from the current instruction group as the current instruction, and executing it using the computing and storage resources allocated to it; after the current instruction is fetched, predicting the instruction that will need to be executed once the current instruction completes, determining the predicted instruction as the target instruction, reading instruction information of the target instruction, and allocating corresponding computing and storage resources to it based on that information, the instruction information comprising operands and operators; and, after the current instruction completes, determining the target instruction to be the current instruction and returning to the step of executing the current instruction using the computing and storage resources allocated to it. Pre-allocating resources in this way significantly improves instruction execution efficiency.

Description

Instruction group scheduling method, architecture, equipment and storage medium
Technical Field
The present invention relates to the field of instruction processing technologies, and in particular, to an instruction group scheduling method, architecture, device, and storage medium.
Background
For GPU scheduling, the prior art usually adopts a GPGPU scheduling structure in which an instruction is executed directly as soon as it is obtained. The inventors found that this scheme suffers from low execution efficiency, since the resources an instruction needs are only allocated at the moment it must run.
Disclosure of Invention
The invention aims to provide an instruction group scheduling method, architecture, device, and storage medium that can effectively improve instruction execution efficiency.
To achieve the above purpose, the invention provides the following technical solution:
an instruction group scheduling method, comprising:
dividing the threads contained in an input thread group into different instruction groups, and determining each instruction group in turn as the current instruction group according to the arrangement order of the instruction groups;
fetching the instruction currently to be executed from the current instruction group as the current instruction, and executing the current instruction using the computing and storage resources allocated to it; after the current instruction is fetched, predicting the instruction that will need to be executed after the current instruction completes, determining the predicted instruction as the target instruction, reading instruction information of the target instruction, and allocating corresponding computing and storage resources to the target instruction based on the instruction information, the instruction information comprising operands and operators;
and after the current instruction completes, determining the target instruction to be the current instruction and returning to the step of executing the current instruction using the computing and storage resources allocated to it.
Preferably, after dividing the threads contained in the input thread group into different instruction groups, the method further includes:
analyzing the correlation among the instruction groups; if no correlation exists among the instruction groups, sorting them by priority from high to low; if any instruction groups are correlated, sorting those instruction groups according to the correlation and sorting the remaining instruction groups by priority from high to low.
Preferably, the number of current instructions is plural, and executing the current instructions includes:
judging whether read-after-write or write-after-write conflicts exist among the current instructions and, if so, controlling the conflicting current instructions to execute sequentially using the computing and storage resources allocated to them.
Preferably, the method further comprises:
monitoring each executing instruction in real time; if a read-after-write or write-after-write conflict is detected, suspending the instruction that started executing later among the conflicting instructions, and executing it only after the conflicting instruction that started first has finished.
An instruction group scheduling architecture, comprising:
a thread processing module configured to: divide the threads contained in an input thread group into different instruction groups, and determine each instruction group in turn as the current instruction group according to the arrangement order of the instruction groups;
an instruction flow module configured to: fetch the instruction currently to be executed from the current instruction group as the current instruction, predict the instruction to be executed after the current instruction completes, determine the predicted instruction as the target instruction, read instruction information of the target instruction, and allocate corresponding computing and storage resources to the target instruction based on the instruction information, the instruction information comprising operands and operators;
an instruction execution module configured to: after the current instruction is fetched, execute the current instruction using the computing and storage resources allocated to it; after the current instruction completes, determine the target instruction to be the current instruction and return to the step of executing the current instruction using the computing and storage resources allocated to it.
Preferably, the architecture further comprises:
an instruction group ordering module configured to: divide the threads contained in the input thread group into different instruction groups, analyze the correlation among the instruction groups, and, if no correlation exists, sort the instruction groups by priority from high to low; if any instruction groups are correlated, sort those instruction groups according to the correlation and sort the remaining instruction groups by priority from high to low.
Preferably, the instruction execution module comprises:
an instruction execution unit configured to: judge whether read-after-write or write-after-write conflicts exist among the current instructions and, if so, control the conflicting current instructions to execute sequentially using the computing and storage resources allocated to them; the number of current instructions is plural.
Preferably, the architecture further comprises:
a real-time monitoring module configured to: monitor each executing instruction in real time; if a read-after-write or write-after-write conflict is detected, suspend the instruction that started executing later among the conflicting instructions, and execute it only after the conflicting instruction that started first has finished.
An instruction group scheduling apparatus comprising:
a memory for storing a computer program;
a processor for implementing the steps of the instruction group scheduling method as described in any one of the above when executing the computer program.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the instruction group scheduling method of any one of the above.
The invention provides an instruction group scheduling method, architecture, device, and storage medium, wherein the method comprises: dividing the threads contained in an input thread group into different instruction groups, and determining each instruction group in turn as the current instruction group according to the arrangement order of the instruction groups; fetching the instruction currently to be executed from the current instruction group as the current instruction, and executing it using the computing and storage resources allocated to it; after the current instruction is fetched, predicting the instruction that will need to be executed after the current instruction completes, determining the predicted instruction as the target instruction, reading instruction information of the target instruction, and allocating corresponding computing and storage resources to the target instruction based on that information, the instruction information comprising operands and operators; and, after the current instruction completes, determining the target instruction to be the current instruction and returning to the step of executing the current instruction using the computing and storage resources allocated to it.
After the current instruction to be executed is determined, the instruction to be executed after it finishes can be predicted, and the computing and storage resources that instruction will need can be allocated to it in advance based on its operands and operators. When that instruction is then executed, it runs directly on the pre-allocated resources. Compared with allocating the resources an instruction needs only at the moment it must execute, this clearly improves instruction execution efficiency greatly.
Drawings
To illustrate the embodiments of the present invention or the prior-art technical solutions more clearly, the drawings needed in their description are briefly introduced below. Obviously, the drawings described below show only embodiments of the present invention; a person skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of an instruction group scheduling method according to an embodiment of the present invention;
FIG. 2 is a block diagram of an instruction group scheduling method according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an instruction group scheduling architecture according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from the given embodiments without creative effort fall within the protection scope of the present invention.
Referring to fig. 1, a flowchart of an instruction group scheduling method according to an embodiment of the present invention is shown, where the method includes:
s11: dividing the threads contained in the input thread group into different instruction groups, and sequentially determining each instruction group as the current instruction group according to the arrangement sequence of the instruction groups.
It should be noted that the execution body of the instruction group scheduling method provided by the embodiment of the present invention may be a corresponding instruction group scheduling device; the present application may be implemented on an FPGA platform in the Verilog hardware description language, based on the RISC-V architecture.
Dividing the threads contained in the input thread group into different instruction groups follows the same principle as the corresponding technical solution in the prior art. Specifically, after the thread groups are input, a software editor and optimizer may be used to optimize their running order; the thread groups are arranged in that order, and at the hardware stage each thread group is given a corresponding label such as 00, 01, 02, and so on. Each thread group is then stored separately, the threads it contains are divided to obtain the corresponding instruction groups, the instruction groups are rearranged according to the required running order, and the instruction groups are executed in sequence. Storing by instruction group rather than by thread group refines the storage unit and thereby makes full use of fragmented cache space.
In addition, when the threads in a thread group are divided into instruction groups, the division may follow a preset instruction group size and constraint. For example, if 36 × 36 threads are set as one instruction group and a thread group contains 72 × 72 threads, that thread group is divided into 4 instruction groups, the division following the preset matrix layout. After the split is complete, each instruction group may also be tagged with a corresponding warp-id.
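The splitting example just described (a 72 × 72 thread group divided into four 36 × 36 instruction groups, each tagged with a warp-id) can be sketched as follows. The patent targets Verilog on an FPGA; this Python sketch is purely illustrative, and every function and field name in it is an assumption, not part of the patent.

```python
# Illustrative sketch: split a square thread group into square instruction
# groups (tiles) and assign each a sequential warp-id. The sizes 72 and 36
# come from the example in the text; the data shapes are assumptions.

def split_thread_group(group_dim: int, tile_dim: int):
    """Split a group_dim x group_dim thread group into tile_dim x tile_dim tiles."""
    assert group_dim % tile_dim == 0, "group must divide evenly into tiles"
    tiles_per_side = group_dim // tile_dim
    instruction_groups = []
    warp_id = 0
    for row in range(tiles_per_side):
        for col in range(tiles_per_side):
            # Each instruction group records its warp-id and the thread
            # coordinates it covers within the parent thread group.
            instruction_groups.append({
                "warp_id": warp_id,
                "rows": (row * tile_dim, (row + 1) * tile_dim),
                "cols": (col * tile_dim, (col + 1) * tile_dim),
            })
            warp_id += 1
    return instruction_groups

groups = split_thread_group(72, 36)
```

With these parameters the sketch yields four instruction groups with warp-ids 0 through 3, one per 36 × 36 tile.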
S12: taking out the current instruction to be executed from the current instruction group as the current instruction, and executing the current instruction by using the computing resource and the storage resource distributed for the current instruction; after the current instruction is taken out, predicting the instruction which needs to be executed after the current instruction is executed, determining the predicted instruction as a target instruction, reading instruction information of the target instruction, and distributing corresponding computing resources and storage resources for the target instruction based on the instruction information; the instruction information includes operands and operators.
The current instruction group is the instruction group that currently needs to run; the instruction that currently needs to be executed is fetched from it as the current instruction, which is then executed using the computing and storage resources allocated to it. In addition, to allocate computing and storage resources in advance, the method predicts, once the current instruction has been fetched, which instruction will need to be executed next, reads the computing and storage resources that instruction will require, and allocates them to it so that it can begin running quickly. The prediction may follow any preset prediction mode. For example, the method may count how many times each instruction has historically been executed after the current instruction, and take the instruction with the largest count as the predicted one. Alternatively, once the operation the current instruction implements is known, the method may examine the association between that operation and the operations implemented by the other instructions in the same instruction group, and take an associated instruction as the prediction (for example, an operation that locks a storage area is associated with an operation that writes to that area). Other settings chosen according to actual needs are, of course, also within the scope of the invention.
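The history-based prediction mode mentioned above (predict the instruction most often observed to follow the current one) can be sketched as below. This is an illustration only; the class and method names are assumptions, not the patent's implementation.

```python
from collections import Counter, defaultdict

# Illustrative sketch of a history-based successor predictor: for each
# instruction, count which instruction followed it in past executions and
# predict the most frequent successor as the target instruction.

class SuccessorPredictor:
    def __init__(self):
        self._history = defaultdict(Counter)  # instruction -> Counter of successors
        self._last = None

    def record(self, instr):
        """Observe one executed instruction, updating successor counts."""
        if self._last is not None:
            self._history[self._last][instr] += 1
        self._last = instr

    def predict(self, current):
        """Return the most frequently observed successor, or None if unseen."""
        followers = self._history.get(current)
        if not followers:
            return None
        return followers.most_common(1)[0][0]

predictor = SuccessorPredictor()
for instr in ["load", "add", "store", "load", "add", "mul"]:
    predictor.record(instr)
```

After this trace, "load" has been followed by "add" twice, so "add" becomes the target instruction whenever "load" is current.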
In addition, when corresponding computing and storage resources are allocated to the target instruction based on its instruction information, the operands and operators of the target instruction may be obtained so that the resources required for its execution can be estimated from them and then allocated. One way to make this estimate is to examine historical executions with operands of the same bit width and the same operator, and to reserve the maximum computing and storage resources those executions required, thereby ensuring that the target instruction executes smoothly; other settings may also be adopted according to actual needs.
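The worst-case estimate described above (reserve the maximum resources ever observed for the same operand width and operator) can be sketched as follows. The keying scheme, units, and names here are illustrative assumptions.

```python
from collections import defaultdict

# Illustrative sketch: record resource usage keyed by (operand bit width,
# operator) and estimate a new allocation as the maximum compute and storage
# ever observed for that key, so the target instruction cannot stall.

history = defaultdict(list)  # (bits, op) -> [(compute_units, storage_bytes), ...]

def record_usage(bits, op, compute_units, storage_bytes):
    """Record what one historical execution actually consumed."""
    history[(bits, op)].append((compute_units, storage_bytes))

def estimate(bits, op):
    """Return (compute, storage) to reserve, or None if there is no history."""
    samples = history.get((bits, op))
    if not samples:
        return None  # no history: fall back to allocation at execution time
    return (max(c for c, _ in samples), max(s for _, s in samples))

record_usage(32, "mul", 2, 128)
record_usage(32, "mul", 3, 96)
```

Taking the maximum of each resource independently is deliberately conservative: it trades a little over-reservation for the guarantee that the predicted instruction never waits on resources.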
S13: after the current instruction completes, determining the target instruction to be the current instruction and returning to the step of executing the current instruction using the computing and storage resources allocated to it.
After the current instruction finishes, the next instruction to be executed is fetched as the new current instruction. If it is the target instruction, it is executed with the computing and storage resources already allocated to it, achieving the aim of fast instruction execution; otherwise, the required computing and storage resources are first allocated to the current instruction, which is then executed using them.
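The overall S11–S13 loop, with its predict-and-preallocate overlap and on-demand fallback, can be sketched as below. All callables are placeholders supplied by the caller; none of these names come from the patent.

```python
# Illustrative sketch of the scheduling loop: while each instruction runs,
# resources for its predicted successor are reserved; a correct prediction
# means the next instruction starts on pre-allocated resources, a miss falls
# back to on-demand allocation.

def run_instruction_group(instructions, predict, preallocate, allocate, execute):
    reserved = {}  # predicted instruction -> resources reserved ahead of time
    for instr in instructions:
        # Use resources reserved during the previous step if the prediction
        # was correct; otherwise allocate on demand (the slow path).
        resources = reserved.pop(instr, None)
        if resources is None:
            resources = allocate(instr)
        # Reserve resources for the predicted successor before executing.
        target = predict(instr)
        if target is not None and target not in reserved:
            reserved[target] = preallocate(target)
        execute(instr, resources)

alloc_log, prealloc_log, run_log = [], [], []
plan = {"a": "b", "b": "c"}  # assumed prediction table: "a" is followed by "b", etc.
run_instruction_group(
    ["a", "b", "c"],
    predict=plan.get,
    preallocate=lambda i: (prealloc_log.append(i), f"res-{i}")[1],
    allocate=lambda i: (alloc_log.append(i), f"res-{i}")[1],
    execute=lambda i, r: run_log.append((i, r)),
)
```

In this trace only the first instruction pays for on-demand allocation; "b" and "c" both run on resources reserved one step ahead, which is the efficiency gain the patent claims.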
According to the technical features disclosed in the present application, after the current instruction is determined, the instruction that will need to be executed after it finishes can be predicted, and the computing and storage resources that instruction requires can be allocated to it based on its operands and operators. When that instruction is later executed, it runs directly on the pre-allocated resources. Compared with allocating resources to an instruction only at the moment it must execute, this clearly improves instruction execution efficiency greatly.
It should be noted that target-instruction prediction may be implemented by SM/SFU/LU priority arbitration. Specifically, the execution time of each instruction in the current instruction group, its priority (the more time-critical the execution, the higher the priority), and the correlations among instructions can be obtained; the principle for determining correlation between instructions is the same as for instruction groups and is not repeated here. When predicting the target instruction: if an instruction correlated with the current one exists (such as a jump or call), it is selected as the target; if no correlated instruction exists, the instruction with the highest priority is selected; and if several instructions share the highest priority, the one with the shortest execution time is selected. The prediction thereby meets the current requirements.
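The three-tier arbitration rule just described (correlation first, then highest priority, then shortest execution time as the tie-breaker) can be sketched as follows; the candidate tuple layout is an assumption for illustration.

```python
# Illustrative sketch of target selection: prefer an instruction correlated
# with the current one (e.g. a jump target); otherwise take the highest
# priority; break priority ties by shortest execution time.

def select_target(candidates, correlated_with_current):
    """candidates: list of (name, priority, exec_time); higher priority wins."""
    # Tier 1: a correlated instruction always wins.
    for name, _, _ in candidates:
        if name in correlated_with_current:
            return name
    if not candidates:
        return None
    # Tier 2: highest priority; Tier 3: shortest execution time among ties.
    best_priority = max(p for _, p, _ in candidates)
    tied = [(name, t) for name, p, t in candidates if p == best_priority]
    return min(tied, key=lambda x: x[1])[0]

target = select_target([("x", 1, 5), ("y", 3, 2), ("z", 3, 1)], set())
```

Here "y" and "z" tie at the highest priority, so "z" wins on its shorter execution time; if "x" were correlated with the current instruction it would win outright.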
In addition, the computing and storage resources allocated to the target instruction may comprise a plurality of nodes, so that resource allocation can be performed in a load-balanced manner, improving resource utilization.
The instruction group scheduling method provided by the embodiment of the present invention may further include, after dividing the threads contained in the input thread group into different instruction groups:
analyzing the correlation among the instruction groups; if no correlation exists, sorting the instruction groups by priority from high to low; if any instruction groups are correlated, sorting those instruction groups according to the correlation and sorting the remaining instruction groups by priority from high to low.
Considering that different instruction groups may have different timeliness requirements, corresponding priorities may be set for them (the more time-critical the execution, the higher the priority), so that higher-priority instruction groups execute first. However, since any number of instruction groups may be correlated, correlated instruction groups must execute in the order the correlation requires, while uncorrelated groups execute in priority order. Executing in the order the correlation requires means that the operations of some instruction groups may affect those of others, such as reading and writing the same data, or modifying and deleting the same file; in such cases the groups must run in the required order to guarantee that each group's operations take effect. For example, if two instruction groups respectively write and read the same data, the writing group is executed first and the reading group second, ensuring the validity of the data read.
In the instruction group scheduling method provided by the embodiment of the present invention, the number of current instructions may be plural; executing the current instructions may include:
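The ordering rule above, with correlated chains keeping their mandated order (writer before reader) and independent groups sorted by priority, can be sketched as follows. The data shapes, and the choice to run chained groups before independent ones, are illustrative assumptions.

```python
# Illustrative sketch of instruction group ordering: groups tied together by
# a correlation chain keep their required relative order, and all remaining
# groups are sorted by priority, highest first.

def order_groups(groups, chains):
    """groups: {name: priority}; chains: list of correlated-name sequences
    already in their required execution order (e.g. writer before reader)."""
    chained = [name for chain in chains for name in chain]
    independent = sorted(
        (n for n in groups if n not in chained),
        key=lambda n: -groups[n],  # higher priority executes earlier
    )
    # Correlated chains run in their mandated order, then the rest by priority.
    return chained + independent

order = order_groups({"a": 1, "b": 3, "c": 2, "w": 0, "r": 0}, [["w", "r"]])
```

In this example the write group "w" precedes the read group "r" regardless of priority, and the uncorrelated groups follow in priority order b, c, a.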
judging whether read-after-write or write-after-write conflicts exist among the current instructions and, if so, controlling the conflicting current instructions to execute sequentially using the computing and storage resources allocated to them.
If a plurality of instructions currently need to execute in parallel, it can be judged whether read-after-write or write-after-write conflicts exist among them. Specifically, if any of the current instructions respectively perform a read and a write on the same data, a read-after-write conflict exists; if several current instructions all write the same data, a write-after-write conflict exists. To avoid such conflicts and keep data operations valid, the conflicting current instructions are no longer executed in parallel but one at a time, which resolves the conflict. Moreover, when current instructions with a read-after-write conflict are serialized, the writing instruction is executed before the reading instruction, ensuring that valid data is read.
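The hazard check described above can be sketched as below: two instructions conflict if one writes a location the other reads (read-after-write) or both write the same location (write-after-write), and conflicting instructions are moved out of the parallel batch. The (reads, writes) tuple encoding is an assumption for illustration.

```python
# Illustrative sketch of read-after-write / write-after-write detection and
# of splitting the current instructions into a parallel batch and a
# serialized tail.

def conflicts(a, b):
    """Each instruction is (reads: set, writes: set) of data locations."""
    a_reads, a_writes = a
    b_reads, b_writes = b
    raw = bool(a_writes & b_reads) or bool(b_writes & a_reads)  # read-after-write
    waw = bool(a_writes & b_writes)                             # write-after-write
    return raw or waw

def partition_issue(current_instructions):
    """Keep non-conflicting instructions parallel; serialize the rest in order."""
    parallel, serial = [], []
    for instr in current_instructions:
        if any(conflicts(instr, other) for other in parallel):
            serial.append(instr)  # executed one at a time, after the batch
        else:
            parallel.append(instr)
    return parallel, serial

writer = (set(), {"x"})   # writes x
reader = ({"x"}, set())   # reads x -> read-after-write conflict with writer
parallel, serial = partition_issue([writer, ({"y"}, {"z"}), reader])
```

Because the writer is already in the parallel batch, the reader is serialized behind it, which matches the rule that the write executes before the read.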
The instruction group scheduling method provided by the embodiment of the invention can further comprise the following steps:
and monitoring each instruction which is being executed in real time, if the read-after-write conflict or the write-after-write conflict is monitored, suspending the instruction which is started to be executed later in the instructions which have the read-after-write conflict or the write-after-write conflict, and after the instruction which is started to be executed first in the instructions which have the read-after-write conflict or the write-after-write conflict is completely executed, executing the instruction which is started to be executed later.
To further ensure the validity of data operations, after judging whether read-after-write and write-after-write conflicts exist among the plurality of current instructions, the method also monitors in real time, during instruction execution, whether conflicts arise among the executing instructions. If so, the conflicting instructions are controlled to execute sequentially in order of their start times, from earliest to latest: the instruction that started executing later is suspended, and it resumes execution only after the conflicting instruction that started executing first has finished.
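The suspend-and-resume rule above amounts to picking, in each conflicting pair, the instruction with the later start time. A minimal sketch, in which the `(name, start_time)` tuple format and the helper name are assumptions:

```python
def instructions_to_suspend(executing, conflict_pairs):
    """executing: list of (name, start_time) for instructions in flight.
    conflict_pairs: name pairs found to have a read-after-write or
    write-after-write conflict.  Returns the names to suspend: in each
    conflicting pair, the instruction that started executing later is
    paused and may resume only after the earlier one finishes."""
    start = dict(executing)
    suspended = set()
    for a, b in conflict_pairs:
        # Suspend whichever of the pair began execution later.
        suspended.add(a if start[a] > start[b] else b)
    return suspended
```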
In a specific application scenario, an architecture flowchart of the above technical solution disclosed in the present application may be as shown in fig. 2, and each function is as follows:
1. Warp (instruction group) stack: controls the execution of the next instruction of a given thread in a warp, and the thread divergence and reconvergence points (thread divergence and reconvergence have the same meaning as the corresponding concepts in the prior art);
2. Warp scoreboard: checks for write-after-write and read-after-write conflicts;
3. To-be-issued buffer and DU interaction: buffers the warps awaiting issue and issues instructions via the issue module; and determines the computing resources allocated to an instruction based on its operands and operators;
4. Operand pre-read: by pre-reading the operands and operators, determines whether the LSU needs to be started, i.e., whether the LSU must be used to fetch operands into a cache; and, also based on the operands and operators, determines the storage resources allocated;
5. Instruction pipeline: comprises instruction fetch, prediction, SM/SFU/LU priority arbitration, PC decoding and write-back. Fetching takes out the instruction currently to be executed; prediction and SM/SFU/LU priority arbitration predict the instruction to be executed after the current instruction finishes; PC decoding translates the instruction into information that the computing resources can recognize; and write-back writes results back in the arrangement form of a single warp after that warp completes;
6. PC_CACHE: a cache for the instruction PC, used to fetch instructions;
7. Decoding: preliminarily decodes an instruction in a warp to obtain its key bits, determines simple correlations among instructions based on those bits, and executes instructions accordingly; if several instructions perform a large amount of repeated work, they can be scheduled together to speed up computation. The correlation bits in an instruction are the bits of its operands and operators, and the correlation is whether two warp instructions affect each other's operation results;
8. Thread buffering and warp queue ordering/insertion: obtains the correlation between warps and sets the arrangement order of the warps according to that correlation.
Accordingly, the functions of a thread in the architecture flowchart are thread buffering, warp queue ordering/insertion, decoding, PC caching, instruction pipelining, operand pre-reading, and resource allocation and issue based on the to-be-issued buffer and DU interaction.
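Putting the pipeline together, the scheduling loop fetches the current instruction, executes it on resources allocated in advance, and meanwhile predicts the next (target) instruction and allocates its resources. A minimal sketch under assumed callback names (`allocate`, `execute` and `predict` are placeholders for the example, not the patented modules):

```python
def run_group(instrs, allocate, execute, predict=None):
    """Run one instruction group: while the current instruction executes
    on pre-allocated computing/storage resources, the target (predicted
    next) instruction already has its resources allocated."""
    # Default predictor: simply the next instruction in program order.
    predict = predict or (lambda seq, i: seq[i + 1] if i + 1 < len(seq) else None)
    trace = []
    i = 0
    current = instrs[0] if instrs else None
    res = allocate(current) if current is not None else None
    while current is not None:
        target = predict(instrs, i)                                  # predict the next instruction
        next_res = allocate(target) if target is not None else None  # pre-allocate for it
        trace.append(execute(current, res))                          # run on resources allocated earlier
        current, res, i = target, next_res, i + 1
    return trace
```

The point of the overlap is that by the time the current instruction completes, the target instruction's resources are already in place, so it can be promoted to current and executed immediately.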
An embodiment of the present invention further provides an instruction group scheduling architecture, as shown in fig. 3, which may include:
a thread processing module 11, configured to: divide the threads contained in an input thread group into different instruction groups, and sequentially determine each instruction group as the current instruction group according to the arrangement sequence of the instruction groups;
an instruction pipeline module 12 for: taking out the current instruction to be executed from the current instruction group as the current instruction, predicting the instruction to be executed after the current instruction is executed, determining the predicted instruction as a target instruction, reading instruction information of the target instruction, and distributing corresponding computing resources and storage resources for the target instruction based on the instruction information; the instruction information comprises operands and operators;
an instruction execution module 13, configured to: and after the current instruction is taken out, executing the current instruction by using the computing resources and the storage resources distributed for the current instruction, determining that the target instruction is the current instruction after the current instruction is executed, and returning to the step of executing the current instruction by using the computing resources and the storage resources distributed for the current instruction.
The instruction group scheduling architecture provided in the embodiment of the present invention may further include:
an instruction group ordering module, configured to: after the threads contained in the input thread group are divided into different instruction groups, analyze the correlation among the instruction groups; if no correlation exists among the instruction groups, sort the instruction groups in order of priority from high to low; if any instruction groups are correlated, sort those instruction groups according to the correlation, and sort the remaining instruction groups in order of priority from high to low.
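The ordering rule of this module can be illustrated as follows. This is a sketch under assumptions: each group carries a numeric priority, correlated groups are given as a list already in their required dependency order, and, since the text does not specify where the correlated block sits relative to the rest, it is simply placed first here:

```python
def order_groups(groups, correlated=()):
    """groups: list of (name, priority).  correlated: names of groups
    that must keep a fixed dependency order, given in that order.
    Correlated groups keep their given order; all other groups are
    sorted by priority, from high to low."""
    by_corr = {name: i for i, name in enumerate(correlated)}
    # Groups with a correlation keep the dependency order.
    fixed = sorted((g for g in groups if g[0] in by_corr),
                   key=lambda g: by_corr[g[0]])
    # The remaining groups are ordered by priority, high to low.
    rest = sorted((g for g in groups if g[0] not in by_corr),
                  key=lambda g: g[1], reverse=True)
    return fixed + rest
```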
In an instruction group scheduling architecture provided in an embodiment of the present invention, an instruction execution module may include:
an instruction execution unit, configured to: judge whether a read-after-write conflict or a write-after-write conflict exists among the current instructions, and if so, control the current instructions having the read-after-write or write-after-write conflict to be executed sequentially using the computing resources and storage resources allocated for them; the number of current instructions is plural.
The instruction group scheduling architecture provided in the embodiment of the present invention may further include:
a real-time monitoring module, configured to: monitor each instruction being executed in real time; if a read-after-write or write-after-write conflict is detected, suspend the instruction that started executing later among the conflicting instructions, and execute it only after the conflicting instruction that started executing first has finished.
An embodiment of the present invention further provides an instruction group scheduling apparatus, which may include:
a memory for storing a computer program;
a processor for implementing the steps of the instruction group scheduling method described in any one of the above when executing the computer program.
The embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the instruction group scheduling method described above are implemented.
It should be noted that, for the description of the relevant parts of the instruction group scheduling architecture, device and storage medium provided in the embodiments of the present invention, reference is made to the detailed description of the corresponding parts of the instruction group scheduling method provided in the embodiments of the present invention, which is not repeated here. In addition, the parts of the above technical solutions provided in the embodiments of the present invention whose implementation principles are consistent with those of the corresponding technical solutions in the prior art are not described in detail, so as to avoid redundant description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for instruction group scheduling, comprising:
dividing threads contained in an input thread group into different instruction groups, and sequentially determining each instruction group as a current instruction group according to the arrangement sequence of the instruction groups;
taking out the current instruction to be executed from the current instruction group as the current instruction, and executing the current instruction by using the computing resource and the storage resource distributed for the current instruction; after the current instruction is taken out, predicting the instruction which needs to be executed after the current instruction is executed, determining the predicted instruction as a target instruction, reading instruction information of the target instruction, and distributing corresponding computing resources and storage resources for the target instruction based on the instruction information; the instruction information comprises operands and operators;
and after the current instruction is executed, determining that the target instruction is the current instruction, and returning to the step of executing the current instruction by using the computing resources and the storage resources distributed for the current instruction.
2. The method of claim 1, wherein after dividing the threads included in the input thread group into different instruction groups, further comprising:
analyzing the correlation among the instruction groups, and if the correlation does not exist among the instruction groups, sequencing the instruction groups according to the sequence of the priorities of the instruction groups from high to low; if any instruction group has correlation, the instruction groups are sorted according to the correlation, and the instruction groups except the instruction groups are sorted according to the order of the priorities from high to low.
3. The method of claim 2, wherein the number of current instructions is plural; executing the current instruction, including:
and judging whether read-after-write conflicts or write-after-write conflicts exist among the current instructions, if so, controlling the current instructions with the read-after-write conflicts or write-after-write conflicts to be sequentially executed by utilizing the computing resources and the storage resources distributed for the current instructions.
4. The method of claim 3, further comprising:
monitoring each instruction being executed in real time, if the read-after-write conflict or the write-after-write conflict is monitored, suspending the instruction which is started to be executed later in the instructions which generate the read-after-write conflict or the write-after-write conflict, and after the instruction which is started to be executed first in the instructions which generate the read-after-write conflict or the write-after-write conflict is completely executed, executing the instruction which is started to be executed later.
5. An instruction set scheduling architecture, comprising:
a thread processing module to: dividing threads contained in an input thread group into different instruction groups, and sequentially determining each instruction group as a current instruction group according to the arrangement sequence of the instruction groups;
an instruction flow module to: taking out the current instruction to be executed from the current instruction group as the current instruction, predicting the instruction to be executed after the current instruction is executed, determining the predicted instruction as a target instruction, reading instruction information of the target instruction, and distributing corresponding computing resources and storage resources for the target instruction based on the instruction information; the instruction information comprises operands and operators;
an instruction execution module to: and after the current instruction is taken out, executing the current instruction by using the computing resources and the storage resources distributed for the current instruction, determining that the target instruction is the current instruction after the current instruction is executed, and returning to the step of executing the current instruction by using the computing resources and the storage resources distributed for the current instruction.
6. The architecture of claim 5, further comprising:
an instruction group ordering module to: after dividing the threads contained in the input thread group into different instruction groups, analyzing the correlation among the instruction groups, and if no correlation exists among the instruction groups, sorting the instruction groups in order of priority from high to low; if any instruction groups are correlated, sorting those instruction groups according to the correlation, and sorting the remaining instruction groups in order of priority from high to low.
7. The architecture of claim 6, wherein the instruction execution module comprises:
an instruction execution unit to: judging whether read-after-write conflicts or write-after-write conflicts exist among the current instructions, if so, controlling the current instructions with the read-after-write conflicts or write-after-write conflicts to be sequentially executed by utilizing computing resources and storage resources distributed for the current instructions; the number of current instructions is plural.
8. The architecture of claim 7, further comprising:
a real-time monitoring module for: monitoring each instruction being executed in real time, if the read-after-write conflict or the write-after-write conflict is monitored, suspending the instruction which is started to be executed later in the instructions which generate the read-after-write conflict or the write-after-write conflict, and after the instruction which is started to be executed first in the instructions which generate the read-after-write conflict or the write-after-write conflict is completely executed, executing the instruction which is started to be executed later.
9. An instruction group scheduling apparatus, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the instruction group scheduling method according to any one of claims 1 to 4 when executing the computer program.
10. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, carries out the steps of the instruction group scheduling method according to any one of claims 1 to 4.
CN202010482280.XA 2020-05-28 2020-05-28 Instruction group scheduling method, architecture, equipment and storage medium Active CN111708622B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010482280.XA CN111708622B (en) 2020-05-28 2020-05-28 Instruction group scheduling method, architecture, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN111708622A true CN111708622A (en) 2020-09-25
CN111708622B CN111708622B (en) 2022-06-10

Family

ID=72537432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010482280.XA Active CN111708622B (en) 2020-05-28 2020-05-28 Instruction group scheduling method, architecture, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111708622B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5872951A (en) * 1996-07-26 1999-02-16 Advanced Micro Design, Inc. Reorder buffer having a future file for storing speculative instruction execution results
CN1542607A (en) * 2003-04-21 2004-11-03 International Business Machines Corp Simultaneous multithread processor and method for improving performance
CN101606130A (en) * 2007-02-06 2009-12-16 International Business Machines Corp Method and apparatus for enabling resource allocation identification at the instruction level of a processor system
CN107277125A (en) * 2017-06-13 2017-10-20 Wangsu Science & Technology Co Ltd File prefetch instruction pushing method, apparatus and file prefetching system
CN108595258A (en) * 2018-05-02 2018-09-28 Beihang University Dynamic expansion method for GPGPU register files
CN108829457A (en) * 2018-05-29 2018-11-16 OPPO Guangdong Mobile Telecommunications Corp Ltd Application program prediction model updating method, apparatus, storage medium and terminal
US20190056950A1 (en) * 2017-08-18 2019-02-21 International Business Machines Corporation Determining and predicting affiliated registers based on dynamic runtime control flow analysis

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5872951A (en) * 1996-07-26 1999-02-16 Advanced Micro Design, Inc. Reorder buffer having a future file for storing speculative instruction execution results
CN1542607A (en) * 2003-04-21 2004-11-03 International Business Machines Corp Simultaneous multithread processor and method for improving performance
CN101606130A (en) * 2007-02-06 2009-12-16 International Business Machines Corp Method and apparatus for enabling resource allocation identification at the instruction level of a processor system
CN107277125A (en) * 2017-06-13 2017-10-20 Wangsu Science & Technology Co Ltd File prefetch instruction pushing method, apparatus and file prefetching system
US20190056950A1 (en) * 2017-08-18 2019-02-21 International Business Machines Corporation Determining and predicting affiliated registers based on dynamic runtime control flow analysis
US20190056945A1 (en) * 2017-08-18 2019-02-21 International Business Machines Corporation Determining and predicting affiliated registers based on dynamic runtime control flow analysis
CN108595258A (en) * 2018-05-02 2018-09-28 Beihang University Dynamic expansion method for GPGPU register files
CN108829457A (en) * 2018-05-29 2018-11-16 OPPO Guangdong Mobile Telecommunications Corp Ltd Application program prediction model updating method, apparatus, storage medium and terminal

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
H. HOMAYOUN ET AL: "Thread scheduling based on low-quality instruction prediction for simultaneous multithreaded processors", 《THE 3RD INTERNATIONAL IEEE-NEWCAS CONFERENCE, 2005.》 *
FANG JUAN ET AL: "An Improved Hardware Prefetching Technique for Multi-core Processors", 《COMPUTER SCIENCE》 *
CAI WEIGUANG ET AL: "A Method for Early Determination of Instruction-Data Dependency in RISC-DSP Processors", 《JOURNAL OF ELECTRONICS & INFORMATION TECHNOLOGY》 *
GU QING ET AL: "Implementation and Optimization of a High-speed AES Algorithm Based on GPGPU and CUDA", 《JOURNAL OF THE GRADUATE SCHOOL OF THE CHINESE ACADEMY OF SCIENCES》 *

Also Published As

Publication number Publication date
CN111708622B (en) 2022-06-10

Similar Documents

Publication Publication Date Title
US8082420B2 (en) Method and apparatus for executing instructions
US7418576B1 (en) Prioritized issuing of operation dedicated execution unit tagged instructions from multiple different type threads performing different set of operations
KR101638225B1 (en) Executing instruction sequence code blocks by using virtual cores instantiated by partitionable engines
JP5177141B2 (en) Arithmetic processing device and arithmetic processing method
US9652243B2 (en) Predicting out-of-order instruction level parallelism of threads in a multi-threaded processor
EP1916601A2 (en) Multiprocessor system
KR101730282B1 (en) Select logic using delayed reconstructed program order
US9875139B2 (en) Graphics processing unit controller, host system, and methods
US8997071B2 (en) Optimized division of work among processors in a heterogeneous processing system
KR20140018946A (en) Register file segments for supporting code block execution by using virtual cores instantiated by partitionable engines
KR20140018945A (en) Memory fragments for supporting code block execution by using virtual cores instantiated by partitionable engines
US9268595B2 (en) Scheduling thread execution based on thread affinity
CN110308982B (en) Shared memory multiplexing method and device
CN108549574A (en) Threading scheduling management method, device, computer equipment and storage medium
CN107729267B (en) Distributed allocation of resources and interconnect structure for supporting execution of instruction sequences by multiple engines
Yu et al. Smguard: A flexible and fine-grained resource management framework for gpus
CN111708639A (en) Task scheduling system and method, storage medium and electronic device
US11669366B2 (en) Reduction of a number of stages of a graph streaming processor
US11875425B2 (en) Implementing heterogeneous wavefronts on a graphics processing unit (GPU)
US20150212859A1 (en) Graphics processing unit controller, host system, and methods
CN111708622B (en) Instruction group scheduling method, architecture, equipment and storage medium
CN116795503A (en) Task scheduling method, task scheduling device, graphic processor and electronic equipment
CN117093335A (en) Task scheduling method and device for distributed storage system
KR102210765B1 (en) A method and apparatus for long latency hiding based warp scheduling
US11734065B2 (en) Configurable scheduler with pre-fetch and invalidate threads in a graph stream processing system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant