CN115543448A

CN115543448A - Dynamic instruction scheduling method on data flow architecture and data flow architecture

Info

Publication number: CN115543448A
Application number: CN202211265787.5A
Authority: CN
Inventors: 王飞; 姜志颖; 栾国庆; 肖开明
Original assignee: Suzhou Ruixin Integrated Circuit Technology Co ltd
Current assignee: Suzhou Ruixin Integrated Circuit Technology Co ltd
Priority date: 2022-10-17
Filing date: 2022-10-17
Publication date: 2022-12-30

Abstract

The invention provides a dynamic instruction scheduling method on a data flow architecture, and a data flow architecture, which solve the prior-art problem of local load imbalance among computing nodes during dynamic execution on a data flow architecture. In the method provided in the embodiments of the present invention, when data flow instructions that have been statically allocated to a computing node are scheduled, the load conditions of all neighboring nodes of the computing node are taken into account, that is, the instruction to be scheduled is selected based on the load conditions of all neighboring nodes of the computing node. For example, an instruction whose destination direction has a low load is preferentially selected for scheduling, so that the data supply to the lightly loaded computing node is accelerated and its computing load is raised. This alleviates local load imbalance among computing nodes during dynamic execution of a data flow program and thereby improves the execution efficiency of the data flow architecture.

Description

Dynamic instruction scheduling method on data flow architecture and data flow architecture

Technical Field

The invention relates to the field of computer architecture, in particular to a dynamic instruction scheduling method on a data flow architecture and the data flow architecture.

Background

With the development of computer technology and increasing competition, high-performance computing is increasingly applied in various fields to solve practical problems encountered in scientific research and social production. In the field of high-performance computing, data flow computing exhibits good computing performance and applicability. A data flow architecture generally includes several or several tens of computing nodes (PEs), each characterized by strong computing capability, weak control capability, and low complexity. The computing nodes are interconnected through a network on chip (NoC, generally a 2D mesh), and an on-chip router (Router) is responsible for transferring operands between the computing nodes (PEs). In the data flow architecture, a data flow program is represented by a data flow graph, in which each node represents an instruction and each edge represents a dependency between one instruction and another. The basic principle of data flow instruction execution is as follows: if all source operands of an instruction are ready and the downstream instruction has a free data slot for receiving data, the instruction can be issued and executed, and its execution result is not written into a shared register or shared cache but is passed directly to the downstream target instruction along the dependency edge. For a data flow architecture, how to better schedule a data flow graph onto a plurality of computing nodes for execution is a key problem that urgently needs to be solved, and a good instruction scheduling algorithm plays a crucial role in improving data flow computing performance.
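The firing rule described above can be illustrated with a minimal sketch. It is not the hardware described in this application: the names `Instruction`, `operands_ready`, `free_slots`, and `can_fire` are hypothetical, and checking every consumer (rather than a single downstream instruction) is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Instruction:
    """Hypothetical model of one node of a data flow graph."""
    name: str
    num_sources: int                 # source operands the instruction needs
    operands_ready: int = 0          # source operands that have arrived so far
    free_slots: int = 1              # free data slots left for incoming results
    consumers: list = field(default_factory=list)  # downstream instructions


def can_fire(inst: Instruction) -> bool:
    """Firing rule: all source operands are ready and every consumer can accept data."""
    sources_ready = inst.operands_ready == inst.num_sources
    consumers_free = all(c.free_slots > 0 for c in inst.consumers)
    return sources_ready and consumers_free


add = Instruction("add", num_sources=2)
mul = Instruction("mul", num_sources=2, operands_ready=2, consumers=[add])
print(can_fire(mul))   # mul has both operands and add has a free slot
add.free_slots = 0
print(can_fire(mul))   # add can no longer accept a result
```

Because the result goes straight to the consumer's data slot, a full slot downstream is enough to stall an otherwise ready instruction.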

Instruction scheduling is generally divided into static scheduling implemented in software and dynamic scheduling implemented in hardware. In a data flow architecture, because the communication pattern between instructions is visible to software, static scheduling is usually adopted: the compiler is responsible for the static placement of instructions on the data flow computing nodes, and when compiling a program into a data flow graph it jointly considers factors such as the balance of computing load, the communication distance between instructions, and conflicts among arithmetic units in order to find a better static scheduling result. Static scheduling can ensure that the global load of the computing nodes is balanced (over the whole execution of a data flow program, the total amount of computation on each computing node is essentially the same). During dynamic execution, however, the computing nodes may suffer from local load imbalance (within a given period of execution, only some of the computing nodes are busy while the others sit idle, waiting). Local load imbalance wastes the computing power of the functional units of the computing nodes and lowers the execution efficiency of the data flow architecture.
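The difference between global balance and local imbalance can be made concrete with a small sketch. All numbers and the names `work_per_window`, `global_load`, and `local_spread` are hypothetical and chosen only for illustration.

```python
# Hypothetical numbers: instructions executed by two computing nodes
# in four consecutive time windows of one data flow program run.
work_per_window = {
    "PE0": [8, 0, 8, 0],
    "PE1": [0, 8, 0, 8],
}

# Global load: total work per node over the whole run.
global_load = {pe: sum(w) for pe, w in work_per_window.items()}

# Local load: gap between the busiest and the idlest node inside each window.
num_windows = len(work_per_window["PE0"])
local_spread = [
    max(w[i] for w in work_per_window.values())
    - min(w[i] for w in work_per_window.values())
    for i in range(num_windows)
]

print(global_load)    # equal totals: globally balanced
print(local_spread)   # non-zero in every window: locally imbalanced
```

In this made-up trace both nodes do the same total work, yet in every window one of them is idle, which is the situation a static placement alone cannot detect.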

Disclosure of Invention

In view of the deficiencies of the prior art, the invention provides a dynamic instruction scheduling method on a data flow architecture, and a data flow architecture, which can solve the problem of local load imbalance among computing nodes during dynamic execution on the data flow architecture, thereby improving the execution efficiency of the data flow architecture.

Specifically, the invention provides the following technical scheme:

As a first aspect of the present invention, the present invention provides a method for dynamically scheduling instructions on a data flow architecture, including:

calculating load information of the computing node at a time point;

receiving load information of a plurality of neighboring nodes neighboring the computing node, wherein the computing node is respectively connected with the plurality of neighboring nodes through a dedicated interconnection bus;

selecting a priority scheduling instruction from an instruction storage unit according to the load information of the plurality of neighboring nodes; and

executing the priority scheduling instruction.

In an embodiment of the present invention, selecting a priority scheduling instruction from an instruction storage unit according to the load information of the plurality of neighboring nodes includes:

finding, according to the load information of the plurality of neighboring nodes, the neighboring node with the lowest dynamic execution load in the destination direction among the neighboring nodes; and

selecting a priority scheduling instruction from the instruction storage unit, wherein the destination direction of the priority scheduling instruction is the neighboring node with the lowest dynamic execution load in the destination direction.

In an embodiment of the present invention, finding the neighboring node with the lowest dynamic execution load in the destination direction among the neighboring nodes according to the load information of the plurality of neighboring nodes includes:

finding, according to the load information of the plurality of neighboring nodes, the neighboring node with the fewest instructions in the "ready" state;

wherein the neighboring node with the lowest dynamic execution load in the destination direction is the neighboring node with the fewest instructions in the "ready" state.

In one embodiment of the present invention, the time point is a time beat in the instruction pipeline.

In an embodiment of the present invention, the number of the neighboring nodes of the computing node is four.

As a second aspect of the present invention, the present invention also provides a data flow architecture, including:

a plurality of computing nodes, two adjacent computing nodes connected by a dedicated interconnect bus;

wherein one of the compute nodes comprises:

an instruction storage unit storing instructions;

a load counting module for calculating the load information of the computing node in real time, wherein the computing node is connected to each of the neighboring nodes through a dedicated interconnect bus;

a transceiver module, communicatively connected to the load counting module, for receiving in real time the load information of a plurality of neighboring nodes adjacent to the computing node and sending the load information of the computing node to the neighboring nodes;

an instruction transmission selection module for selecting a priority scheduling instruction from the instruction storage unit according to the load information of the plurality of neighboring nodes; and

an arithmetic unit for executing the prioritized scheduling instruction.

In an embodiment of the present invention, when the computing node can hold at most 2^N instructions, the line width of the dedicated interconnect bus is 2N, where N is an integer greater than or equal to one.

When data flow instructions that have been statically allocated to a computing node are scheduled, that is, when an instruction on a computing node is selected for execution, the load conditions of all neighboring nodes of the computing node are taken into account: the instruction to be scheduled is selected based on the load conditions of all neighboring nodes. For example, an instruction whose destination direction has a low load is preferentially selected, so that the data supply to a lightly loaded computing node is accelerated and its computing load is raised. This alleviates local load imbalance among computing nodes during dynamic execution of a data flow program and thereby improves the execution efficiency of the data flow architecture.

Drawings

FIG. 1 is a flowchart illustrating a method for dynamically scheduling instructions on a dataflow architecture according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a method for dynamically scheduling instructions on a dataflow architecture according to another embodiment of the present invention;

FIG. 3 is a block diagram of a dataflow architecture according to an embodiment of the present invention;

FIG. 4 is a block diagram of a compute node in a dataflow architecture according to an embodiment of the present invention;

FIG. 5 is a schematic diagram illustrating dynamic instruction scheduling of a compute node in cycle1 according to an embodiment of the present invention;

FIG. 6 is a schematic diagram illustrating dynamic instruction scheduling of a compute node in cycle2 according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.

Fig. 1 is a flowchart illustrating a method for dynamically scheduling instructions on a dataflow architecture according to an embodiment of the present invention, where as shown in fig. 1, the method for dynamically scheduling instructions on the dataflow architecture includes the following steps:

Step S101: calculating load information of the computing node at a time point.

Specifically, a time point is one time beat in the instruction pipeline.

Step S102: receiving load information of a plurality of neighboring nodes adjacent to the computing node, wherein the computing node is connected to each of the plurality of neighboring nodes through a dedicated interconnect bus.

Specifically, the dedicated interconnect bus is used to collect the load information of the neighboring nodes in real time: through it, the computing node obtains the load information of its neighboring computing nodes in real time and, at the same time, transmits its own load information to those neighboring computing nodes.

Step S103: selecting a priority scheduling instruction from an instruction storage unit according to the load information of the plurality of neighboring nodes; and

Step S104: executing the priority scheduling instruction.

That is, the instruction that the computing node selects from the instruction storage unit is executed in the arithmetic unit of the current computing node, and the execution result is sent to other computing nodes through the Router.
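Steps S101 to S104 can be sketched for a single time beat as follows. This is a software illustration under stated assumptions, not the hardware of the embodiment: the function names, the `(name, destination)` tuple layout, and the fallback to the first ready instruction when no instruction targets the least-loaded neighbor are all hypothetical. The load metric (the number of instructions in the "ready" state) follows the description.

```python
from typing import Dict, List, Optional, Tuple

ReadyInst = Tuple[str, str]   # (instruction name, destination direction)


def own_load(ready: List[ReadyInst]) -> int:
    """S101: load of this node = number of its instructions in the ready state."""
    return len(ready)


def schedule_one_beat(
    ready: List[ReadyInst],
    neighbor_loads: Dict[str, int],
) -> Tuple[int, Optional[str]]:
    """Run S101-S104 once; return (load to send to neighbors, instruction executed)."""
    load = own_load(ready)                                   # S101
    if not ready:
        return load, None
    # S102: neighbor_loads stands for the values received over the dedicated bus.
    target = min(neighbor_loads, key=neighbor_loads.get)     # S103: least-loaded neighbor
    # Assumed fallback: if nothing targets that neighbor, take the first ready instruction.
    chosen = next((inst for inst in ready if inst[1] == target), ready[0])
    ready.remove(chosen)                                     # S104: issue to the arithmetic unit
    return load, chosen[0]


ready = [("inst0", "east"), ("inst1", "south")]
print(schedule_one_beat(ready, {"east": 5, "south": 1, "west": 3, "north": 2}))
```

In the usage line the south neighbor reports the fewest ready instructions, so the instruction destined for south is issued ahead of the older one destined for east.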

When data flow instructions that have been statically allocated to a computing node are scheduled, that is, when a computing node selects an instruction for execution, the load conditions of all neighboring nodes of the computing node are taken into account: the instruction to be scheduled is selected based on the load conditions of all neighboring nodes. For example, an instruction whose destination direction has a low load is preferentially selected, so that the data supply to a lightly loaded computing node is accelerated and its computing load is raised. This alleviates local load imbalance among computing nodes during dynamic execution of a data flow program and improves the execution efficiency of the data flow architecture.

Specifically, as shown in FIG. 2, step S103 of selecting a priority scheduling instruction from the instruction storage unit according to the load information of the plurality of neighboring nodes includes the following steps:

Step S1031: finding, according to the load information of the plurality of neighboring nodes, the neighboring node with the lowest dynamic execution load in the destination direction among the neighboring nodes;

Step S1032: selecting a priority scheduling instruction from the instruction storage unit, wherein the destination direction of the priority scheduling instruction is the neighboring node with the lowest dynamic execution load in the destination direction.

Specifically, when a computing node is surrounded by four neighboring computing nodes to its east, south, west, and north, and the east and north neighboring computing nodes both have the fewest instructions in the "ready" state, a priority scheduling instruction is selected from the instruction storage unit whose destination direction is either the east or the north neighboring computing node.

That is, in the embodiment of the present invention, when data flow instructions that have been statically allocated to a computing node are scheduled and an instruction is selected for priority scheduling, the selection is based on the neighboring node that has the fewest instructions in the "ready" state among all neighboring nodes of the computing node, which is the neighboring node with the lowest dynamic execution load in the destination direction. In other words, all neighboring nodes of the computing node are considered when the instruction to be scheduled is selected, and the computing load of a lightly loaded computing node is raised by accelerating its data supply, which alleviates local load imbalance among computing nodes during dynamic execution of a data flow program and improves the execution efficiency of the data flow architecture.

Specifically, finding the neighboring node with the lowest dynamic execution load in the destination direction among the neighboring nodes, that is, step S1031, includes:

finding, according to the load information of the plurality of neighboring nodes, the neighboring node with the fewest instructions in the "ready" state; the neighboring node with the lowest dynamic execution load in the destination direction is the neighboring node with the fewest instructions in the "ready" state.
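Steps S1031 and S1032, including the case where two neighbors tie for the lowest load, can be sketched as below. The function names, the dictionary layout, and the example counts are hypothetical; returning `None` when no ready instruction points at a lowest-load neighbor is an assumption, since the description does not state what happens in that case.

```python
from typing import Dict, List, Optional, Tuple


def lowest_load_neighbors(neighbor_loads: Dict[str, int]) -> List[str]:
    """S1031: all neighbor directions tied for the fewest ready instructions."""
    fewest = min(neighbor_loads.values())
    return [d for d, count in neighbor_loads.items() if count == fewest]


def pick_instruction(
    ready: List[Tuple[str, str]],      # (instruction name, destination direction)
    neighbor_loads: Dict[str, int],
) -> Optional[str]:
    """S1032: first ready instruction whose destination is a lowest-load neighbor."""
    targets = lowest_load_neighbors(neighbor_loads)
    for name, destination in ready:
        if destination in targets:
            return name
    return None   # assumed behavior: nothing targets a lowest-load neighbor


loads = {"east": 1, "south": 4, "west": 3, "north": 1}   # east and north tie
ready = [("inst0", "south"), ("inst1", "north"), ("inst2", "east")]
print(lowest_load_neighbors(loads))    # ['east', 'north']
print(pick_instruction(ready, loads))  # inst1
```

With east and north tied, either direction qualifies, so the first ready instruction aimed at one of them is chosen.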

When data flow instructions that have been statically allocated to the computing nodes are scheduled, that is, when a computing node selects an instruction for execution, the computing load of a lightly loaded computing node is raised by accelerating its data supply, which alleviates local load imbalance among computing nodes during dynamic execution of a data flow program and improves the execution efficiency of the data flow architecture.

Optionally, the number of the neighboring nodes of a computing node is four, for example, there may be four neighboring nodes around a computing node, which are located in the east, south, west, and north directions of the computing node.

As a second aspect of the present invention, an embodiment of the present invention provides a data flow architecture, as shown in fig. 3, the data flow architecture includes:

a plurality of computing nodes PE, wherein two adjacent computing nodes PE are connected by a dedicated interconnect bus;

as shown in FIG. 4, one of the computing nodes PE includes:

an instruction storage unit 100 for storing instructions;

a load counting module 200, configured to calculate load information of the computing node PE in real time, where the computing node PE is connected to a plurality of neighboring nodes through a dedicated interconnect bus 500;

a transceiver module 300, configured to receive, in real time, load information of the plurality of neighboring nodes adjacent to the computing node PE, and to send the load information of the computing node PE to the neighboring nodes;

an instruction transmission selection module 400, configured to select a priority scheduling instruction from the instruction storage unit 100 according to the load information of the plurality of neighboring nodes; and

an arithmetic unit 600, configured to execute the priority scheduling instruction.

That is, the instruction selected from the instruction storage unit by the instruction transmission selection module 400 in the computing node is executed in the arithmetic unit 600 of the current computing node. The result of the execution is sent to other computing nodes through Router.

Specifically, when the instruction transmission selection module 400 selects the priority scheduling instruction, it first finds, according to the load information of the plurality of neighboring nodes of the computing node PE, the neighboring node with the fewest instructions in the "ready" state, which is the neighboring node with the lowest dynamic execution load in the destination direction. It then selects a priority scheduling instruction from the instruction storage unit, wherein the destination direction of the priority scheduling instruction is that neighboring node. For example, when a computing node PE has four neighboring nodes to its east, south, west, and north, their dynamic execution loads are compared; if the load of the north neighboring node is the lowest (for instance, lower than that of the east neighboring node), the north neighboring node is the neighboring node with the lowest dynamic execution load, and north is the destination direction of the scheduled instruction.

That is, in the embodiment of the present invention, when scheduling a data stream instruction that has been statically allocated on a compute node, and selecting an instruction for preferential scheduling, the basis for instruction selection is: when the scheduling instruction is selected, all the adjacent nodes of the computing node PE are considered, and the computing load is increased by accelerating the data supply of the computing node PE with low load, so as to alleviate the problem of local load imbalance of the computing node PE during the dynamic execution of the data flow program, thereby increasing the execution efficiency of the data flow architecture.

Optionally, when the computing node PE can hold at most 2^N instructions, the line width of the dedicated interconnect bus is 2N, where N is an integer greater than or equal to one.

Optionally, the computing nodes PE are arranged in an array: 16 computing nodes PE are interconnected through a 2D mesh network, and dedicated interconnect buses are further provided between the computing nodes PE. As shown in FIG. 3, each computing node PE other than those at the array edge is connected to its four neighboring nodes (east, west, south, and north) through dedicated interconnect buses. A computing node PE at the array edge has fewer neighboring nodes and is connected to each of them through a dedicated interconnect bus; for example, a computing node PE at the array edge that has 2 neighboring nodes is connected to those 2 neighboring nodes. When at most 2^N instructions can be stored on each computing node PE, the bit width of the dedicated interconnect bus is 2N (N-bit encoding).
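The mesh connectivity and the 2N bus width can be sketched as follows. The helper names `mesh_neighbors` and `bus_width`, the (row, column) coordinates, and the convention that north means a smaller row index are hypothetical. The description gives only the width 2N with N-bit encoding; one possible reading, not stated in the source, is N bits for each direction of the load exchange.

```python
from typing import Dict, Tuple


def mesh_neighbors(row: int, col: int, rows: int = 4, cols: int = 4) -> Dict[str, Tuple[int, int]]:
    """Neighbors of PE (row, col) in a rows x cols 2D mesh, keyed by direction."""
    candidates = {
        "north": (row - 1, col),   # assumed orientation: row 0 is the north edge
        "south": (row + 1, col),
        "west": (row, col - 1),
        "east": (row, col + 1),
    }
    return {d: (r, c) for d, (r, c) in candidates.items()
            if 0 <= r < rows and 0 <= c < cols}


def bus_width(max_instructions: int) -> int:
    """Dedicated-bus width 2N for a node holding at most 2**N instructions."""
    n = max_instructions.bit_length() - 1
    if max_instructions < 2 or 2 ** n != max_instructions:
        raise ValueError("max_instructions must be 2**N with N >= 1")
    return 2 * n


print(len(mesh_neighbors(1, 1)))  # interior PE
print(len(mesh_neighbors(0, 0)))  # corner PE
print(bus_width(64))              # N = 6
```

In a 4×4 mesh this gives four neighbors for interior nodes and fewer along the boundary, matching the statement that edge nodes connect only to the neighbors they actually have.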

In order to better understand the node scheduling method in the data flow architecture, the following describes the node scheduling method in the data flow architecture by a specific embodiment.

As shown in FIG. 3, the data flow architecture includes 16 computing nodes arranged in a 4×4 array. As shown in FIG. 4, in addition to the instruction storage unit 100, a computing node PE further includes: a load counting module 200 for calculating load information of the computing node in real time, the computing node PE being connected to a plurality of neighboring nodes through dedicated interconnect buses; a transceiver module 300 for receiving, in real time, load information of the plurality of neighboring nodes and sending the load information of the computing node PE to the neighboring nodes; an instruction transmission selection module 400 for selecting a priority scheduling instruction from the instruction storage unit according to the load information of the plurality of neighboring nodes; and an arithmetic unit 600 for executing the priority scheduling instruction, the execution result being sent to other computing nodes through the Router.

The node scheduling method on the data flow architecture is introduced in two stages:

Cycle 1:

As shown in FIG. 5, at this time point the computing node has 4 neighboring computing nodes, located to its south, west, north, and east. The load counting module and the transceiver module provide the load information (the number of "ready" instructions) of the current computing node and of the neighboring computing nodes to the instruction transmission selection module, which sorts the current computing node and the neighboring computing nodes by number of "ready" instructions from smallest to largest (that is, from lowest to highest load): south < west < north < self < east. The sorting result indicates that the south neighboring computing node has the lowest dynamic execution load, so a priority scheduling instruction whose destination direction is south is selected from the instruction storage unit and sent to the arithmetic unit 600 for execution, and the execution result is sent to other computing nodes through the Router.

Cycle 2:

As shown in FIG. 6, at this time point the instruction transmission selection module obtains, based on the numbers of "ready" instructions provided by the load counting module and the transceiver module, the following sorting result: west < east < self < north < south. Therefore, inst0 is selected as the priority scheduling instruction and sent to the arithmetic unit 600 for execution, and the execution result is sent to other computing nodes through the Router.
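The sorting performed in the two cycles can be reproduced with a short sketch. The ready-instruction counts below are hypothetical values chosen only so that the resulting orderings match those stated for Cycle 1 and Cycle 2; the figures' actual counts are not given in the text, and `rank_by_load` is an illustrative name.

```python
from typing import Dict, List


def rank_by_load(ready_counts: Dict[str, int]) -> List[str]:
    """Sort the node itself and its neighbors from lowest to highest load."""
    return sorted(ready_counts, key=ready_counts.get)


# Hypothetical ready-instruction counts chosen to reproduce the two orderings.
cycle1 = {"east": 9, "south": 1, "west": 2, "north": 4, "self": 6}
cycle2 = {"east": 3, "south": 8, "west": 1, "north": 6, "self": 5}

print(" < ".join(rank_by_load(cycle1)))  # south < west < north < self < east
print(" < ".join(rank_by_load(cycle2)))  # west < east < self < north < south
```

The first entry of each ranking is the destination direction that gets priority: south in Cycle 1 and west in Cycle 2.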

An embodiment of the present invention further provides an electronic device, which includes one or more processors and a memory.

The processor may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device to perform desired functions.

The memory may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read-Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by a processor to implement the above-described method for dynamically scheduling instructions on a data flow architecture of various embodiments of the present application and/or other desired functionality. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.

In addition to the above method and apparatus, the embodiments of the present application may also be a computer program product including computer program instructions, which when executed by a processor, cause the processor to execute the steps of the instruction dynamic scheduling method on a data flow architecture according to the embodiment shown in fig. 1 and fig. 2 of the present application described above in the present specification.

The computer program product may be written with program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including an object-oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.

Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions, which, when executed by a processor, cause the processor to perform the steps of the method for dynamically scheduling instructions on a data flow architecture according to various embodiments of the present application described above in this specification.

The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The foregoing describes the general principles of the present application in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the foregoing disclosure is not intended to be exhaustive or to limit the disclosure to the precise details disclosed.

The block diagrams of devices, apparatuses, and systems referred to in this application are only given as illustrative examples and are not intended to require or imply that the connections, arrangements, configurations, etc. must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably herein. The words "or" and "and" as used herein mean, and are used interchangeably with, the term "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".

It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.

The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for dynamically scheduling instructions on a dataflow architecture is characterized by comprising the following steps:

calculating load information of the computing node at a time point;

receiving load information of a plurality of neighboring nodes adjacent to the computing node, wherein the computing node is connected to each of the plurality of neighboring nodes through a dedicated interconnect bus;

selecting a priority scheduling instruction from an instruction storage unit according to the load information of the plurality of neighboring nodes; and

executing the priority scheduling instruction.

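2. The method according to claim 1, wherein selecting a priority scheduling instruction from an instruction storage unit according to the load information of the plurality of neighboring nodes comprises:

finding, according to the load information of the plurality of neighboring nodes, the neighboring node with the lowest dynamic execution load in the destination direction among the neighboring nodes; and

selecting a priority scheduling instruction from the instruction storage unit, wherein the destination direction of the priority scheduling instruction is the neighboring node with the lowest dynamic execution load in the destination direction.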

3. The method according to claim 2, wherein finding the neighboring node with the lowest dynamic execution load in the destination direction among the neighboring nodes according to the load information of the plurality of neighboring nodes comprises:

finding, according to the load information of the plurality of neighboring nodes, the neighboring node with the fewest instructions in the "ready" state;

wherein the neighboring node with the lowest dynamic execution load in the destination direction is the neighboring node with the fewest instructions in the "ready" state.

4. The method of claim 1, wherein the time point is a time beat in an instruction pipeline.

5. The method of claim 1, wherein the number of neighbor nodes of the compute node is four.

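6. A data flow architecture, comprising:

a plurality of computing nodes, two adjacent computing nodes being connected by a dedicated interconnect bus;

wherein one of the computing nodes comprises:

an instruction storage unit storing instructions;

a load counting module for calculating load information of the computing node in real time, wherein the computing node is connected to each of a plurality of neighboring nodes through a dedicated interconnect bus;

a transceiver module, communicatively connected to the load counting module, for receiving in real time load information of the plurality of neighboring nodes adjacent to the computing node and sending the load information of the computing node to the neighboring nodes;

an instruction transmission selection module for selecting a priority scheduling instruction from the instruction storage unit according to the load information of the plurality of neighboring nodes; and

an arithmetic unit for executing the priority scheduling instruction.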

7. The data flow architecture of claim 6, wherein the line width of the dedicated interconnect bus is 2N when the number of instructions that the compute node can hold is at most 2^N, where N is an integer greater than or equal to one.