CN114760241A - Routing method for data flow architecture computing equipment

Info

Publication number: CN114760241A
Authority: CN (China)
Prior art keywords: instruction, processing unit, routing, local processing, data
Legal status: Granted
Application number: CN202210461301.9A
Other languages: Chinese (zh)
Other versions: CN114760241B (en)
Inventors: 吴萌, 李易, 安述倩, 李文明, 叶笑春, 范东睿
Current Assignee: Institute of Computing Technology of CAS
Original Assignee: Institute of Computing Technology of CAS
Application filed by Institute of Computing Technology of CAS
Priority: CN202210461301.9A
Publication of CN114760241A; application granted; publication of CN114760241B
Current legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/16 Multipoint routing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/12 Discovery or management of network topologies
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/74 Address processing for routing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/50 Queue scheduling
    • H04L47/62 Queue scheduling characterised by scheduling criteria
    • H04L47/625 Queue scheduling characterised by scheduling criteria for service slots or service orders
    • H04L47/6275 Queue scheduling characterised by scheduling criteria for service slots or service orders based on priority
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a routing method for use in a computing device with a dataflow architecture. The computing device comprises a plurality of processing units and a plurality of routing nodes; each processing unit is directly connected to one routing node and serves as the local processing unit of that routing node, and the routing nodes are interconnected. The method comprises: at each routing node, receiving data destined for the local processing unit from each direction and maintaining a corresponding buffer queue for the data arriving from each direction, the buffer queues being blocking queues; and, at each routing node, determining a supply priority for the packet at the head of each buffer queue holding data destined for the local processing unit, and selecting, according to the supply priorities, the head packet of one of the queues to send to the local processing unit, wherein the supply priority is related to the number of operands that must still arrive before the instruction in the local processing unit that consumes the operand in the head packet reaches the ready state.

Description

Routing method for data flow architecture computing equipment
Technical Field
The present invention relates to computer architecture, in particular to routing technology in PE (Processing Element) arrays, and more particularly to a routing method in a computing device with a dataflow architecture.
Background
With advances in computer technology research and growing competition, high-performance computing is increasingly applied across many fields to solve practical problems in scientific research and production. In the field of high-performance computing, dataflow computing shows good performance and applicability. A dataflow program is represented as a dataflow graph: each node in the graph represents an instruction, each edge represents a dependency between one instruction and another, and the basic firing rule of a dataflow instruction is: an instruction can issue and execute once all of its source operands are ready and its downstream instruction has a free data slot to receive the result. Execution results are not written to a shared register file or a shared cache; they are passed directly along the dependency edges to the downstream destination instructions.
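To make the firing rule concrete, the following minimal Python sketch models the two conditions stated above; the class and field names are illustrative, not taken from the patent.

```python
# Minimal sketch of the dataflow firing rule described above (names are
# hypothetical): an instruction may issue only when all source operands
# have arrived AND the downstream instruction has a free data slot.
class DataflowInstruction:
    def __init__(self, num_operands, downstream_free_slots):
        self.num_operands = num_operands            # source operands required
        self.arrived = 0                            # operands received so far
        self.downstream_free_slots = downstream_free_slots

    def operand_arrived(self):
        self.arrived += 1

    def can_fire(self):
        return (self.arrived >= self.num_operands
                and self.downstream_free_slots > 0)
```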
A dataflow architecture generally contains several, tens, or even more processing units (Processing Elements, PEs; some literature calls them processing nodes or computing nodes). The processing units are interconnected by a Network on Chip (NoC), in which routing nodes (also called on-chip routers) transfer data between the processing units, forming, for example, the PE array (or computing device) shown in fig. 1. Each processing unit is a processor core with strong computing capability, weak control capability, and low complexity. The routing nodes are responsible for transferring operands between the processing units; operand transfer is a critical path in a dataflow architecture, and its efficiency determines the number of executable instructions on the processing units and hence the overall execution efficiency of the architecture. Designing an efficient routing structure is therefore important for a dataflow architecture.
At present, the NoC most widely used in dataflow architectures is the two-dimensional Mesh (2D Mesh), i.e. the topology shown in fig. 1. Fig. 2 shows the principle of the routing structure corresponding to this topology: a routing node is mainly divided into an input part and an output part and has five directions, namely East, West, South, North and Local, abbreviated E, W, S, N and L respectively, with each direction connected to the four other directions. The routing node needs a route arbitration mechanism, i.e. a process for deciding, when multiple input packets request the same output port, which request should be granted and in which order the packets pass through that port. The route arbitration mechanisms commonly used today are the Round Robin (polling) algorithm and fixed-priority route arbitration.
The polling mechanism is a priority-free, uniform-response arbitration mechanism. It is relatively fair and widely used, but it lacks flexibility when facing specific applications. Most current NoC routing-node arbitration designs respond to the requests at each output port with an arbitration algorithm based on the polling mechanism (the polling algorithm for short). Referring to fig. 3, in cycles 1 through 5 the polling algorithm visits each buffer queue of the same output port evenly, selecting one packet from each queue in turn for transmission. Under this arbitration mode the urgency of packets cannot be distinguished: the packets in every buffer queue are treated equally, and the arbitration controller responds uniformly to the requests at each output port, which is why the polling algorithm lacks flexibility for specific applications.
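As an illustration of this prior-art behavior, the following Python sketch shows one way a polling arbiter over the buffer queues of an output port might look; it is a hedged example with illustrative names, not the patent's implementation.

```python
from collections import deque

def round_robin_select(queues, last_served):
    """Serve the next non-empty buffer queue after `last_served`, wrapping.

    queues: list of deques, one per input direction contending for the port.
    Returns (queue_index, packet), or (last_served, None) if all are empty.
    """
    n = len(queues)
    for offset in range(1, n + 1):
        idx = (last_served + offset) % n
        if queues[idx]:                      # every queue is treated equally
            return idx, queues[idx].popleft()
    return last_served, None
```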
Fixed priority, by contrast, assigns priorities to different classes of data and transmits them in the set priority order. Fixed-priority arbitration mechanisms are simple, but they are prone to "starvation": for example, if the data in one buffer queue has a higher priority than the data in the other queues for a long time, lower-priority data that arrived earlier in the other queues can hardly be sent out. Fixed priority is therefore difficult to apply in on-chip routing.
In a computing device with a dataflow architecture, an instruction in a processing unit reaches the ready state (Ready) only when all of its operands have arrived. A packet destined for an instruction that has already received some operands and needs only a few more to become ready should therefore have higher priority. The polling mechanism does not consider the operand arrival state of instructions, so a large number of instructions may remain in a non-ready state for a long time waiting for one more operand to arrive, limiting the execution efficiency of the computing device.
Therefore, to improve the overall execution efficiency of the dataflow architecture, its routing method needs to be improved.
Disclosure of Invention
It is therefore an object of the present invention to overcome the above-mentioned drawbacks of the prior art and to provide a routing method in a computing device for a data flow architecture.
The object of the invention is achieved by the following technical solutions:
According to a first aspect of the present invention, there is provided a routing method in a computing device with a dataflow architecture, the computing device comprising a plurality of processing units and a plurality of routing nodes, each processing unit being directly connected to one routing node and serving as the local processing unit of that routing node, the routing nodes being interconnected. The routing method comprises: at each routing node, receiving data destined for the local processing unit from each direction and maintaining a corresponding buffer queue for the data from each direction, the buffer queues being blocking queues; and, at each routing node, determining the supply priority of the packet at the head of each buffer queue holding data destined for the local processing unit, and selecting, according to the supply priorities, the head packet of one of the queues to send to the local processing unit, wherein the supply priority is related to the number of operands that must still arrive before the instruction in the local processing unit that consumes the operand in the head packet reaches the ready state.
In some embodiments of the present invention, the routing method further comprises: maintaining an instruction state statistics table at each routing node, the table recording the instruction index and the instruction state of each instruction in the instruction slot of the local processing unit, the instruction state including the number of operands that must still arrive before the instruction reaches the ready state.
In some embodiments of the present invention, the routing method further comprises: parsing the instruction index carried by the packet at the head of a queue, and looking up in the instruction state statistics table the number of operands that must still arrive before the instruction with that index reaches the ready state, so as to determine the supply priority of the head packet.
In some embodiments of the present invention, the instruction states in the instruction state statistics table are synchronized from the instruction slot of the local processing unit to the routing node via a dedicated interconnect bus, wherein the dedicated interconnect bus is used only to synchronize instruction state between the routing node and the local processing unit connected to it.
In some embodiments of the invention, the method further comprises: when the supply priorities of the head packets of the buffer queues holding data destined for the local processing unit are all equal, performing one round of scheduling according to a polling-based route arbitration mechanism.
In some embodiments of the present invention, the routing method further comprises: at each routing node, scheduling data not destined for the local processing unit according to a polling-based route arbitration mechanism.
According to a second aspect of the present invention, there is provided a computing device comprising a plurality of processing units and a plurality of routing nodes, each processing unit being directly connected to one routing node and serving as the local processing unit of that routing node, the routing nodes being interconnected in a predetermined topology. Each routing node comprises a routing algorithm module and a dedicated arbitration controller for scheduling the data destined for the local processing unit from each direction. The routing algorithm module is configured to: determine, according to a routing algorithm, the direction to which currently received data should be sent, collect the data destined for the local processing unit from each direction, and maintain a corresponding buffer queue for the data from each direction, the buffer queues being blocking queues. The dedicated arbitration controller is configured to: determine the supply priority of the packet at the head of each buffer queue holding data destined for the local processing unit, and select, according to the supply priorities, the head packet of one of the queues to send to the local processing unit, wherein the supply priority is related to the number of operands that must still arrive before the instruction in the local processing unit that consumes the operand in the head packet reaches the ready state.
In some embodiments of the invention, each routing node further comprises, for each direction, a polling arbitration controller for sending data to the other directions, the polling arbitration controller being configured to: schedule data not destined for the local processing unit according to a polling-based route arbitration mechanism.
In some embodiments of the present invention, an instruction state statistics table is maintained in the routing node, the table recording the instruction index and the instruction state of each instruction in the instruction slot of the local processing unit, the instruction state including the number of operands that must still arrive before the instruction reaches the ready state.
In some embodiments of the invention, a dedicated interconnect bus is provided between each processing unit and its directly connected routing node, for synchronizing instruction state between the routing node and the local processing unit.
Compared with the prior art, the invention has the following advantage:
according to the supply priority, the invention can simply and efficiently select for transmission the packet that most reduces the waiting of the local processing unit, so that the processing unit obtains the data required by its instructions sooner, thereby improving computing efficiency.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a two-dimensional Mesh topology;
FIG. 2 is a schematic diagram of route arbitration;
FIG. 3 is a schematic diagram illustrating prior art arbitration for output ports to local processing units;
FIG. 4 is a schematic of a topology of a computing device according to one embodiment of the invention;
FIG. 5 is a schematic routing diagram of a computing device according to one embodiment of the present invention;
FIG. 6 is a schematic diagram of route arbitration according to one embodiment of the present invention;
FIG. 7 is a diagram of an instruction state statistics table according to one embodiment of the invention;
FIG. 8 is a diagram of an instruction state statistics table according to another embodiment of the present invention;
FIG. 9 is a schematic diagram of an exemplary route arbitration principle according to the present invention;
FIG. 10 is a schematic of a topology of a computing device according to another embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail by embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As mentioned in the Background section, in a computing device based on a dataflow architecture, using a polling mechanism for data route arbitration is relatively fair but lacks flexibility, while a fixed-priority arbitration mechanism is simple to implement but prone to starvation and therefore difficult to apply in on-chip routing. In a dataflow computing device, an instruction in a processing unit reaches the ready state (Ready) only when all of its operands have arrived, so a packet destined for an instruction that has already received some operands and needs only a few more to become ready should have higher priority. The polling mechanism does not consider the operand arrival state of instructions, so a large number of instructions may remain non-ready for a long time waiting for one more operand, limiting the execution efficiency of the computing device.
Therefore, at each routing node the invention collects the data destined for the local processing unit from each direction and maintains a separate buffer queue for the data from each direction; and at each routing node it determines the supply priority of the packet at the head of each buffer queue holding data destined for the local processing unit and preferentially sends the head packet with the highest supply priority to the local processing unit, the supply priority being determined by the number of operands that must still arrive before the instruction in the local processing unit that consumes the operand in the head packet reaches the ready state. In this way, the instruction that has to wait for the fewest operands before becoming ready is served first, so the local computing unit can execute the computing task corresponding to that instruction sooner, improving computing efficiency; moreover, once that computing task completes, the data corresponding to its result can be forwarded sooner to the other processing units executing subsequent tasks, improving the overall execution efficiency of the dataflow computing device.
In a computing device with a dataflow architecture, the internal processing units usually adopt a two-dimensional or three-dimensional Mesh topology. The following description mainly uses the two-dimensional Mesh topology; it should be understood that the scheme of the present application still applies to computing devices with a three-dimensional Mesh or other topology.
Embodiment 1:
According to an embodiment of the present invention, referring to fig. 4 and 5, there is provided a computing device (with a two-dimensional Mesh topology) comprising a plurality of processing units interconnected by a network on chip. The network on chip comprises a plurality of routing nodes; each processing unit is directly connected to one routing node and serves as the local processing unit of that routing node. Following the usual direction convention for a two-dimensional Mesh, the data traffic of a routing node is divided into five directions, namely East, West, South, North and Local, abbreviated E, W, S, N and L respectively. According to one embodiment of the invention, the routing node comprises a routing algorithm module RT and a dedicated arbitration controller SAC for scheduling the data destined for the local processing unit from each direction. Preferably, the routing algorithm module RT is configured to: determine, according to a routing algorithm, the direction to which currently received data should be sent, collect the data destined for the local processing unit from each direction, and maintain a separate buffer queue for the data from each direction, the buffer queues being blocking queues. Preferably, the dedicated arbitration controller SAC is configured to: determine the supply priority of the packet at the head of each buffer queue holding data destined for the local processing unit, and select, according to the supply priorities, the head packet of one of the queues to send to the local processing unit, where the supply priority is determined by the number of operands that must still arrive before the instruction in the local processing unit that consumes the operand in the head packet reaches the ready state. Preferably, among the head packets of the several queues, the packet with the highest supply priority is sent to the local processing unit first. In this way, the invention can simply and efficiently select for transmission the packet that most reduces the waiting of the local processing unit, so that the processing unit obtains the data required by its instructions sooner, improving computing efficiency.
The technical solution of this embodiment can achieve at least the following beneficial technical effects: from the buffer queues leading from each port direction to the Local processing unit, the scheme selects for transmission the packet (or message packet) that will soonest let an instruction on the computing node reach the ready state (Ready). This accelerates the rate at which instructions in the processing unit become ready and avoids the idle waiting caused by a lack of ready instructions, so that the instruction that has to wait for the fewest operands before becoming ready is served first, the local processing unit can execute the corresponding computing task sooner, and computing efficiency is improved; moreover, once that computing task completes, the data corresponding to its result can be forwarded sooner to the other processing units executing subsequent tasks, improving the overall execution efficiency of the dataflow computing device.
Referring to FIG. 5, which illustrates the functional logic within a processing unit PE and a routing node (Router) according to an embodiment of the present invention, the processing unit sends the state of each instruction in its instruction slot (e.g., instructions inst0, inst1, etc.) to the routing node. As can be seen in the figure, the dedicated arbitration controller is connected to the output buffer queues (the buffer queues described above) that route each direction (East, West, South, North) to the local processing unit; the routing node shown has 4 output buffer queues (queues E, W, S, N) toward the local processing unit. In other embodiments a routing node may have a different number of output buffer queues, but in all cases every output buffer queue toward the local processing unit must be connected to the dedicated arbitration controller. The figure also shows that the instruction state statistics table is connected to the dedicated arbitration controller: the controller parses the packets at the heads of the several output buffer queues from each direction toward the local processing unit, looks up in the instruction state statistics table, by the parsed instruction indexes, the supply priority of the instruction corresponding to each packet, and selects the packet whose instruction has the highest priority for scheduling first.
If the data destined for the other directions were also scheduled by priority, the required computation and the amount of state to transfer would be too large; therefore, to balance fairness and efficiency, as can be seen from fig. 5, packets from each direction to the other output ports are still scheduled by the RR route arbitration mechanism. According to an embodiment of the present invention, each routing node is further provided with, for each direction, a polling arbitration controller RR for sending data to the other directions, the polling arbitration controller being configured to: schedule data not destined for the local processing unit according to the polling-based route arbitration mechanism RR. This can achieve at least the following beneficial technical effects: the invention applies different arbitration mechanisms to data destined for the local unit and for elsewhere. Because each routing node is associated with one local processing unit and judges the priority only of data destined for that unit, the required computation and state transfer are small and the logic is simple; and because data destined for non-local nodes is still scheduled by the polling-based mechanism RR, a routing node never needs to gather instruction state information from multiple processing units, avoiding a complex parameter-transfer structure and a huge amount of computation, thereby improving overall execution efficiency.
Referring to fig. 6, which shows a further structural schematic of a routing node according to an embodiment of the present invention, the routing node is configured to:
receive data at the input ports corresponding to a plurality of directions (in a two-dimensional Mesh topology, these directions comprise local L, east E, west W, south S and north N);
determine with a routing algorithm the direction to which each piece of data should be sent, and maintain, in each direction, a separate buffer queue for the data destined for each other direction (for example, data received from east E is determined by the routing algorithm to be destined for local L, west W, south S or north N; since data destined for a given processing unit is never transmitted back the way it came, data received from east E is not sent back to east E; a buffer queue is then maintained for the data destined for each of local L, west W, south S and north N); a sketch of one possible direction decision appears after this list;
schedule the data destined for the local processing unit from each direction at the corresponding output port with a dedicated arbitration controller (SAC);
and schedule the data destined for the non-local directions (east E, west W, south S and north N) at the corresponding output ports with the polling-based route arbitration mechanism RR.
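The patent does not fix a particular routing algorithm for the RT module; one common choice for a two-dimensional Mesh is XY dimension-order routing, sketched below in Python purely as an illustration. The function name and the coordinate convention are assumptions, not part of the patent.

```python
def xy_route(cur_x, cur_y, dst_x, dst_y):
    """Return the output direction ('E', 'W', 'S', 'N' or 'L') for a packet.

    Assumes X grows eastward and Y grows southward; adjust to the actual
    coordinate convention of the PE array.
    """
    if dst_x > cur_x:
        return 'E'        # correct the X coordinate first
    if dst_x < cur_x:
        return 'W'
    if dst_y > cur_y:
        return 'S'        # then correct the Y coordinate
    if dst_y < cur_y:
        return 'N'
    return 'L'            # both match: deliver to the local processing unit
```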
According to an embodiment of the present invention, each routing node obtains the instruction state of each instruction in the instruction slot of its local processing unit through the ordinary data-routing channel, in order to determine the supply priority of the head packet of each buffer queue holding data destined for the local processing unit, based on the number of operands that must still arrive before the instruction in the local processing unit that consumes the operand in the head packet reaches the ready state.
Since the data-routing channel often has to carry other traffic, the instruction state may not be updated promptly over it. Therefore, to further improve overall execution efficiency, according to an embodiment of the present invention a dedicated interconnect bus is provided between each processing unit and its directly connected routing node, in addition to the data-routing channel, and is used only to synchronize instruction state between the routing node and the local processing unit. The dedicated interconnect bus connects the instruction slot on the processing unit with the instruction state statistics table on the routing node and is mainly responsible for synchronizing instruction state information between them: at each time point, the processing unit synchronizes the operand arrival state of the instructions in its instruction slot to the instruction state statistics table on the routing node over the dedicated bus. If the number of instruction slots is N, the bus width is 2N bits, with the state of each instruction represented by a 2-bit code (for example, 00: the instruction is ready; 01: the instruction needs one more operand to become ready; 10: the instruction needs two more operands; 11: the instruction needs three more operands).
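The 2-bit-per-slot encoding can be illustrated with a short sketch. Only the encoding itself (2N bits for N slots, codes 00/01/10/11 as above) comes from the description; the function names are hypothetical.

```python
def pack_states(states):
    """Pack per-slot states (each 0..3 = operands still missing, 0 = ready)
    into one 2N-bit integer, slot 0 in the least significant bits."""
    word = 0
    for slot, s in enumerate(states):
        word |= (s & 0b11) << (2 * slot)
    return word

def unpack_states(word, num_slots):
    """Recover the per-slot states from the 2N-bit bus word."""
    return [(word >> (2 * slot)) & 0b11 for slot in range(num_slots)]

# Example: slot 0 ready, slot 1 needs one operand, slot 2 needs three.
assert unpack_states(pack_states([0b00, 0b01, 0b11]), 3) == [0, 1, 3]
```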
According to an embodiment of the present invention, an instruction state statistics table is maintained in the routing node. As shown in fig. 7, the table records the instruction index Idx and the instruction state of each instruction in the instruction slot of the local processing unit, the instruction state including the number of operands that must still arrive before the instruction reaches the ready state. Preferably, the table stores the instruction index Idx and that operand count, and the count is used directly as the supply priority of the corresponding head packet in the buffer queue. Preferably, the supply priority value may be set equal to, or positively correlated with, the number of operands the instruction still needs before becoming ready; for example, if instructions still need 3, 2 and 1 operands, the supply priorities of the corresponding packets are 3, 2 and 1 respectively. Under this convention, a smaller supply priority value means a higher priority. It should be understood, however, that this is a preferred rather than exclusive choice: the operand count and the supply priority value may instead be negatively correlated (for example, instructions still needing 3, 2 and 1 operands giving packet supply priorities of 1, 2 and 3 respectively), in which case a larger supply priority value means a higher priority. Referring to fig. 8, the instruction state statistics table consists of two columns: the first is the instruction index Idx and the second is the supply priority (Priority) of the packet destined for that instruction; the size of the table is positively correlated with the number of instructions the instruction slot of a computing unit can hold. The table tracks in real time the operand arrival state of each instruction in the instruction slot of the processing unit connected to the current router and assigns each instruction a supply priority according to the number of operands it still needs before reaching the ready state. Since an instruction usually takes at most 3 source operands, the supply priority can take the values 0, 1, 2 and 3, with priority 1 > 2 > 3; the value 0 means the instruction is already in the ready state and needs no further operands, so no supply priority is needed for it. It should be understood that this upper limit on the number of operands is merely illustrative; in some implementation scenarios an instruction may take 4, 5 or more source operands, and the range of supply priority values can be adjusted accordingly.
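A minimal sketch of such a table follows, under the convention above that the supply priority value equals the number of operands still missing (so 1 > 2 > 3 in urgency, and 0 means already ready); the class and method names are illustrative, not the patent's.

```python
class InstructionStateTable:
    """Maps instruction index Idx -> operands still missing (= supply priority)."""

    def __init__(self, num_slots):
        # One entry per instruction slot; assume 3 operands missing at reset.
        self.missing = {idx: 3 for idx in range(num_slots)}

    def sync(self, idx, operands_missing):
        # Updated whenever the PE synchronizes state over the dedicated bus.
        self.missing[idx] = operands_missing

    def supply_priority(self, idx):
        # Smaller non-zero value = more urgent; 0 = instruction already ready.
        return self.missing[idx]
```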
The principle by which the present invention schedules packets according to supply priority is explained below in connection with fig. 9, using an example that spans three cycles.
Cycle 1:
At this time point, the dedicated arbitration controller looks up in the instruction state statistics table the supply priorities of the instructions corresponding to the head packets of the four output buffer queues (East, West, South, North); these are 3, 2, 1 and 3 respectively. Since priority 1 > 2 > 3, the packet of the buffer queue holding the data arriving from the south S is selected for scheduling, i.e. the head packet of buffer queue S is sent to the local processing unit first.
Cycle 2:
At this time point, the supply priorities of the instructions corresponding to the head packets of the four output buffer queues (East, West, South, North) are 3, 2, 3 and 3 respectively. Since priority 2 > 3, the packet of the buffer queue holding the data arriving from the west W is selected for scheduling, i.e. the head packet of buffer queue W is sent to the local processing unit first.
Cycle 3:
At this time point, the supply priorities of the instructions corresponding to the head packets of the four output buffer queues (East, West, South, North) are all 3. In this case the RR polling arbitration mechanism is used, and the packets of the buffer queues for east E, west W, south S and north N are scheduled in turn, i.e. one round of polling arbitration sends the head packets of buffer queues E, W, S and N to the local processing unit in sequence. It should be understood, however, that when the supply priorities of the head packets of all buffer queues holding data destined for the local processing unit are equal, other schemes may also be used, for example one round of random scheduling that sends out the head packet of a randomly chosen buffer queue first.
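Putting the pieces together, the following hedged Python sketch reproduces the selection rule of this worked example: a smaller non-zero priority value wins, and the tie-breaking round of polling is simplified here to taking the first tied direction. The Packet type and its inst_idx field are hypothetical names for whatever instruction index the packet carries.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Packet:
    inst_idx: int  # index of the instruction in the local PE that consumes it

def sac_select(queues, supply_priority):
    """queues: dict direction -> deque of Packet.
    supply_priority: function inst_idx -> operands still missing (0..3).
    Returns the direction whose head packet goes to the local PE, or None."""
    candidates = {d: supply_priority(q[0].inst_idx)
                  for d, q in queues.items() if q}
    if not candidates:
        return None
    best = min(candidates.values())
    winners = [d for d, p in candidates.items() if p == best]
    # On a tie the patent performs one round of polling arbitration; this
    # sketch simply takes the first tied direction for brevity.
    return winners[0]

# Cycle 1 of the worked example: priorities E=3, W=2, S=1, N=3 -> South wins.
q = {d: deque() for d in "EWSN"}
for d, idx in zip("EWSN", range(4)):
    q[d].append(Packet(idx))
prio = {0: 3, 1: 2, 2: 1, 3: 3}
assert sac_select(q, prio.get) == 'S'
```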
Embodiment 2:
Embodiment 2 differs from embodiment 1 in that in embodiment 2 the plurality of processing units in the computing device are interconnected by the network on chip in a three-dimensional Mesh structure, so the set of directions grows beyond east, west, south and north to include the up and down directions. Accordingly, separate buffer queues are also maintained for the data arriving from above and from below. During route arbitration, the supply priorities of the head packets of the buffer queues holding data destined for the local processing unit from the east, west, south, north, up and down directions are determined, and the head packet with the highest supply priority is sent to the local processing unit first. Other implementation details are similar to those of embodiment 1 and are not repeated here.
Although the present invention takes only the two-dimensional Mesh and three-dimensional Mesh as examples, it should be understood that, where the technical principles do not conflict, the technical solution of the present application still applies to any other network-on-chip topology, such as the two-dimensional Torus, three-dimensional Torus, and the like.
Embodiment 3:
According to an embodiment of the present invention, there is also provided a routing method in a computing device with a dataflow architecture, the computing device comprising a plurality of processing units and a plurality of routing nodes, each processing unit being directly connected to one routing node and serving as the local processing unit of that routing node, the routing nodes being interconnected, and the processing units being connected through a network on chip containing the routing nodes. The routing method comprises: at each routing node, receiving data destined for the local processing unit from each direction and maintaining a separate buffer queue for the data from each direction, the buffer queues being blocking queues; and, at each routing node, determining the supply priority of the packet at the head of each buffer queue holding data destined for the local processing unit, and preferentially sending the head packet with the highest supply priority to the local processing unit, where the supply priority is determined by the number of operands that must still arrive before the instruction in the local processing unit that consumes the operand in the head packet reaches the ready state. According to an embodiment of the present invention, the routing method further comprises: maintaining an instruction state statistics table at each routing node, the table recording the instruction index and the instruction state of each instruction in the instruction slot of the local processing unit, the instruction state including the number of operands that must still arrive before the instruction reaches the ready state. According to an embodiment of the present invention, the routing method further comprises: parsing the instruction index carried by the head packet of a queue, and looking up in the instruction state statistics table the number of operands that must still arrive before the instruction with that index reaches the ready state, so as to determine the supply priority of the head packet.
According to one embodiment of the present invention, the instruction states in the instruction state statistics table are synchronized from the local processing unit to the routing node via a dedicated interconnect bus, the bus being used only to synchronize instruction state between the routing node and the local processing unit connected to it. According to an embodiment of the invention, the method further comprises: when the supply priorities of the head packets of the buffer queues holding data destined for the local processing unit are all equal, performing one round of scheduling according to a polling-based route arbitration mechanism. According to an embodiment of the present invention, the routing method further comprises: at each routing node, scheduling data not destined for the local processing unit according to a polling-based route arbitration mechanism.
The method of this embodiment may be applied to the computing device of embodiment 1 or embodiment 2; the implementation details have been described for the computing device and are not repeated here.
It should be noted that, although the steps are described in a specific order, the steps are not necessarily performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order as long as the required functions are achieved.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that holds and stores the instructions for use by the instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical encoding device, such as punch cards or in-groove raised structures having instructions stored thereon, and any suitable combination of the foregoing.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A routing method for use in a computing device with a dataflow architecture, the computing device comprising a plurality of processing units and a plurality of routing nodes, each processing unit being directly connected to one routing node and serving as the local processing unit of that routing node, the routing nodes being interconnected, the routing method comprising:
at each routing node, receiving data destined for the local processing unit from each direction and maintaining a corresponding buffer queue for the data from each direction, wherein the buffer queues are blocking queues;
at each routing node, determining the supply priority of the packet at the head of each buffer queue holding data destined for the local processing unit, and selecting, according to the supply priorities, the head packet of one of the queues to send to the local processing unit, wherein the supply priority is related to the number of operands that must still arrive before the instruction in the local processing unit that consumes the operand in the head packet reaches the ready state.
2. The routing method according to claim 1, further comprising:
maintaining an instruction state statistics table at each routing node, the table recording the instruction index and the instruction state of each instruction in the instruction slot of the local processing unit, the instruction state including the number of operands that must still arrive before the instruction reaches the ready state.
3. The routing method according to claim 2, further comprising:
parsing the instruction index carried by the packet at the head of a queue, and looking up in the instruction state statistics table the number of operands that must still arrive before the instruction with that index reaches the ready state, so as to determine the supply priority of the head packet.
4. The routing method according to claim 2, wherein the instruction states in the instruction state statistics table are synchronized from the local processing unit to the routing node via a dedicated interconnect bus, the bus being used only to synchronize instruction state between the routing node and the local processing unit connected to it.
5. The routing method according to claim 1, wherein the method further comprises:
when the supply priorities of the packets at the head of the buffer queues holding data destined for the local processing unit are all equal, performing one round of scheduling according to a polling-based route arbitration mechanism.
6. The routing method according to one of claims 1 to 5, wherein the routing method further comprises:
at each routing node, scheduling data not destined for the local processing unit according to a polling-based route arbitration mechanism.
7. A computing device comprising a plurality of processing units and a plurality of routing nodes, each processing unit being directly connected to one routing node and serving as the local processing unit of that routing node, the routing nodes being interconnected in a predetermined topology;
each routing node comprising a routing algorithm module and a dedicated arbitration controller for scheduling the data destined for the local processing unit from each direction;
the routing algorithm module being configured to: determine, according to a routing algorithm, the direction to which currently received data should be sent, collect the data destined for the local processing unit from each direction, and maintain a corresponding buffer queue for the data from each direction, wherein the buffer queues are blocking queues;
the dedicated arbitration controller being configured to: determine the supply priority of the packet at the head of each buffer queue holding data destined for the local processing unit, and select, according to the supply priorities, the head packet of one of the queues to send to the local processing unit, wherein the supply priority is related to the number of operands that must still arrive before the instruction in the local processing unit that consumes the operand in the head packet reaches the ready state.
8. The computing device of claim 7, wherein each routing node comprises, for each direction, a polling arbitration controller for sending data to the other directions, the polling arbitration controller being configured to: schedule data not destined for the local processing unit according to a polling-based route arbitration mechanism.
9. The computing device of claim 8, wherein an instruction state statistics table is maintained in the routing node, the table recording the instruction index and the instruction state of each instruction in the instruction slot of the local processing unit, the instruction state including the number of operands that must still arrive before the instruction reaches the ready state.
10. The computing device of claim 9, wherein a dedicated interconnect bus is provided between each processing unit and its directly connected routing node, for synchronizing instruction state between the routing node and the local processing unit.
CN202210461301.9A 2022-04-28 2022-04-28 Routing method used in computing equipment of data flow architecture Active CN114760241B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210461301.9A CN114760241B (en) 2022-04-28 2022-04-28 Routing method used in computing equipment of data flow architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210461301.9A CN114760241B (en) 2022-04-28 2022-04-28 Routing method used in computing equipment of data flow architecture

Publications (2)

Publication Number Publication Date
CN114760241A (en) 2022-07-15
CN114760241B (en) 2023-06-02

Family

ID=82333348

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210461301.9A Active CN114760241B (en) 2022-04-28 2022-04-28 Routing method used in computing equipment of data flow architecture

Country Status (1)

Country Link
CN (1) CN114760241B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101383712A (en) * 2008-10-16 2009-03-11 电子科技大学 Routing node microstructure for on-chip network
US20150043575A1 (en) * 2013-08-07 2015-02-12 Netspeed Systems Supporting multicast in noc interconnect
CN104994026A (en) * 2015-05-27 2015-10-21 复旦大学无锡研究院 Router switch applied to network-on-chip for supporting hard real-time communication
CN107193761A (en) * 2016-03-15 2017-09-22 北京忆芯科技有限公司 The method and apparatus of queue priority arbitration
CN108616465A (en) * 2018-03-22 2018-10-02 天津大学 Support the mobile ad-hoc network method for routing of carrying store-and-forward mechanism
CN109039949A (en) * 2018-07-24 2018-12-18 合肥工业大学 Dynamic radio media access control method priority-based in wireless network-on-chip
US20210306910A1 (en) * 2020-03-27 2021-09-30 Mitsubishi Electric Research Laboratories, Inc. Scheduling Data Traffic in Wireless Time Sensitive Networks
US20210389979A1 (en) * 2020-06-15 2021-12-16 Andes Technology Corporation Microprocessor with functional unit having an execution queue with priority scheduling
CN113132265A (en) * 2021-04-16 2021-07-16 武汉光迅信息技术有限公司 Multi-stage scheduling method and device for multi-path Ethernet

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
冯煜晶等: 《基于网络负载特征感知的数据流指令调度机制研究》 ("Research on a dataflow instruction scheduling mechanism based on awareness of network load characteristics"), 《高技术通讯》 (High Technology Letters) *
胡东伟等: 《时钟及面积优化的可配置片上网络路由器》 ("A configurable network-on-chip router optimized for clock and area"), 《西安电子科技大学学报》 (Journal of Xidian University) *

Also Published As

Publication number Publication date
CN114760241B (en) 2023-06-02

Similar Documents

Publication Publication Date Title
Mukherjee et al. The Alpha 21364 network architecture
US9009648B2 (en) Automatic deadlock detection and avoidance in a system interconnect by capturing internal dependencies of IP cores using high level specification
US8032892B2 (en) Message passing with a limited number of DMA byte counters
EP2406723B1 (en) Scalable interface for connecting multiple computer systems which performs parallel mpi header matching
US10931588B1 (en) Network switch with integrated compute subsystem for distributed artificial intelligence and other applications
JP4763405B2 (en) Network-on-chip semi-automatic communication architecture for data flow applications
US7802025B2 (en) DMA engine for repeating communication patterns
CN112084027B (en) Network-on-chip data transmission method, device, network-on-chip, equipment and medium
Ouyang et al. LOFT: A high performance network-on-chip providing quality-of-service support
Oveis-Gharan et al. Efficient dynamic virtual channel organization and architecture for NoC systems
Ausavarungnirun et al. A case for hierarchical rings with deflection routing: An energy-efficient on-chip communication substrate
Chakraborty et al. Designing scalable and high-performance MPI libraries on Amazon elastic fabric adapter
EP1631906B1 (en) Maintaining entity order with gate managers
CN113010464A (en) Data processing apparatus and device
CN114760241B (en) Routing method used in computing equipment of data flow architecture
US11520726B2 (en) Host connected computer network
US12010033B2 (en) On chip router
CN111711574B (en) Ultra-high order single-cycle message scheduling method and device
Stegmeier Real-time analysis of MPI programs for NoC-based many-cores using time division multiplexing
US9336172B2 (en) Parallel computer system, data transfer device, and method for controlling parallel computer system for performing arbitration
Escobar et al. Performance evaluation of a Network on a Chip router using SystemC and TLM 2.0
Hsu Performance measurement and hardware support for message passing in distributed memory multicomputers
Daniel et al. A router architecture for flexible routing and switching in multihop point-to-point networks
Oveis-Gharan et al. Packet-based Adaptive Virtual Channel Configuration for NoC Systems
Al-Lami et al. Communication impact on non-contiguous allocation strategies for 2-D mesh multicomputer systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant