CN114844757B

CN114844757B - Network-on-chip design method for distributed parallel operation algorithm

Info

Publication number: CN114844757B
Application number: CN202210174904.0A
Authority: CN
Inventors: 黄乐天; 邓子阳
Original assignee: Yangtze River Delta Research Institute of UESTC Huzhou
Current assignee: Yangtze River Delta Research Institute of UESTC Huzhou
Priority date: 2022-02-24
Filing date: 2022-02-24
Publication date: 2023-11-24
Anticipated expiration: 2042-02-24
Also published as: CN114844757A; US20230269200A1

Abstract

The application relates to the technical field of computer algorithms, and in particular to a network-on-chip design method for distributed parallel computing algorithms. According to the characteristics of the distributed parallel computing algorithm, the network-on-chip is divided into two layers: a unicast network and a multicast network. The unicast network implements point-to-point transmission between nodes and delivers to each operation node, in unicast mode, the independent operation data that the node requires. The multicast network is a customized multicast network oriented to the distributed parallel computing algorithm and is used to transmit common operation data to all operation nodes. Combining the unicast network and the multicast network allows data packets to be transmitted efficiently in the network. By designing a multicast tree transmission architecture oriented to the distributed parallel computing algorithm, only a bidirectional replication node or a receiving node is placed at each operation node. This differs from a conventional multicast network-on-chip, in which every node is provided with both a multicast sending module and a multicast receiving module, and it minimizes the use of on-chip resources.

Description

Network-on-chip design method for distributed parallel operation algorithm

Technical Field

The application relates to the technical field of computer algorithms, in particular to a network-on-chip design method for a distributed parallel operation algorithm.

Background

Distributed parallel computing can be defined as a class of algorithms in which the same operation steps are applied to different data, there is no data dependence among those data during computation, and the operations can therefore be executed in parallel. Typical examples include distance calculations between the coordinates of two coordinate vectors, various matrix multiplications, and convolution operations in deep learning algorithms.

Distributed parallel operations are computationally dense and decentralized, and the operations on different data are independent of one another. Because such algorithms involve a large number of operations, their actual efficiency on current general-purpose processors (CPUs) and general-purpose graphics processing units (GPGPUs) is very low. This patent therefore designs a network-on-chip architecture for these operations and accelerates them with customized hardware.

The most common way to design a hardware accelerator for distributed parallel operations is to use multiple operation units, each responsible for part of the computation. All units run in parallel, and their results are then combined into the final result. The biggest problem with this approach arises when the results are gathered and written to the storage unit: because there are many operation units, the combinational logic that decodes and selects the control signals of the storage unit becomes too large, and timing suffers. This lowers the maximum clock frequency and therefore degrades overall performance.

To address the excessive combinational logic delay when many operation units work in parallel, industry usually interconnects the operation units with a network-on-chip rather than a bus or a switching matrix. Compared with a bus, an on-chip many-core system with a networked communication structure supports concurrent data transmission, has a topology that is easier to scale, and offers greater communication bandwidth. A networked communication structure also provides abundant redundant resources, which gives more options in reliability design. The network-on-chip is widely studied and used as a representative of networked communication structures. Fig. 1 shows the 2D-Mesh structure commonly used for networks-on-chip. It mainly comprises routers, links, and network interfaces, and a processing unit may comprise a memory interface, a general-purpose processor, a hardware acceleration unit, an IO port, and the like.

Transmission in a network-on-chip mainly takes the form of sending and receiving packets. The router is the main component of the network-on-chip; it is responsible for temporarily storing and directing data packets and can be understood as a transfer station for data transmission in the network. Links connect the components of the network-on-chip into a connected network; packets are sent and received over the connection between the output register stage of the upstream router and the input buffer of the downstream router. The network interface packs the data of the processing unit into packets and sends them, and it unpacks the packets delivered by the router and passes the data to the processing unit.

A network-on-chip data packet is sent by a source node and may have one or more destination nodes. Transmission to a single destination node is called unicast; transmission to multiple destination nodes is called multicast. Because a multicast packet must carry the positions of several destination nodes, its format is more complex than that of a unicast packet. One common multicast strategy performs multicast by unicast, that is, it sends unicast packets to the destination nodes one after another. This scheme is simple to implement, but it greatly increases network traffic. Another approach is virtual circuit tree multicasting (VCTM), which adds a routing table to each router. Before a multicast starts, configuration packets for that multicast are sent in unicast mode to the routing tables of the corresponding nodes. When the multicast packet is then sent, each router uses the routing table entry with the matching index ID to determine whether to branch and in which directions. The problem with such general-purpose multicast networks is that they increase the packet load in the network and greatly increase the wiring resource consumption of the network-on-chip.
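The traffic cost of multicast by repeated unicast can be illustrated with a rough count of link traversals. The sketch below assumes an n×n mesh with shortest-path routing and a source in one corner; the function names and the 4×4 size are illustrative assumptions and do not come from the application.

```python
def unicast_link_traversals(n, source=(0, 0)):
    """Link traversals when the source sends one unicast packet to every
    node of an n x n mesh along a shortest path (Manhattan distance)."""
    sx, sy = source
    return sum(abs(x - sx) + abs(y - sy) for x in range(n) for y in range(n))


def tree_link_traversals(n):
    """Link traversals when the packet is replicated along a spanning tree:
    each of the other n*n - 1 nodes is reached over exactly one link."""
    return n * n - 1


unicast_hops = unicast_link_traversals(4)
tree_hops = tree_link_traversals(4)
print(unicast_hops, tree_hops)
```

For the 4×4 case this gives 48 link traversals for repeated unicast against 15 for replication along a tree.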

Current general-purpose processors (CPUs) and general-purpose graphics processing units (GPGPUs) both have difficulty meeting the real-time requirements of distributed parallel computing algorithms. Custom hardware therefore needs to be designed around the characteristics of these algorithms.

By designing a customized network-on-chip oriented to this class of algorithms, the application solves the low clock frequency that results from excessive combinational logic delay in the bus interconnect of a traditional hardware accelerator with multiple operation units. It also solves the low communication efficiency and high hardware resource consumption that result when unicast and multicast share a single network in a general-purpose network-on-chip.

The network-on-chip targets distributed parallel computing algorithms, which share a similar operation structure: the computation can be split into multiple groups. Typical examples are: computing all distances between the coordinates of two coordinate vectors, where the calculations between one coordinate M and the different coordinates N are carried out in turn; multiplying two matrices, where one row P is multiplied by the different columns Q; convolving the same convolution kernel with different matrices in a convolution operation; and so on. The characteristic that these algorithms use the same data in many calculations corresponds to the multicast scenario of a network-on-chip, in which the same operation data only needs to be sent from the data receiving node to each operation node. In traditional multicast methods, by contrast, every node can send multicast packets, which occupies a large amount of on-chip resources in implementation and makes hardware resources redundant.
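The split between common data and per-node data can be sketched for the matrix example, where one row P is multiplied by different columns Q. This is a minimal illustration using plain Python lists; the function names and the sample values are hypothetical and are not taken from the application.

```python
def split_row_times_columns(row_p, columns_q):
    """Split 'row P times every column Q' into one task per operation node.

    Returns the common data (sent once by multicast) and a list holding
    the independent data of each node (sent by unicast)."""
    multicast_data = list(row_p)                      # identical for every node
    unicast_data = [list(col) for col in columns_q]   # one column per node
    return multicast_data, unicast_data


def node_compute(multicast_data, unicast_item):
    """Work done independently by one operation node: a dot product."""
    return sum(p * q for p, q in zip(multicast_data, unicast_item))


row_p = [1, 2, 3]
columns_q = [[1, 0, 0], [0, 1, 0], [1, 1, 1]]
shared, per_node = split_row_times_columns(row_p, columns_q)
results = [node_compute(shared, item) for item in per_node]
print(results)
```

No node needs another node's column or result, which is why the per-node work can proceed in parallel once the shared row has arrived.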

To save on-chip resources while preserving as far as possible the multicast efficiency of distributed parallel computing algorithms implemented on a network-on-chip, the application proposes a new network structure that combines a unicast network with a directional multicast network, and designs a multicast network oriented to distributed parallel computing algorithms on the basis of a common mesh network. The multicast network is directional: multicast data is sent to each operation node with the data input node as the source. By designing a tree replication circuit unit for this multicast scenario, the application achieves fast transmission of multicast data without consuming additional on-chip resources, and thus effectively improves the overall communication efficiency of the network.

Disclosure of Invention

(I) Technical problem to be solved

The application addresses two problems: the low clock frequency of a traditional hardware accelerator with multiple operation units, caused by excessive combinational logic delay in the bus interconnect, and the low network communication efficiency and high hardware resource consumption of the network. To this end, it provides a network-on-chip design method for distributed parallel operation algorithms.

(II) Technical scheme

According to the characteristics of the distributed parallel computing algorithm, the network-on-chip is divided into two layers: a unicast network and a multicast network. The unicast network implements point-to-point transmission between nodes and transmits the independent operation data required by each operation node to that node in unicast mode. The multicast network is a customized multicast network oriented to the distributed parallel computing algorithm and is used to transmit common operation data to all operation nodes. Combining the unicast network and the multicast network allows data packets to be transmitted efficiently in the network.

As a preferred technical scheme, the multicast network comprises two kinds of nodes: bidirectional replication nodes and receiving nodes. The next stage of each bidirectional replication node is connected to two nodes, each of which is a bidirectional replication node or a receiving node, and all nodes in the multicast network together form a tree. Each multicast operation propagates from the topmost node of the tree to all of its lowest-level nodes. Designing the bidirectional replication nodes and receiving nodes appropriately ensures good performance with low resource usage.

As a preferred technical scheme, a bidirectional replication node decodes and stores the data in the multicast packet sent by the upper stage and, at the same time, copies the packet and sends it to the two nodes at the lower stage. The nodes at the last stage are receiving nodes, which receive and decode the multicast packet and store its data.
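The tree propagation described above can be sketched as follows. The sketch assumes a full binary tree numbered heap-style (node i has children 2i+1 and 2i+2) and a node count of the form 2^k - 1, so that every replication node has exactly two children; the numbering scheme and function names are assumptions made for illustration only.

```python
from collections import deque


def build_multicast_tree(num_nodes):
    """Children of each node in a heap-numbered binary tree.

    Nodes with children act as replication nodes; nodes with an empty
    child list act as receiving nodes."""
    return {
        i: [c for c in (2 * i + 1, 2 * i + 2) if c < num_nodes]
        for i in range(num_nodes)
    }


def multicast(children, payload):
    """Propagate one packet from the root (node 0) to every node."""
    stored = {}          # data each node has decoded and stored
    stage_of = {0: 0}    # tree stage at which each node sits
    queue = deque([0])
    while queue:
        node = queue.popleft()
        stored[node] = payload                 # decode and store locally
        for child in children[node]:           # copy to the lower stage
            stage_of[child] = stage_of[node] + 1
            queue.append(child)
    return stored, stage_of


children = build_multicast_tree(7)
stored, stage_of = multicast(children, "common-data")
receivers = [n for n, c in children.items() if not c]
print(receivers, max(stage_of.values()))
```

With seven nodes, nodes 0 to 2 replicate and nodes 3 to 6 only receive, and every node holds the payload after two forwarding steps.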

As a preferred technical solution, the whole network-on-chip operation flow is as follows:

S1. When an algorithm operation starts, the data input node receives the multicast data and unicast data sent by a sensor. The node packs the multicast data and performs a multicast operation through the multicast network, sending the multicast data to every operation node; it packs the unicast data in sequence and sends each packet to the corresponding operation node through a unicast operation in the unicast network.

S2. Each operation node starts computing after it has received its multicast data and unicast data, and during computation it continuously packs its results and sends them to the storage node until all of the distributed parallel operations are complete. The RISC-V processor node accesses the stored data through the unicast network.
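Steps S1 and S2 can be sketched as a small software model, using the distance calculation between one coordinate and several others as the workload. The packet dictionaries, the function names, and the use of a plain dictionary as the storage node are illustrative assumptions; the sketch models data flow only, not timing or routing.

```python
import math


def run_flow(common_coord, per_node_coords):
    """Model S1 and S2 for a distance workload."""
    # S1: the data input node packs the common data once (multicast) and
    # packs one unicast packet per operation node.
    multicast_packet = {"payload": common_coord}
    unicast_packets = [
        {"dest": i, "payload": coord} for i, coord in enumerate(per_node_coords)
    ]

    # S2: each operation node computes once it holds both packets and
    # sends its result to the storage node.
    storage_node = {}
    for packet in unicast_packets:
        result = math.dist(multicast_packet["payload"], packet["payload"])
        storage_node[packet["dest"]] = result
    return storage_node


def processor_read(storage_node, node_id):
    """Model the RISC-V processor node reading one stored result."""
    return storage_node[node_id]


storage = run_flow((0.0, 0.0), [(3.0, 4.0), (6.0, 8.0), (0.0, 1.0)])
print(processor_read(storage, 0))
```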

(III) Beneficial effects

The application has the beneficial effects that:

1. The network-on-chip is oriented to distributed parallel computing algorithms and provides a network-on-chip hardware acceleration scheme for this class of algorithms.

2. By designing an independent multicast network, the network-on-chip separates multicast and unicast traffic, which solves the heavy traffic and frequent congestion of a single shared network.

3. By designing a multicast tree transmission architecture oriented to distributed parallel computing algorithms, only a bidirectional replication node or a receiving node is placed at each operation node. This differs from a traditional multicast network-on-chip, in which every node has both a multicast sending module and a multicast receiving module, and it minimizes the use of on-chip resources. Because the number of nodes attached at each stage of the tree grows exponentially, the total delay for a multicast packet to travel from the topmost stage to the lowest stage is also effectively reduced, which improves the real-time performance of algorithms running on the network-on-chip.
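The effect of the exponential growth per stage can be made concrete by counting stages. The sketch below assumes a binary tree in which stage k holds up to 2^k nodes and compares it with a hypothetical daisy chain that passes the packet through the nodes one after another; both function names and the equal-delay-per-stage view are assumptions for illustration.

```python
def tree_stages(num_nodes):
    """Number of stages needed by a binary tree to hold num_nodes nodes."""
    stages = 0
    reached = 0
    while reached < num_nodes:
        reached += 2 ** stages   # stage k can hold up to 2**k nodes
        stages += 1
    return stages


def chain_stages(num_nodes):
    """Number of stages if the nodes were linked in a single chain."""
    return num_nodes


for n in (15, 63):
    print(n, tree_stages(n), chain_stages(n))
```

Under these assumptions, 15 nodes need 4 tree stages and 63 nodes need 6, whereas a chain would need 15 and 63 stages, respectively.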

Drawings

In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a typical block diagram of a Mesh network-on-chip;

FIG. 2 is a diagram of the two-layer network-on-chip architecture;

FIG. 3 is the microarchitecture of the bidirectional replication node.

Detailed Description

The network-on-chip design method for a distributed parallel operation algorithm is further described below with reference to the accompanying drawings and an embodiment:

Further, the multicast network comprises two kinds of nodes: bidirectional replication nodes and receiving nodes. The next stage of each bidirectional replication node is connected to two nodes, each of which is a bidirectional replication node or a receiving node, and all nodes in the multicast network together form a tree. Each multicast operation propagates from the topmost node of the tree to all of its lowest-level nodes. Designing the bidirectional replication nodes and receiving nodes appropriately ensures good performance with low resource usage.

Further, a bidirectional replication node decodes and stores the data in the multicast packet sent by the upper stage and, at the same time, copies the packet and sends it to the two nodes at the lower stage. The nodes at the last stage are receiving nodes, which receive and decode the multicast packet and store its data.

Further, the whole network-on-chip operation flow is as follows:

Working principle: as shown in Fig. 2, the unicast network adopts an n×n Mesh topology. The unicast network contains the following kinds of nodes:

1. Data input node. It receives newly detected data sent by a sensor or by the upper level of the network, packs the data into unicast packets and multicast packets as appropriate, and sends the packets to the corresponding operation nodes through the unicast network and the multicast network, respectively.

2. Node containing an operation unit. After receiving the unicast and multicast packets addressed to it, the node unpacks them and stores their data. The operation unit then fetches the data from the multicast packet and the unicast packet, performs the computation, packs the result, and sends it to the corresponding storage unit.

3. Node that only forwards packets. It forwards packets in the unicast network in the X direction or the Y direction according to the destination node, and contains no unpacking or data storage unit.

4. Node containing a storage unit. It stores all valid results and accepts requests from other nodes; after receiving a request, it returns a packet containing the requested data to the requesting node.

5. Node containing a RISC-V processor. The processor executes the parts of the algorithm that are not handled by the operation units of the network-on-chip. For example, after the network-on-chip has performed the convolution operations of a deep learning algorithm, the RISC-V processor can fetch the data from the storage node to perform subsequent operations such as pooling and fully connected layers.
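Forwarding in the X direction or the Y direction according to the destination node can be sketched as dimension-order routing. The text does not state which direction is resolved first, so the X-first order below is an assumption, as are the coordinate convention and the function name.

```python
def xy_route(src, dst):
    """Coordinates visited from src to dst in a mesh, assuming the packet
    first moves along X until the column matches, then along Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1   # one hop in the X direction
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1   # one hop in the Y direction
        path.append((x, y))
    return path


path = xy_route((0, 0), (2, 1))
print(path)
```

Each forwarding node only needs to compare its own coordinates with the destination, which is consistent with such nodes having no unpacking or data storage unit.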

The multicast network comprises bidirectional replication nodes and receiving nodes. The next stage of each bidirectional replication node is connected to two nodes, each a bidirectional replication node or a receiving node, and all nodes in the multicast network together form a tree. Each multicast operation propagates from the topmost node of the tree to all of its lowest-level nodes. The microarchitecture of the bidirectional replication node is shown in Fig. 3 and comprises two parts: control logic and a dual-port cache. When the control logic receives the start_in signal, port B of the dual-port cache at the upper stage has begun to send data. The control logic of the current stage then drives write addresses and the write enable signal to port A of its dual-port cache and stores the data arriving from the upper stage until the upper stage sends the finish_in signal, at which point all of the data has been stored. The control logic of the current stage then sends the start_out signal and begins driving read addresses and the read enable signal to port B of the dual-port cache; once all of the data received from the upper stage has been sent on, it sends the finish_out signal. After the current stage has finished the multicast operation, the control logic issues read operations on port A again, reads out the valid data in the multicast packet, and invokes the operation unit to complete the computation together with the data from the unicast packet.
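The handshake sequence of one node can be sketched as a behavioural model. This is not RTL: the class name, the Python list standing in for the dual-port cache, and the event trace are assumptions made for illustration, and the two lower-stage nodes are served one after the other here although the hardware feeds both from the same port B read stream.

```python
class ReplicationNode:
    """Behavioural model of one multicast-tree node (not a hardware design)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)   # empty for a receiving node
        self.cache = []                  # stands in for the dual-port cache
        self.events = []                 # trace of handshake signals

    def receive(self, words):
        # start_in: the upper stage has begun sending; store via port A.
        self.events.append("start_in")
        self.cache = list(words)
        # finish_in: every word of the packet is now stored.
        self.events.append("finish_in")
        if self.children:
            # start_out: read the stored words via port B and forward them.
            self.events.append("start_out")
            outgoing = list(self.cache)
            for child in self.children:
                child.receive(outgoing)
            self.events.append("finish_out")

    def read_payload(self):
        # After forwarding, port A is read again to feed the operation unit.
        return list(self.cache)


left = ReplicationNode("left")
right = ReplicationNode("right")
root = ReplicationNode("root", children=[left, right])
root.receive([10, 20, 30])
print(root.events, left.read_payload())
```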

The above examples are merely illustrative of the preferred embodiments of the present application and are not intended to limit the spirit and scope of the present application, and various modifications and improvements made by those skilled in the art to the technical solutions of the present application should fall within the scope of protection of the present application, and the technical content claimed by the present application is fully described in the claims.

Claims

1. A network-on-chip design method for a distributed parallel operation algorithm, characterized in that: the network-on-chip is divided into two layers according to the distributed parallel computing algorithm of the network-on-chip, the two layers comprising a unicast network and a multicast network; the unicast network implements point-to-point transmission between nodes and transmits the independent operation data required by each operation node to that node in unicast mode; the multicast network is a customized multicast network oriented to the distributed parallel computing algorithm and is used to transmit common operation data to all operation nodes; the multicast network comprises two kinds of nodes, namely bidirectional replication nodes and receiving nodes; the next stage of each bidirectional replication node is connected to bidirectional replication nodes or receiving nodes; all nodes in the multicast network together form a tree; each multicast operation is transmitted from the topmost node of the tree to all of the lowest-level nodes of the tree; the bidirectional replication node decodes and stores the data in the multicast packet sent by the upper stage and at the same time copies the packet and sends it to the two nodes of the next stage; and the nodes of the last stage are receiving nodes, which receive and decode the multicast packet and store the data.

2. The network-on-chip design method for the distributed parallel operation algorithm as claimed in claim 1, wherein: the whole network-on-chip operation flow is as follows: