CN116383126A - Network-on-chip-based transmission method of deep neural network accelerator - Google Patents


Info

Publication number
CN116383126A
Authority
CN
China
Prior art keywords
data packet
router
multicast
network
unicast
Prior art date
Legal status
Pending
Application number
CN202310346394.5A
Other languages
Chinese (zh)
Inventor
欧阳一鸣
王佳新
孙成龙
王奇
梁华国
Current Assignee
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date
Filing date
Publication date
Application filed by Hefei University of Technology
Priority to CN202310346394.5A
Publication of CN116383126A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/163 Interprocessor communication
    • G06F 15/173 Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F 15/17356 Indirect interconnection networks
    • G06F 15/17368 Indirect interconnection networks, non-hierarchical topologies
    • G06F 15/17381 Two dimensional, e.g. mesh, torus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/50 Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Abstract

The invention discloses a network-on-chip-based transmission method for a deep neural network accelerator, applied to a two-dimensional mesh network of size M×N that consists of M×N routers, M×(N−1) processing units PE and M on-chip memory units. The router comprises input ports, output ports, an input buffer, a route calculation unit, a data packet processing unit, a data bypass, a reconfigurable crossbar, a link reconfiguration unit, and a plurality of multiplexers and demultiplexers. The invention can efficiently replicate multicast data packets and provide single-cycle-per-hop transmission for them without affecting the transmission of unicast data packets, thereby greatly reducing the number of data packets transmitted in the network on chip, lowering packet transmission delay, reducing the classification delay of the neural network, and greatly improving network performance.

Description

Network-on-chip-based transmission method of deep neural network accelerator
Technical Field
The invention belongs to the technical field of integrated circuit chip design, and in particular relates to a transmission method of a deep neural network accelerator based on a network on chip.
Background
Deep neural networks (DNNs) provide highly accurate recognition and classification results and offer great advantages in applications such as information processing, pattern recognition and computer vision, but at the cost of high computational complexity and high power consumption. Recent neural network models, such as AlexNet and VGG-16, cannot operate efficiently on existing platforms due to their relatively complex structure and large number of parameters.
Currently, common deep neural network accelerators are based on ASICs, FPGAs, CPUs and GPUs. ASICs are designed for specific applications and provide solutions for specific neural network models, but such designs lack flexibility, have no reconfigurable features, perform poorly across different applications, and incur significant design costs. FPGAs, owing to their programmable nature, are more flexible than ASIC-based deep neural network accelerators, but they are still configured for specific neural network models and applications. CPUs are designed for general-purpose computing, such as the 48-core Qualcomm Centriq 2400 and the 72-core Intel Xeon Phi; they are highly flexible and can perform complex computation. However, since neural network model computation consists mostly of simple operations, such as multiplication and addition, and requires intensive parallel computation, CPUs are not suitable as deep neural network accelerator platforms. GPUs are the most common deep neural network accelerator platforms; they possess high computational flexibility and can perform intensive parallel computation, but they also incur significant power consumption overhead.
A network-on-chip-based deep neural network accelerator is one suitable choice. The network on chip separates the computation function from the communication function of a traditional bus-based chip; it offers high communication efficiency, good scalability and high reliability, and can decouple a neural network model into data computation operations and data transmission operations. In the data computation part, the computation of different neural network models can be performed independently of the data flow. In the data transmission part, the network on chip realizes efficient data transmission. Meanwhile, the computation result of one processing unit (PE) can be transmitted directly to another processing unit through the network, reducing accesses to off-chip memory and effectively lowering transmission delay and power consumption. A network-on-chip-based deep neural network accelerator can satisfy the parallelism required by the computation and communication of a neural network model, with higher flexibility and lower power consumption.
Although network-on-chip-based deep neural network accelerators have flexible computation and communication characteristics, several technical problems remain. First, network bandwidth exhaustion: a traditional neural network accelerator adopts unicast packet transmission, which generates a large number of packets carrying the same data in the network. Second, the routing of multicast packets: a large number of multicast packets exist in the network on chip, and they must be transmitted efficiently to each destination node without causing deadlock. Third, multicast packet transmission delay: different multicast strategies may cause multicast packets to take detours or generate a large number of identical packets, increasing packet transmission delay.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art by providing a network-on-chip-based transmission method for a deep neural network accelerator, so as to reduce the frequency of data exchange between the neural network accelerator platform and external memory, provide single-cycle-per-hop transmission for multicast data packets, reduce the transmission delay of multicast data packets, and offer good scalability.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
The invention relates to a transmission method of a network-on-chip-based deep neural network accelerator, wherein the network on chip is a two-dimensional mesh network of size M×N, consisting of M×N routers, M×(N−1) processing units PE and M on-chip memory units, where M and N are integers greater than 3. The router includes: input and output ports in the east, south, west, north and local directions, an input buffer, a route calculation unit, a data packet processing unit, a data bypass, a reconfigurable crossbar, a link reconfiguration unit, and a plurality of multiplexers and demultiplexers; the multiplexers and demultiplexers are connected with the input buffer and the data bypass respectively;
the local input and output ports of the front N-1 row of routers are respectively connected with M× (N-1) processing units PE, and the local input and output ports of the M routers in the N row are respectively connected with M on-chip memory units;
data packets are divided into multicast data packets and unicast data packets. The multicast data packet includes: the packet type, the destination node address vector, the coordinates of the source node, and the payload. Each bit of the destination node vector in the multicast packet corresponds to one router: if a bit of the destination node vector is 1, the corresponding router is a destination router; if it is 0, the corresponding router is not a destination router;
the unicast data packet includes: the packet type, the coordinates of the destination node, the timestamp, the coordinates of the source node, and the payload;
the transmission method comprises the following steps:
Step 1: the upstream router, or the processing unit PE connected with the current router, sends a data packet to an input port of the current router;
Step 2: the data packet processing unit of the current router judges whether the received packet is a multicast packet or a unicast packet according to the packet type; if it is a multicast packet, steps 3–6 are performed; if it is a unicast packet, steps 7–10 are performed;
Step 3: the data packet processing unit of the current router checks whether every bit of the destination node address vector in the multicast packet is 0; if so, the multicast packet is deleted; otherwise, step 4 is executed;
Step 4: judge whether the multicast packet has reached a destination router; if so, the data packet processing unit of the current router changes the value '1' of the bit corresponding to the destination router in the destination node address vector to '0' and sends a gating signal, and step 5 is executed; otherwise, the multicast packet is sent to the reconfigurable crossbar of the current router and step 6 is executed;
Step 5: according to the gating signal, the demultiplexer connected with the data packet processing unit of the current router sends the multicast packet onto the data bypass, so that it reaches the reconfigurable crossbar and the multiplexer of the router respectively; the multiplexer forwards the multicast packet to the local output port, to be delivered to the processing unit PE connected with the router;
Step 6: the reconfigurable crossbar of the current router judges whether the input direction of the multicast packet has a matched output direction; if so, the multicast packet is sent directly to the matched output direction; if not, the router corresponding to the first bit equal to 1 in the destination node address vector is taken as the next destination router, and the multicast packet is output from the corresponding output port using the YX routing algorithm; according to the path taken by the YX-routed multicast packet, the link reconfiguration unit of the current router pairs the corresponding input and output directions for the transmission of subsequent packets;
Step 7: according to the gating signal, the demultiplexer connected with the data packet processing unit of the current router sends the unicast packet to the input buffer, which receives and stores unicast packets in first-in-first-out order;
Step 8: the reconfigurable crossbar of the current router receives the unicast packet, and the route calculation unit of the current router computes the output direction of the unicast packet with the YX routing algorithm according to the destination node address in the packet;
Step 9: the reconfigurable crossbar of the current router judges whether the input direction of the unicast packet has a matched output direction; if so, step 10 is executed; otherwise, the crossbar sends the unicast packet to the output direction computed by the route calculation unit, to be output to the next router;
Step 10: judge whether the timestamp of the unicast packet has reached the threshold; if so, the reconfigurable crossbar sends the unicast packet to the output direction computed by the route calculation unit; otherwise, the crossbar sends the unicast packet to the matched output direction, to be output to the next router.
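The YX routing algorithm referenced in steps 6 and 8 routes a packet first along the Y (column) dimension and then along the X (row) dimension. A minimal sketch, assuming (x, y) grid coordinates with y growing toward the south (the patent does not fix a coordinate convention):

```python
def yx_route(cur, dst):
    """Return the output direction for a packet at router `cur` heading to
    router `dst`, resolving the Y dimension first, then X (YX routing).
    Coordinates are (x, y) tuples; y is assumed to grow southward."""
    (cx, cy), (dx, dy) = cur, dst
    if cy != dy:                      # resolve the Y dimension first
        return "south" if dy > cy else "north"
    if cx != dx:                      # then the X dimension
        return "east" if dx > cx else "west"
    return "local"                    # arrived: deliver to the attached PE

# e.g. yx_route((1, 1), (3, 0)) -> "north": the Y offset is served first
```

Because every packet fully resolves Y before turning into X, YX routing is deterministic and turn-restricted, which is what lets the reconfigured multicast path coexist with unicast traffic without deadlock.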
The transmission method of the network-on-chip-based deep neural network accelerator further provides that a deep neural network is deployed on the M×(N−1) processing units PE, with each row of processing units PE mapped to neurons of the same layer; the N−1 column routers acquire the parameters of the deep neural network model and store them as data packets in the on-chip memory units. The parameters required for computation by the processing units PE containing neurons of one layer are obtained from the processing units PE holding neurons of other layers or from the on-chip memory units; after the neurons on all processing units PE have finished computing the data in the packets, the final result is sent to external memory through the router.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention replaces the rightmost processing units of a traditional mesh network with on-chip memory units, effectively reducing the frequency of data exchange between the neural network accelerator and external memory.
2. Based on the traffic pattern of the deep neural network and its mapping onto the network on chip, the invention provides a path-based multicast strategy that reduces the number of data packets transmitted in the accelerator.
3. To reduce the transmission hop count of multicast packets, the invention designs an efficient reconfigurable multicast path whose length and position can change with the traffic in the network, giving a high reuse rate.
4. The router architecture can efficiently replicate multicast packets while providing single-cycle-per-hop transmission for them, effectively reducing packet transmission delay.
Drawings
FIG. 1 is a schematic diagram of the topology of a network on chip of the present invention;
FIG. 2 is a schematic diagram of a router microarchitecture of the invention;
FIG. 3 is a diagram illustrating a packet structure according to the present invention;
FIG. 4 is a schematic diagram of a multicast packet traversing a router path according to the present invention;
FIG. 5 is a schematic diagram of a reconfigurable crossbar configuration multicast path of the present invention;
FIG. 6 is a schematic diagram of a neural network model to network on chip mapping of the present invention;
fig. 7 is a schematic diagram of multicast traffic in accordance with the present invention.
Detailed Description
In this embodiment, as shown in fig. 1, the network on chip is a two-dimensional mesh network of size M×N, composed of M×N routers, M×(N−1) processing units PE and M on-chip memory units, where M and N are integers greater than 3. The local input and output ports of the routers in the first N−1 columns are connected with the M×(N−1) processing units PE respectively, and the local input and output ports of the M routers in the Nth column are connected with the M on-chip memory units respectively; the processing units PE and the on-chip memory units are interconnected through the mesh network. The rightmost routers of the network are used not only for on-chip traffic but are also responsible for exchanging information with the off-chip main memory.
As shown in fig. 2, the router includes: input and output ports in the east, south, west, north and local directions, an input buffer, a route calculation unit, a data packet processing unit, a data bypass, a reconfigurable crossbar, a link reconfiguration unit, and a plurality of multiplexers and demultiplexers; the multiplexers and demultiplexers are connected with the input buffer and the data bypass respectively. The data packet processing unit is mainly responsible for generating the multiplexer and demultiplexer gating signals and for modifying the destination address vector of multicast packets. The route calculation unit finds the appropriate output port for unicast packets. The link reconfiguration unit configures the reconfigurable crossbar so that it can maintain a unidirectional link. The reconfigurable crossbar is responsible not only for transmitting unicast packets to each output port in turn, but also for reconfiguring a fast channel for multicast packets, so that a dedicated multicast path is formed across several routers. When a multicast packet reaches a destination node, the multiplexer of the local output port can send the packet directly to the local output port, so replication of the multicast packet is achieved without passing through the crossbar.
Data packets are divided into multicast packets and unicast packets. The multicast packet includes: the packet type, the destination node address vector, the coordinates of the source node, and the payload. The unicast packet includes: the packet type, the coordinates of the destination node, the timestamp, the coordinates of the source node, and the payload.
The two packet formats are shown in parts (a) and (b) of fig. 3. The fields are explained as follows:
Type: 1 bit; 0 indicates a unicast packet and 1 indicates a multicast packet.
Destination address of a unicast packet: occupies 12 bits in total; the position of the destination node is represented by its coordinates in the mesh network, with coordinates x and y occupying 6 bits each.
Timestamp: when a multicast packet and a unicast packet contend, the multicast packet has higher priority; to prevent unicast packets from starving, a timestamp field is set for the unicast packet, occupying (n−12) bits, where n is the number of nodes in the network. When the value of the timestamp exceeds the preset threshold, the unicast packet is sent preferentially.
Destination node vector of a multicast packet: n represents the number of nodes in the network; each bit of the destination node vector corresponds to one router. If a bit of the vector is 1, the corresponding router is a destination router; if it is 0, it is not. The packet is deleted when the vector is all 0.
Source address: identical in unicast and multicast packets; the position of the source node is represented by its coordinates, with x and y occupying 6 bits each.
Payload: the valid data carried by the packet; the length of this field varies with the specific application.
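As a concrete illustration of the multicast format and of the bit-vector handling in steps 3 and 4, the destination node vector can be modelled as an integer with one bit per router. The helper names and the 16-node mesh size below are illustrative assumptions, not from the patent:

```python
N_NODES = 16  # assumed 4x4 mesh for the example

def make_multicast_header(dest_routers, src):
    """Build a multicast header: type = 1, plus a destination bit vector in
    which bit i is 1 iff router i is a destination (per fig. 3, part (b))."""
    vec = 0
    for r in dest_routers:
        vec |= 1 << r
    return {"type": 1, "dest_vector": vec, "src": src}

def clear_destination(header, router_id):
    """Step 4: on arrival at a destination router, flip its bit from 1 to 0.
    Returns True when the vector has become all zero (packet can be deleted)."""
    header["dest_vector"] &= ~(1 << router_id)
    return header["dest_vector"] == 0
```

A packet addressed to routers 6 and 7 carries the vector `0b11000000`; after both destinations have cleared their bits, the all-zero vector signals deletion, exactly as described for step 3.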
In specific implementation, the transmission method is carried out according to the following steps:
Step 1: the upstream router, or the processing unit PE connected with the current router, sends a data packet to an input port of the current router;
Step 2: the data packet processing unit of the current router judges whether the received packet is a multicast packet or a unicast packet according to the packet type. If it is a multicast packet, steps 3–6 are performed, as shown in fig. 4; if it is a unicast packet, steps 7–10 are performed;
Step 3: the data packet processing unit of the current router checks whether every bit of the destination node address vector in the multicast packet is 0; if so, the multicast packet is deleted; otherwise, step 4 is executed;
Step 4: judge whether the multicast packet has reached a destination router; if so, the data packet processing unit of the current router changes the value '1' of the bit corresponding to the destination router in the destination node address vector to '0' and sends a gating signal, and step 5 is executed; otherwise, the multicast packet is sent to the reconfigurable crossbar of the current router and step 6 is executed;
Step 5: according to the gating signal, the demultiplexer connected with the data packet processing unit of the current router sends the multicast packet onto the data bypass, so that it reaches the reconfigurable crossbar and the multiplexer of the router respectively; the multiplexer forwards the multicast packet to the local output port, to be delivered to the processing unit PE connected with the router;
Step 6: the reconfigurable crossbar of the current router judges whether the input direction of the multicast packet has a matched output direction; if so, the multicast packet is sent directly to the matched output direction; if not, the router corresponding to the first bit equal to 1 in the destination node address vector is taken as the next destination router, and the multicast packet is output from the corresponding output port using the YX routing algorithm; according to the path taken by the YX-routed multicast packet, the link reconfiguration unit of the current router pairs the corresponding input and output directions for the transmission of subsequent packets;
Step 7: according to the gating signal, the demultiplexer connected with the data packet processing unit of the current router sends the unicast packet to the input buffer, which receives and stores unicast packets in first-in-first-out order;
Step 8: the reconfigurable crossbar of the current router receives the unicast packet, and the route calculation unit of the current router computes the output direction of the unicast packet with the YX routing algorithm according to the destination node address in the packet;
Step 9: the reconfigurable crossbar of the current router judges whether the input direction of the unicast packet has a matched output direction; if so, step 10 is executed; otherwise, the crossbar sends the unicast packet to the output direction computed by the route calculation unit, to be output to the next router;
Step 10: judge whether the timestamp of the unicast packet has reached the threshold; if so, the reconfigurable crossbar sends the unicast packet to the output direction computed by the route calculation unit; otherwise, the crossbar sends the unicast packet to the matched output direction, to be output to the next router.
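Steps 2–10 above can be condensed into the following control sketch of the data packet processing unit. This is a simplified behavioural model under stated assumptions, not the patented microarchitecture; the `state` dictionary and the fixed `yx_out` placeholder are illustrative:

```python
def dispatch(state, pkt):
    """One hop of steps 2-10 for the current router (simplified model).
    `state`: {'id': node index, 'paired': matched output direction or None,
              'threshold': starvation threshold, 'local': [], 'fifo': []}.
    Returns 'drop', 'delivered', or the chosen output direction.
    yx_out stands in for the YX route computation (fixed here for brevity)."""
    def yx_out(_dest):
        return "east"                                 # placeholder direction
    if pkt["type"] == 1:                              # multicast: steps 3-6
        if pkt["dest_vector"] == 0:                   # step 3: no destinations left
            return "drop"
        if pkt["dest_vector"] & (1 << state["id"]):   # step 4: we are a destination
            pkt["dest_vector"] &= ~(1 << state["id"])
            state["local"].append(pkt)                # step 5: copy to local via bypass
            if pkt["dest_vector"] == 0:
                return "delivered"
        # step 6: reuse the paired direction, else YX-route toward the next '1' bit
        return state["paired"] or yx_out(pkt["dest_vector"])
    state["fifo"].append(pkt)                         # step 7: FIFO input buffering
    computed = yx_out(pkt["dest"])                    # step 8: route calculation
    if state["paired"] is None:                       # step 9: no multicast pairing
        return computed
    # step 10: a starved unicast packet takes the computed route; otherwise
    # it follows the direction already paired for the multicast link
    return computed if pkt["timestamp"] >= state["threshold"] else state["paired"]
```

Note how a multicast packet never enters the FIFO: it either exits on the data bypass (step 5) or on the paired direction (step 6), which is what enables the single-cycle-per-hop claim.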
After the first multicast packet has gone through the above process, the reconfigurable crossbars in the routers along the path have completed the pairing of input and output directions. Subsequent multicast packets skip the buffering, route calculation, crossbar allocation, virtual channel selection and crossbar traversal stages: apart from the first packet, which must build the multicast link, every other packet passes only through the data packet processing unit before being sent directly to an output port, thereby guaranteeing single-cycle-per-hop transmission of multicast packets.
Fig. 5 shows the construction of a multicast link in the accelerator platform when the computation results of the neurons of layer i are sent to all neurons of layer i+1. By configuring the reconfigurable crossbars of all routers on the path to pair crossbar inputs with outputs, a path is formed along which multicast packets are transmitted directly to the routers' output ports without the crossbar allocation and crossbar traversal stages. The reconfigurable crossbars of all routers carrying neuron-cluster mappings of layers i and i+1 are reconfigured, and all source and destination nodes are connected in one path, so that multicast packets between them achieve single-cycle-per-hop transmission while the multicast link keeps a high reuse rate. After a processing unit PE finishes its computation, it sends a link release signal: if the input and output directions of the upstream router's reconfigurable crossbar are no longer paired, the pairing of the reconfigurable crossbar connected with this processing unit PE is released; otherwise, the router waits for the upstream router to release its pairing first, and then releases the pairing of its own reconfigurable crossbar.
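The pairing and release behaviour of the link reconfiguration unit described above can be modelled as a small per-router state machine. This is an illustrative sketch (class and method names are assumptions; the upstream-first release ordering is coordinated across routers and is not shown here):

```python
class LinkReconfigUnit:
    """Models one router's input->output pairing for a multicast path.
    A router keeps at most one pairing, so it cannot take part in two
    differently-directed multicast links at the same time."""
    def __init__(self):
        self.pair = None            # (input_dir, output_dir) or None

    def build(self, in_dir, out_dir):
        """Pair directions when the first multicast packet passes (step 6).
        Returns False if the router is already part of a different link."""
        if self.pair is None:
            self.pair = (in_dir, out_dir)
            return True
        return self.pair == (in_dir, out_dir)

    def matched_output(self, in_dir):
        """Single-cycle fast path for subsequent multicast packets."""
        if self.pair and self.pair[0] == in_dir:
            return self.pair[1]
        return None

    def release(self):
        """Tear the pairing down after the PE signals link release."""
        self.pair = None
```

Keeping at most one pairing per router is what the description means by the same router being unable to join two multicast links with different directions simultaneously.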
In a large-scale neural network, if each processing unit performs the computation of only one neuron, the traffic in the accelerator platform increases sharply, so that the communication time of the processing units far exceeds their computation time and the efficiency of the platform drops. Suitable clustering and mapping algorithms are therefore required to balance the communication time and computation time of the processing units. According to the size of the neural network, several neurons of each layer are combined into one neuron cluster, and each neuron cluster may contain only neurons of the same layer. The neuron clusters are then mapped onto the network on chip, with each row of the network on chip mapping only neuron clusters from the same layer of the neural network. In this embodiment, a deep neural network is deployed on the M×(N−1) processing units PE, and each row of processing units PE is mapped to neurons of the same layer; the N−1 column routers acquire the parameters of the deep neural network model and store them as data packets in the on-chip memory units. Part (a) of fig. 6 shows a four-layer deep neural network; the output of one layer of neurons is used as the input of the next layer and is sent to all neurons of that layer. Part (b) of fig. 6 shows the division of the neurons in part (a), where at most two neurons form one neuron cluster and each cluster contains only neurons from the same layer. After all neurons of the network have been divided, communication between neurons becomes communication between neuron clusters, which are then mapped onto the mesh network in units of clusters. Part (c) of fig. 6 shows the mapping of part (b) onto the mesh network, assuming that each router contains only one processing unit and each processing unit handles one neuron cluster. Each row of the network on chip maps only neuron clusters of the same layer, and different neuron clusters exchange information through the routers.
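The clustering step — grouping same-layer neurons into bounded clusters and mapping each layer's clusters onto one mesh row — can be sketched as follows. The cluster size of 2 matches the fig. 6 example; the layer sizes and the row-major assignment are illustrative assumptions:

```python
def cluster_layers(layer_sizes, cluster_size=2):
    """Split each layer's neurons into clusters of at most `cluster_size`,
    never mixing layers, and assign layer i's clusters to mesh row i.
    Returns a dict mapping (row, col) -> list of neuron indices in that layer."""
    mapping = {}
    for row, n_neurons in enumerate(layer_sizes):
        neurons = list(range(n_neurons))
        n_clusters = (n_neurons + cluster_size - 1) // cluster_size
        for col in range(n_clusters):
            mapping[(row, col)] = neurons[col * cluster_size:(col + 1) * cluster_size]
    return mapping

# A four-layer network in the spirit of fig. 6 (layer sizes assumed):
m = cluster_layers([4, 5, 3, 2])
```

With these assumed sizes, row 1 (the second layer, 5 neurons) receives three clusters: two of size 2 and one of size 1, and every row holds clusters from exactly one layer, as the mapping rule requires.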
The parameters required by the processing units PE containing neurons of one layer are obtained from the processing units PE of other layers or from the on-chip memory units. In a deep neural network, multicast traffic often follows a fixed communication pattern in a workload: the output of the neurons of one layer serves as the input of the neurons of the next layer and is sent to all neurons of that layer. Considering this traffic pattern and the mapping of neuron clusters onto the network on chip, when the processing units holding one layer's neuron clusters send packets to the processing units holding the next layer's clusters, the destination nodes of those packets occupy adjacent positions on the two-dimensional mesh network, and the computation results of the upper layer's processing units PE are sent through the routers to all processing units PE of the next layer. Fig. 7 illustrates the mapping of part of a deep neural network onto a two-dimensional mesh network: a gray node indicates that a neuron cluster is mapped onto the processing unit of that node, the neuron clusters in each row of processing units come from the same layer of the network, a blank node carries no neuron-cluster mapping, and a shaded node indicates an on-chip memory module connected to a router. Nodes 1 to 3 of the first layer must act as source nodes and send multicast packets to destination nodes 6 to 9 of the second layer. If a separate multicast link were established for each of nodes 1, 2 and 3, link construction would cost considerable time and the reuse rate of the links would be extremely low.
Therefore, the multicast link established by the invention includes all source nodes of one layer and all destination nodes of the next layer within a single link, ensuring a high reuse rate of the link. Node 5, which is connected to the on-chip storage unit, must send parameters such as weights to nodes 11 and 12, onto which the third layer of neurons is mapped; for this traffic the acceleration platform likewise establishes a multicast path that provides single-cycle-per-hop transmission for the multicast data packets. Because no additional bypass is added to the crossbar, the same router cannot simultaneously participate in the construction of two multicast links in different directions, which guarantees the correct transmission of unicast data packets. After the neurons on all processing units PE finish computing on the data carried in the data packets, the final result is sent to the external memory through the routers.
The large volume of one-to-many traffic in the accelerator would otherwise degrade platform performance severely. The invention therefore provides a path-based multicast transmission mechanism that fully exploits the traffic characteristics of the neural network and scales well. A matching router architecture is also designed so that multicast data packets can be replicated efficiently and transmitted at a single cycle per hop. Simulation results show that the proposed scheme effectively reduces classification delay, average packet delay, and the number of packets transmitted over the network.

Claims (2)

1. The transmission method of the deep neural network accelerator based on the network-on-chip is characterized in that the network-on-chip is a two-dimensional mesh network of scale M×N, composed of M×N routers, M×(N-1) processing units PE, and M on-chip storage units, where M and N are integers greater than 3; the router comprises: input and output ports in the east, south, west, north, and local directions, an input buffer, a route calculation unit, a data packet processing unit, a data bypass, a reconfigurable crossbar, a link reconfiguration unit, and a plurality of multiplexers and demultiplexers; the multiplexers and demultiplexers are connected to the input buffer and the data bypass, respectively;
the local input and output ports of the routers in the first N-1 rows are connected to the M×(N-1) processing units PE, respectively, and the local input and output ports of the M routers in the Nth row are connected to the M on-chip storage units, respectively;
the data packets are divided into multicast data packets and unicast data packets; a multicast data packet comprises: the packet type, a destination node address vector, the coordinates of the source node, and the payload; each bit of the destination node address vector in a multicast data packet corresponds to one router: if a bit of the vector is 1, the router corresponding to that bit is a destination router; if it is 0, the corresponding router is not a destination router;
a unicast data packet comprises: the packet type, the coordinates of the destination node, a timestamp, the coordinates of the source node, and the payload;
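The two packet formats above can be sketched as follows; the Python representation, field names, and bitmask helpers are assumptions for illustration only, not the patent's flit encoding:

```python
# Illustrative models of the claim's two packet formats. Field widths,
# names, and types are invented; only the field lists follow the claim.
from dataclasses import dataclass

@dataclass
class MulticastPacket:
    ptype: int        # packet type tag (multicast)
    dest_vector: int  # bitmask: bit k == 1 -> router k is a destination
    src: tuple        # (x, y) coordinates of the source node
    payload: bytes

@dataclass
class UnicastPacket:
    ptype: int        # packet type tag (unicast)
    dest: tuple       # (x, y) coordinates of the destination node
    timestamp: int    # age counter compared against a threshold in step 10
    src: tuple
    payload: bytes

def is_destination(pkt: MulticastPacket, router_id: int) -> bool:
    """A router is a destination iff its bit in the vector is 1."""
    return (pkt.dest_vector >> router_id) & 1 == 1

def clear_destination(pkt: MulticastPacket, router_id: int) -> None:
    """Step 4 of the method: flip the arrived router's bit from 1 to 0."""
    pkt.dest_vector &= ~(1 << router_id)
```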
the transmission method comprises the following steps:
step 1: an upstream router, or a processing unit PE connected to the current router, sends a data packet to an input port of the current router;
step 2: the data packet processing unit of the current router determines from the type field of the received data packet whether it is a multicast or a unicast data packet; if it is a multicast data packet, steps 3 to 6 are executed; if it is a unicast data packet, steps 7 to 10 are executed;
step 3: the data packet processing unit of the current router checks whether every bit of the destination node address vector in the multicast data packet is 0; if so, the multicast data packet is deleted; otherwise, step 4 is executed;
step 4: it is determined whether the multicast data packet has reached a destination router; if so, the data packet processing unit of the current router changes the bit corresponding to the destination router in the destination node address vector from '1' to '0', sends a gating signal, and step 5 is executed; otherwise, the multicast data packet is sent to the reconfigurable crossbar of the current router and step 6 is executed;
step 5: according to the gating signal, the demultiplexer connected to the data packet processing unit of the current router sends the multicast data packet onto the data bypass, from which it reaches the reconfigurable crossbar and the multiplexer of the router; the multiplexer forwards the multicast data packet to the local output port so that it is delivered to the processing unit PE connected to the router;
step 6: the reconfigurable crossbar of the current router determines whether a matched output direction exists for the input direction of the multicast data packet; if so, the multicast data packet is sent directly to the matched output direction; if not, the router corresponding to the first bit equal to 1 in the destination node address vector is selected as the next destination router, and the multicast data packet is output from the corresponding output port using the YX routing algorithm; the link reconfiguration unit of the current router then pairs the corresponding input and output directions, according to the multicast data packet just routed by the YX algorithm, for the transmission of subsequent data packets;
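A minimal software model of the YX (vertical-first, then horizontal) dimension-order routing invoked in steps 6 and 8 might look like this; the port names and coordinate convention are assumptions:

```python
# Sketch of YX dimension-order routing: correct the Y coordinate first,
# then the X coordinate. Coordinates are (x, y); port names are invented.

def yx_route(cur, dst):
    """Return the output port for one hop from `cur` toward `dst`."""
    cx, cy = cur
    dx, dy = dst
    if cy != dy:                          # Y dimension first
        return "SOUTH" if dy > cy else "NORTH"
    if cx != dx:                          # then X dimension
        return "EAST" if dx > cx else "WEST"
    return "LOCAL"                        # arrived: eject to the local port
```

Because every packet corrects Y before X, YX routing is deterministic and deadlock-free on a 2D mesh, which is why it is a common choice for the fallback path here.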
step 7: according to the gating signal, the demultiplexer connected to the data packet processing unit of the current router sends the unicast data packet to the input buffer, which stores received unicast data packets in first-in-first-out order;
step 8: the reconfigurable crossbar of the current router receives the unicast data packet, and the route calculation unit of the current router computes its output direction with the YX routing algorithm according to the destination node address in the unicast data packet;
step 9: the reconfigurable crossbar of the current router determines whether a matched output direction exists for the input direction of the unicast data packet; if so, step 10 is executed; otherwise, the reconfigurable crossbar sends the unicast data packet to the output direction computed by the route calculation unit, for delivery to the next router;
step 10: it is determined whether the timestamp of the unicast data packet has reached a threshold; if so, the reconfigurable crossbar sends the unicast data packet to the output direction computed by the route calculation unit; otherwise, the reconfigurable crossbar sends the unicast data packet to the matched output direction, for delivery to the next router.
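The per-packet decisions of steps 3-6 and 9-10 can be summarized in a behavioral sketch. The function signatures, the `matched_output` argument (modeling a crossbar input/output pairing left by a previous multicast), and the threshold value are simplifications assumed for illustration, not the patent's hardware implementation:

```python
# Behavioral sketch of the routing decisions in claim 1 for one router.

TIMESTAMP_THRESHOLD = 8  # assumed value; the claim only names "a threshold"

def handle_multicast(vec, router_id, matched_output, route_fn):
    """Steps 3-6. Returns (new_vector, action), where action is 'drop',
    'deliver' / 'deliver+forward', or an output port name."""
    if vec == 0:                          # step 3: all-zero vector -> delete
        return vec, "drop"
    if (vec >> router_id) & 1:            # step 4: this router is a destination
        vec &= ~(1 << router_id)          # clear our bit; copy goes to local PE
        return vec, "deliver+forward" if vec else "deliver"
    # step 6: reuse an established link if this input direction is matched,
    # otherwise route toward the first remaining destination (YX routing)
    port = matched_output if matched_output else route_fn(vec)
    return vec, port

def handle_unicast(timestamp, matched_output, computed_output):
    """Steps 9-10: use the matched direction while it exists and the packet
    is young; fall back to the route-computed direction otherwise."""
    if matched_output is None:            # step 9: no match -> computed port
        return computed_output
    if timestamp >= TIMESTAMP_THRESHOLD:  # step 10: aged out -> computed port
        return computed_output
    return matched_output                 # step 10: otherwise follow the match
```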
2. The transmission method of the network-on-chip-based deep neural network accelerator according to claim 1, wherein the deep neural network is deployed on the M×(N-1) processing units PE, each row of processing units PE mapping neurons of the same layer; the routers in column N-1 acquire the parameters of the deep neural network model and store them as data packets in the on-chip storage units; the parameters required for computation by the processing units PE containing neurons of the same layer are obtained from the processing units PE of neurons in other layers or from the on-chip storage units; and after the neurons on all processing units PE finish computing on the data in the data packets, the final result is sent to the external memory through the routers.
CN202310346394.5A 2023-04-03 2023-04-03 Network-on-chip-based transmission method of deep neural network accelerator Pending CN116383126A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310346394.5A CN116383126A (en) 2023-04-03 2023-04-03 Network-on-chip-based transmission method of deep neural network accelerator


Publications (1)

Publication Number Publication Date
CN116383126A true CN116383126A (en) 2023-07-04

Family

ID=86978260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310346394.5A Pending CN116383126A (en) 2023-04-03 2023-04-03 Network-on-chip-based transmission method of deep neural network accelerator

Country Status (1)

Country Link
CN (1) CN116383126A (en)

Similar Documents

Publication Publication Date Title
Liu et al. Neu-NoC: A high-efficient interconnection network for accelerated neuromorphic systems
CN103580890B (en) A kind of reconfigurable on-chip network structure and its configuration method
Firuzan et al. Reconfigurable network-on-chip for 3D neural network accelerators
CN102437953B (en) Low-power-consumption adaptive routing method in network on chip
CN111245730B (en) Routing system and communication method of network on chip
CN113839878A (en) Data-intensive application-oriented network-on-chip approximate communication system
Firuzan et al. Reconfigurable communication fabric for efficient implementation of neural networks
US8953497B2 (en) Modified tree-based multicast routing schema
CN116260760A (en) Topology reconstruction method based on flow sensing in multi-core interconnection network
CN103546397A (en) Self-routing Omega network structure supporting random ordering
Chen et al. Star-type architecture with low transmission latency for a 2D mesh NOC
CN113722266B (en) Bridge, acceleration equipment interconnection system and data acceleration processing method
CN116383126A (en) Network-on-chip-based transmission method of deep neural network accelerator
CN114116596A (en) Dynamic relay-based infinite routing method and architecture for neural network on chip
CN107276920A (en) A kind of distributed flow control system and mechanism applied to hybrid three-dimensional network-on-chip
Ueno et al. VCSN: Virtual circuit-switching network for flexible and simple-to-operate communication in HPC FPGA cluster
Rahman et al. High performance hierarchical torus network under matrix transpose traffic patterns
Najaf-abadi et al. The effect of adaptivity on the performance of the OTIS-hypercube under different traffic patterns
Anjali et al. Design and evaluation of virtual channel router for mesh-of-grid based NoC
Zhang A hierarchical dataflow architecture for large-scale multi-FPGA biophysically accurate neuron simulation
Zhang et al. A cellular NoC architecture based on butterfly network coding (CBNoC)
Song et al. Hierarchical star: An optimal noc topology for high-performance soc design
CN107104909B (en) Fault-tolerant special network-on-chip topology generation method
Shen et al. PN-TMS: Pruned Node-fusion Tree-based Multicast Scheme for Efficient Neuromorphic Systems
CN113079093B (en) Routing method based on hierarchical Q-routing planning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination