CN112734018A - Neural network hardware accelerator - Google Patents

Neural network hardware accelerator

Info

Publication number
CN112734018A
CN112734018A (application number CN202011594118.3A)
Authority
CN
China
Prior art keywords
data
channel
channels
computing
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011594118.3A
Other languages
Chinese (zh)
Inventor
王佳东
李远超
蔡权雄
牛昕宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Industry Research Kunyun Artificial Intelligence Research Institute Co ltd
Original Assignee
Shandong Industry Research Kunyun Artificial Intelligence Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Industry Research Kunyun Artificial Intelligence Research Institute Co ltd
Priority to CN202011594118.3A
Publication of CN112734018A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/22 - Microcontrol or microprogram arrangements
    • G06F9/28 - Enhancement of operational speed, e.g. by using several microcontrol devices operating in parallel


Abstract

The application discloses a neural network hardware accelerator, belonging to the technical field of hardware acceleration. The neural network hardware accelerator comprises N computing channels and N data sorting channels, the N computing channels being connected to the N data sorting channels in one-to-one correspondence. During operation, each register R delays the input data by one clock cycle before outputting it to the corresponding data path, and the resulting differences in the times at which the data paths output their data are compensated by the data sorting channels. In the embodiments of the application, the delayed output of the registers R shortens the time span over which the neural network hardware accelerator inputs and outputs data, so that the neural network hardware accelerator can operate at a higher frequency.

Description

Neural network hardware accelerator
Technical Field
The application relates to the technical field of hardware acceleration, in particular to a neural network hardware accelerator.
Background
Neural network hardware accelerators typically have multiple data paths that perform calculations on the input data of the neural network. Computing on multiple data paths in parallel improves calculation efficiency.
In the related art, as the number of data paths increases, the wiring from the input terminal of the neural network to the input terminal of each data path grows longer. The input data of the neural network therefore reaches the data paths at different times, the data paths output their results at different times, and the operating frequency of the neural network accelerator cannot be increased.
Disclosure of Invention
The embodiments of the application provide a neural network hardware accelerator with a short data output time span and a high operating frequency. The technical solution is as follows:
in a first aspect, a neural network hardware accelerator is provided, comprising N computing channels and N data sorting channels, wherein N is an integer greater than or equal to 2 and the N computing channels correspond to the N data sorting channels one to one;
each of the N computation channels comprises a register R and a data path, and an output end of the register R is connected with an input end of the data path;
the input end of the register R of the first computing channel in the N computing channels is used for inputting input data of a neural network, the input end of the register R of the ith computing channel in the N computing channels is connected with the output end of the register R of the (i-1) th computing channel, and i is an integer which is greater than or equal to 2 and less than or equal to N;
and the output end of the data path of each of the N computing channels is connected with the input end of the corresponding data sorting channel, and each of the N data sorting channels is used for delaying the input data for at least one clock cycle and then outputting the delayed data, so that the output ends of the N data sorting channels simultaneously output the data.
In this application, after the register R of the first computing channel receives data, it transfers the input data to the data path of the first computing channel and to the register R of the second computing channel. After the register R of the second computing channel receives the data, it transfers the data to the data path of the second computing channel and to the register R of the third computing channel, and so on. In this process, each register R delays the input data by one clock cycle before outputting it to the corresponding data path, and the resulting differences in the times at which the data paths output their data are compensated by the data sorting channels. In the embodiments of the application, the delayed output of the registers R shortens the time span over which the neural network hardware accelerator inputs and outputs data. As this time span becomes shorter, the operating frequency of the neural network hardware accelerator becomes higher and its peak computing power becomes higher.
Optionally, the first data sorting channel of the N data sorting channels has N+X registers R connected in series; the ith data sorting channel of the N data sorting channels has N+1-i+X registers R connected in series; and X is an integer greater than or equal to zero.
Optionally, a transmission time of data between any two adjacent registers R in each of the N data sorting channels is less than one clock cycle.
Optionally, the sum of the transmission time of data between the output end of the data path of the ith computing channel and the input end of the corresponding data sorting channel and the transmission time of data between the output end of the register R of the ith computing channel and the input end of the data path is less than one clock cycle;
the sum of the transmission time of data between the output of the data path of the first computing channel and the input of the corresponding data sorting channel, and the transmission time of data between the output of the register R of the first computing channel and the input of the data path, is less than one clock cycle.
Optionally, a transmission time of data between the input terminal of the register R of the ith computation channel and the output terminal of the register R of the (i-1) th computation channel is less than one clock cycle.
Optionally, the output ends of the N data sorting channels are connected.
In a second aspect, a neural network hardware accelerator is provided, comprising N computing channels and N-1 data sorting channels, wherein N is an integer greater than or equal to 2 and the first N-1 computing channels of the N computing channels correspond to the N-1 data sorting channels one to one;
each of the N computation channels comprises a register R and a data path, and an output end of the register R is connected with an input end of the data path;
the input end of the register R of the first computing channel in the N computing channels is used for inputting input data of a neural network, the input end of the register R of the ith computing channel in the N computing channels is connected with the output end of the register R of the (i-1) th computing channel, and i is an integer which is greater than or equal to 2 and less than or equal to N;
and the output end of the data path of each of the first N-1 computing channels is connected with the input end of the corresponding data sorting channel, and each of the N-1 data sorting channels is used for delaying the input data for at least one clock cycle and then outputting the delayed data, so that the output ends of the N-1 data sorting channels and the output end of the data path of the Nth computing channel in the N computing channels simultaneously output data.
In this application, after the register R of the first computing channel receives data, it transfers the input data to the data path of the first computing channel and to the register R of the second computing channel. After the register R of the second computing channel receives the data, it transfers the data to the data path of the second computing channel and to the register R of the third computing channel, and so on. In this process, each register R delays the input data by one clock cycle before outputting it to the corresponding data path, and the resulting differences in the times at which the data paths output their data are compensated by the data sorting channels, so that the output ends of the N-1 data sorting channels and the data path of the Nth computing channel output data simultaneously. In the embodiments of the application, the delayed output of the registers R shortens the time span over which the neural network hardware accelerator inputs and outputs data. As this time span becomes shorter, the operating frequency of the neural network hardware accelerator becomes higher and its peak computing power becomes higher.
Optionally, a first data sorting channel of the N-1 data sorting channels has N-1 registers R connected in series; the jth data sorting channel in the N-1 data sorting channels is provided with N-j registers R which are connected in series, wherein j is an integer which is larger than or equal to zero and smaller than or equal to N-1.
Optionally, the output ends of the N-1 data sorting channels are connected to the output end of the data path of the nth computing channel.
Optionally, the neural network hardware accelerator further comprises: an output bus;
the output bus is connected with the output ends of the N-1 data sorting channels, and the output bus is connected with the output end of the data path of the Nth computing channel;
the transmission time of data between the output of the data path of the nth computational channel and the output bus is less than one clock cycle.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed for describing the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application; other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
FIG. 1 is a schematic diagram of a neural network hardware accelerator in the related art;
FIG. 2 is a schematic structural diagram of a first neural network hardware accelerator provided in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a second neural network hardware accelerator provided in an embodiment of the present application;
fig. 4 is a schematic structural diagram of a third neural network hardware accelerator provided in an embodiment of the present application.
The reference numerals in the figures have the following meanings:
Related art:
10. a neural network hardware accelerator;
102. an input of a neural network;
110. a data path;
The present application:
20. a neural network hardware accelerator;
202. an input of a neural network;
210. a computing channel;
212. a data path;
220. a data sorting channel;
230. an output bus.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
It should be understood that "a plurality" in this application means two or more. In the description of the present application, "/" means "or" unless otherwise stated; for example, A/B may mean A or B. "And/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, for clarity of description, the terms "first", "second", and the like are used to distinguish between identical or similar items having substantially the same function and effect. Those skilled in the art will appreciate that these terms do not limit quantity or execution order.
Before explaining the embodiments of the present application in detail, an application scenario of the embodiments of the present application will be described.
As shown in fig. 1, in the related art, the neural network hardware accelerator 10 generally has a plurality of data paths 110. The input 102 of the neural network is connected to an input of each data path 110. The input 102 of the neural network receives the input data of the neural network, and the input data is routed to the input of each data path 110. Each data path 110 performs calculations on the input data. The plurality of data paths 110 compute in parallel, which improves calculation efficiency.
However, as the number of data paths 110 increases, the wiring from the input 102 of the neural network to the inputs of the data paths 110 gradually lengthens, and the time required for the input data to travel from the input 102 of the neural network to the input of each data path 110 also gradually increases. The input data of the neural network is then input to the plurality of data paths 110 at different times, so the data output times of the plurality of data paths 110 differ, and neither the operating frequency nor the peak computing power of the neural network accelerator 10 can be increased. For example, if the data transmission time between the data path 110 closest to the input 102 of the neural network and the input 102 is 1 ns, and the data transmission time between the data path 110 farthest from the input 102 and the input 102 is 10 ns, the operating frequency of the neural network accelerator 10 is 1/10 ns, i.e., 100 MHz. If the data transmission time between each of the data paths 110 and the input 102 of the neural network were 1 ns, the operating frequency of the neural network accelerator 10 would be 1/1 ns, i.e., 1000 MHz.
The application provides a neural network hardware accelerator with a short data output time span and a high operating frequency.
The neural network hardware accelerator 20 provided in the embodiments of the present application is explained in detail below.
Fig. 2 is a schematic structural diagram of a neural network hardware accelerator 20 according to an embodiment of the present application. Referring to fig. 2, the neural network hardware accelerator 20 includes N computing channels 210 and N data sorting channels 220, where N is an integer greater than or equal to 2.
Specifically, each computing channel 210 of the N computing channels 210 obtains the input data of the neural network and performs calculations on it. Each computing channel 210 of the N computing channels 210 includes a register R and a data path 212. In each computing channel 210, the output end of the register R is connected to the input end of the data path 212, so that within the computing channel 210 data is transferred from the register R to the data path 212. The register R delays the input data of the neural network before outputting it, and the delay may be one clock cycle. In other words, after the register R obtains the input data of the neural network, it delays the data by one clock cycle and then transmits it to the data path 212. The data path 212 performs calculations on the input data. Generally, different data paths 212 use different calculation parameters when calculating on the input data of the neural network. For example, each data path 212 of the N computing channels 210 may perform a convolution calculation on the input data of the neural network, with different convolution kernel weights configured in each data path 212. The N computing channels 210 may compute on the input data in parallel, as the sketch below illustrates.
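As a concrete illustration of per-channel computation with different parameters, the following sketch runs the same input through N independent 1-D convolutions. This is only a minimal sketch; the kernel length, the kernel values, and all names are illustrative assumptions and are not taken from the patent.

```python
import numpy as np

# A minimal sketch (not from the patent) of N data paths applying per-channel
# convolution kernels to the same input data in parallel.
N = 4                                              # number of computing channels
input_data = np.arange(16, dtype=float)            # input data of the neural network
kernels = [np.random.randn(3) for _ in range(N)]   # different weights per data path

# Each data path computes independently; in hardware these run in parallel.
outputs = [np.convolve(input_data, k, mode="valid") for k in kernels]
for idx, out in enumerate(outputs, start=1):
    print(f"data path {idx}: first outputs {out[:3]}")
```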
Among the N computing channels 210, the input end of the register R of the first computing channel 210 receives the input data of the neural network. "First" here merely distinguishes this channel from the "ith" channel described below and does not limit the position of the computing channel 210. For ease of understanding, in the embodiment shown in FIG. 2, the first computing channel 210 is the computing channel 210 located in the row numbered "(1)". The input end of the register R of the first computing channel 210 may be connected to the input 202 of the neural network so as to receive the input data of the neural network. Among the N computing channels 210, the input end of the register R of the ith computing channel 210 is connected to the output end of the register R of the (i-1)th computing channel 210, where i is an integer greater than or equal to 2 and less than or equal to N. In other words, the input end of the register R of the second computing channel 210 is connected to the output end of the register R of the first computing channel 210; the input end of the register R of the third computing channel 210 is connected to the output end of the register R of the second computing channel 210; and the input end of the register R of the Nth computing channel 210 is connected to the output end of the register R of the (N-1)th computing channel 210. In the embodiment shown in FIG. 2, the second computing channel 210 is the computing channel 210 located in the row numbered "(2)"; the third computing channel 210 is located in the row numbered "(3)"; the (N-1)th computing channel 210 is located in the row numbered "(N-1)"; and the Nth computing channel 210 is located in the row numbered "(N)". In the embodiments of the present application, i takes each value from 2 to N in turn.
The N computing channels 210 correspond one to one to the N data sorting channels 220. The output end of the data path 212 of each computing channel 210 of the N computing channels 210 is connected to the input end of the corresponding data sorting channel 220, so that after each data path 212 finishes calculating on the input data, it outputs the result to the corresponding data sorting channel 220. The data sorting channel 220 delays the data input to it. In general, the data sorting channel 220 delays the input data by at least one clock cycle before outputting it, so that the output ends of the N data sorting channels 220 output data simultaneously. In the embodiments of the present application, the data path 212 of the first computing channel 210 obtains the input data of the neural network delayed by one clock cycle, through the register R of the first computing channel 210. The data path 212 of the second computing channel 210 obtains the input data delayed by two clock cycles, through the registers R of the first and second computing channels 210, and so on; the data path 212 of the Nth computing channel 210 obtains the input data delayed by N clock cycles, through the registers R of the first through Nth computing channels 210. The N data sorting channels 220 therefore need to compensate the resulting differences in the times at which the data paths 212 output their data, so that the output data appear simultaneously.
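The accumulating one-cycle delays along the register chain and the compensation performed by the data sorting channels can be illustrated with a small cycle-level toy simulation. This is only a sketch under simplifying assumptions: registers are modeled as one-cycle delays, data paths as combinational functions, and data sorting channels as shift registers; all names and values are illustrative, and the snippet is not the patent's hardware description.

```python
N = 4          # number of computing channels (illustrative)
X = 0          # extra registers per data sorting channel

def data_path(ch, x):
    """Stand-in for the per-channel computation of data path `ch`."""
    return None if x is None else (ch, x)

# Register chain feeding the data paths: regs[i] is register R of channel i+1.
regs = [None] * N
# Data sorting channel of channel i+1 has N + 1 - (i + 1) + X registers in series.
sorting = [[None] * (N - i + X) for i in range(N)]

samples = iter(range(8))                 # input data of the neural network
for cycle in range(2 * N + 2):
    # Clock edge: every register samples its input "simultaneously", so the
    # sorting channels are updated from the old register values first.
    sorting = [[data_path(i + 1, regs[i])] + chain[:-1]
               for i, chain in enumerate(sorting)]
    regs = [next(samples, None)] + regs[:-1]     # register chain shifts by one
    outputs = [chain[-1] for chain in sorting]   # outputs of the N sorting channels
    print(f"after edge {cycle:2d}: {outputs}")
# From edge N + X onward, all N outputs update together, each edge carrying the
# results of every channel for one and the same input sample: each sample
# passes through N + 1 + X registers in total before reaching the outputs.
```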
In this embodiment, when the neural network hardware accelerator operates, the input data of the neural network is applied to the input end of the register R of the first computing channel 210. The input data passes through the register R of the first computing channel 210 and is output to the data path 212 of the first computing channel 210 and to the register R of the second computing channel 210. After the register R of the second computing channel 210 obtains the input data, it transmits the data to the data path 212 of the second computing channel 210 and to the register R of the third computing channel 210, and so on, until the register R of the Nth computing channel 210 obtains the input data and transmits it to the data path 212 of the Nth computing channel 210. Each data path 212 performs calculations on the input data and outputs the calculation result to the data sorting channel 220 corresponding to its computing channel 210. The N data sorting channels 220 delay the data output by the N computing channels 210 so that the N data sorting channels 220 output data simultaneously. Through the delayed output of the registers R, the neural network hardware accelerator 20 shortens the time span over which it inputs and outputs data. As the time span over which the neural network hardware accelerator 20 outputs data becomes shorter, its operating frequency becomes higher and its peak computing power becomes higher.
For example, in the related art, if the data transmission time between the data path 110 closest to the input 102 of the neural network and the input 102 is 1 ns, and the data transmission time between the data path 110 farthest from the input 102 and the input 102 is 10 ns, the operating frequency of the neural network accelerator 10 is 1/10 ns, i.e., 100 MHz. In the present application, the input 202 of the neural network is connected, through the register R of the first computing channel 210, only to the data path 212 of the first computing channel 210 and to the register R of the second computing channel 210; the output end of the register R of the second computing channel 210 is connected only to the data path 212 of the second computing channel 210 and to the register R of the third computing channel 210, and so on. In the neural network hardware accelerator 20, the register R in each computing channel 210 drives only two downstream inputs, so the maximum fan-out (the number of inputs driven by a single register R carrying the input data) of the neural network hardware accelerator 20 is reduced from N to 2. At the same time, inserting the registers R into the wiring that carries the data reduces the delay that the wiring adds to data transmission. In the present application, if the maximum delay between any two registers R is 5 ns, the operating frequency of the neural network hardware accelerator 20 is 1/5 ns, i.e., 200 MHz.
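The frequency figures in this example follow directly from f = 1 / t_max, where t_max is the longest register-to-register delay. A quick check of the numbers quoted above (the helper name is illustrative):

```python
# Operating frequency implied by the worst register-to-register delay,
# reproducing the figures quoted above (delays in nanoseconds).
def max_frequency_mhz(worst_delay_ns: float) -> float:
    return 1e3 / worst_delay_ns          # 1 / ns, expressed in MHz

print(max_frequency_mhz(10.0))   # related art, worst wiring delay 10 ns -> 100.0
print(max_frequency_mhz(5.0))    # this application, worst delay 5 ns    -> 200.0
print(max_frequency_mhz(1.0))    # ideal 1 ns delay                      -> 1000.0
```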
In the embodiments of the present application, the first data sorting channel 220 of the N data sorting channels 220 has N+X registers R connected in series; the ith data sorting channel 220 of the N data sorting channels 220 has N+1-i+X registers R connected in series, where X is an integer greater than or equal to zero. The first data sorting channel 220 is the data sorting channel 220 corresponding to the first computing channel 210, and the ith data sorting channel 220 is the data sorting channel 220 corresponding to the ith computing channel 210.
Specifically, as described above, the data path 212 of the first computing channel 210 obtains the input data of the neural network delayed by one clock cycle, the data path 212 of the second computing channel 210 obtains it delayed by two clock cycles, and so on, up to the data path 212 of the Nth computing channel 210, which obtains it delayed by N clock cycles. Therefore, to compensate the differences in the times at which the data paths 212 output their data, the N data sorting channels 220 may each have a plurality of registers R connected in series.
Fig. 2 illustrates an embodiment in which X equals zero. In other words, in the embodiment shown in FIG. 2, the first data sorting channel 220 has N registers R connected in series, the second data sorting channel 220 has N-1 registers R connected in series, the third data sorting channel 220 has N-2 registers R connected in series, and so on, down to the Nth data sorting channel 220, which has one register R. Therefore, input data of the neural network that enters the first computing channel 210 and is output through the first data sorting channel 220 passes through N+1 registers R connected in series; input data that enters the second computing channel 210 and is output through the second data sorting channel 220 also passes through N+1 registers R connected in series; and so on, up to input data that enters the Nth computing channel 210 and is output through the Nth data sorting channel 220, which likewise passes through N+1 registers R connected in series. Since the input data passes through N+1 registers R connected in series on every path through a computing channel 210 and its data sorting channel 220, the data sorting channels 220 compensate the differences in the times at which the data paths 212 output their data.
Fig. 3 illustrates an embodiment in which X equals two. In other words, in the embodiment shown in FIG. 3, the first data sorting channel 220 has N+2 registers R connected in series, the second data sorting channel 220 has N+1 registers R connected in series, the third data sorting channel 220 has N registers R connected in series, and so on, down to the Nth data sorting channel 220, which has three registers R. Compared with the embodiment shown in fig. 2, each data sorting channel 220 has two more registers R connected in series, so the data sorting channels 220 can still compensate the differences in the times at which the data paths 212 output their data.
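The register counts above can also be checked arithmetically: data entering channel i crosses i chained registers R plus N+1-i+X sorting registers, i.e. N+1+X registers for every channel. A small check with illustrative values of N and X (a sketch, not from the patent):

```python
# Verify that every computing channel i sees the same total register count
# N + 1 + X: i registers in the input chain plus N + 1 - i + X registers in
# its data sorting channel. The (N, X) pairs below are illustrative.
for N, X in [(4, 0), (4, 2), (8, 0), (16, 3)]:
    totals = {i + (N + 1 - i + X) for i in range(1, N + 1)}
    assert totals == {N + 1 + X}
    print(f"N={N}, X={X}: every path crosses {N + 1 + X} registers")
```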
In the embodiments of the present application, referring to fig. 2 or fig. 3, the data transmission time between any two adjacent registers R in each data sorting channel 220 of the N data sorting channels 220 is less than one clock cycle.
Specifically, in the neural network hardware accelerator 20, the register R samples its input and updates its output according to the clock signal; that is, data captured by the register R on one clock edge is output on the next clock edge. The data transmission time between two registers R should therefore be less than one clock cycle, so that the transmission time does not lower the operating frequency of the neural network hardware accelerator 20. Accordingly, the wiring length between any two adjacent registers R in each data sorting channel 220 of the N data sorting channels 220 should be such that data transmitted between the two registers R arrives in less than one clock cycle.
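A minimal sketch of this timing constraint, assuming illustrative delay values: every register-to-register path delay must stay below one clock period, otherwise the worst path limits the achievable clock frequency.

```python
# Sketch of the timing constraint stated above (delay values are illustrative,
# not from the patent): each delay between adjacent registers R must be
# shorter than one clock period.
path_delays_ns = [1.2, 3.8, 4.9, 2.4]    # delays between adjacent registers R
clock_period_ns = 5.0                     # e.g. a 200 MHz clock

assert all(d < clock_period_ns for d in path_delays_ns), "a path violates timing"
print(f"worst path {max(path_delays_ns)} ns fits in a {clock_period_ns} ns period")
```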
In the embodiments of the present application, referring to fig. 2 or fig. 3, the sum of the data transmission time between the output end of the data path 212 of the ith computing channel 210 and the input end of the corresponding data sorting channel 220, and the data transmission time between the output end of the register R of the ith computing channel 210 and the input end of the data path 212, is less than one clock cycle. Likewise, the sum of the data transmission time between the output end of the data path 212 of the first computing channel 210 and the input end of the corresponding data sorting channel 220, and the data transmission time between the output end of the register R of the first computing channel 210 and the input end of the data path 212, is less than one clock cycle.
Specifically, for any computing channel 210 and its corresponding data sorting channel 220, the data transmission time between the register R of the computing channel 210 and the first register R of the data sorting channel 220 is the sum of the data transmission time between the output end of the data path 212 of the computing channel 210 and the input end of the corresponding data sorting channel 220, and the data transmission time between the output end of the register R of the computing channel 210 and the input end of the data path 212. As noted above, the data transmission time between two registers R must be less than one clock cycle so that it does not lower the operating frequency of the neural network hardware accelerator 20. Therefore, for each computing channel 210, the sum of the data transmission time between the output end of its data path 212 and the input end of the corresponding data sorting channel 220, and the data transmission time between the output end of its register R and the input end of its data path 212, is less than one clock cycle. Here, "each" covers the first computing channel 210 and the ith computing channel 210 for every i from 2 to N.
In the embodiments of the present application, referring to fig. 2 or fig. 3, the data transmission time between the input end of the register R of the ith computing channel 210 and the output end of the register R of the (i-1)th computing channel 210 is less than one clock cycle.
Specifically, in any two connected computing channels 210, the data transmission time between the two connected registers R is less than one clock cycle, which prevents the transmission of data from the output end of the register R of the (i-1)th computing channel 210 to the input end of the register R of the ith computing channel 210 from lowering the operating frequency of the neural network hardware accelerator 20. Therefore, in the embodiments of the present application, the data transmission time between the input end of the register R of the ith computing channel 210 and the output end of the register R of the (i-1)th computing channel 210 is less than one clock cycle.
In the embodiments of the present application, referring to fig. 2 or fig. 3, the output ends of the N data sorting channels 220 are connected together.
Specifically, the output ends of the N data sorting channels 220 are connected to the output bus 230, so that the N data sorting channels 220 can output data to the output bus 230 simultaneously. In general, the bit width of the output bus 230 may be equal to or greater than the sum of the bit widths of the data output by the N data sorting channels 220, where bit width refers to the amount of data per transfer. For example, if N in the neural network hardware accelerator 20 is 32, the neural network hardware accelerator 20 includes 32 computing channels 210 and 32 data sorting channels 220. If the bit width of the data output by each computing channel 210 to its data sorting channel 220 is 16 bits, the bit width of the data output by each data sorting channel 220 to the output bus 230 is also 16 bits. In that case, the bit width of the output bus 230 is at least 512 bits.
When the bit width of the output bus 230 is 512 bits, suppose the bus comprises bit 0 to bit 511. The data output by the data sorting channel 220 corresponding to the first computing channel 210 may occupy bit 0 to bit 15 of the output bus 230; the data output by the data sorting channel 220 corresponding to the second computing channel 210 may occupy bit 16 to bit 31; and so on, up to the data output by the data sorting channel 220 corresponding to the thirty-second computing channel 210, which may occupy bit 496 to bit 511. The output bus 230 thus transmits the data output by the N data sorting channels 220 simultaneously.
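The bit-slot layout in this example can be written down mechanically. The snippet below reproduces the N = 32, 16-bit case; the slot-per-channel layout is as described above, while the loop and names are illustrative.

```python
# Bit-slot layout on the output bus for N data sorting channels of W bits
# each, reproducing the N = 32, W = 16 example above.
N, W = 32, 16
bus_width = N * W                                   # 512 bits in this example
for ch in (1, 2, 32):
    lo, hi = (ch - 1) * W, ch * W - 1
    print(f"channel {ch:2d}: bit {lo:3d} to bit {hi:3d} of the {bus_width}-bit bus")
# channel  1: bit   0 to bit  15
# channel  2: bit  16 to bit  31
# channel 32: bit 496 to bit 511
```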
Based on a similar inventive concept, fig. 4 is a schematic structural diagram of another neural network hardware accelerator 20 provided in the embodiments of the present application. Referring to FIG. 4, the neural network hardware accelerator 20 includes N computing channels 210 and N-1 data sorting channels 220, where N is an integer greater than or equal to 2.
Specifically, each computing channel 210 of the N computing channels 210 obtains the input data of the neural network and performs calculations on it. Each computing channel 210 of the N computing channels 210 includes a register R and a data path 212. In each computing channel 210, the output end of the register R is connected to the input end of the data path 212, so that within the computing channel 210 data is transferred from the register R to the data path 212. The register R delays the input data of the neural network before outputting it, and the delay may be one clock cycle. In other words, after the register R obtains the input data of the neural network, it delays the data by one clock cycle and then transmits it to the data path 212. The data path 212 performs calculations on the input data. Generally, different data paths 212 use different calculation parameters when calculating on the input data of the neural network. For example, each data path 212 of the N computing channels 210 may perform a convolution calculation on the input data of the neural network, with different convolution kernel weights configured in each data path 212. The N computing channels 210 may compute on the input data in parallel.
Among the N computing channels 210, the input end of the register R of the first computing channel 210 receives the input data of the neural network. "First" here merely distinguishes this channel from the "ith" channel described below and does not limit the position of the computing channel 210. For ease of understanding, in the embodiment shown in FIG. 4, the first computing channel 210 is the computing channel 210 located in the row numbered "(1)". The input end of the register R of the first computing channel 210 may be connected to the input 202 of the neural network so as to receive the input data of the neural network. Among the N computing channels 210, the input end of the register R of the ith computing channel 210 is connected to the output end of the register R of the (i-1)th computing channel 210, where i is an integer greater than or equal to 2 and less than or equal to N. In other words, the input end of the register R of the second computing channel 210 is connected to the output end of the register R of the first computing channel 210; the input end of the register R of the third computing channel 210 is connected to the output end of the register R of the second computing channel 210; and the input end of the register R of the Nth computing channel 210 is connected to the output end of the register R of the (N-1)th computing channel 210. In the embodiment shown in FIG. 4, the second computing channel 210 is located in the row numbered "(2)"; the third computing channel 210 is located in the row numbered "(3)"; the (N-1)th computing channel 210 is located in the row numbered "(N-1)"; and the Nth computing channel 210 is located in the row numbered "(N)". In the embodiments of the present application, i takes each value from 2 to N in turn.
Among the N computing channels 210, the first N-1 computing channels 210 correspond one to one to the N-1 data sorting channels 220. The output end of the data path 212 of each of the first N-1 computing channels 210 is connected to the input end of the corresponding data sorting channel 220, so that after each of these data paths 212 finishes calculating on the input data, it outputs the result to the corresponding data sorting channel 220. The data sorting channel 220 delays the data input to it. In general, the data sorting channel 220 delays the input data by at least one clock cycle before outputting it, so that the output ends of the N-1 data sorting channels 220 output data simultaneously. In the embodiments of the present application, the data path 212 of the first computing channel 210 obtains the input data of the neural network delayed by one clock cycle, through the register R of the first computing channel 210. The data path 212 of the second computing channel 210 obtains the input data delayed by two clock cycles, through the registers R of the first and second computing channels 210, and so on; the data path 212 of the (N-1)th computing channel 210 obtains the input data delayed by N-1 clock cycles, and the data path 212 of the Nth computing channel 210 obtains it delayed by N clock cycles. Therefore, the N-1 data sorting channels 220 need to compensate the differences in the times at which the first N-1 data paths 212 output their data, so that the output ends of the N-1 data sorting channels 220 and the output end of the data path 212 of the Nth computing channel 210 of the N computing channels 210 output data simultaneously.
In this embodiment, when the neural network hardware accelerator operates, the input data of the neural network is applied to the input end of the register R of the first computing channel 210. The input data passes through the register R of the first computing channel 210 and is output to the data path 212 of the first computing channel 210 and to the register R of the second computing channel 210. After the register R of the second computing channel 210 obtains the input data, it transmits the data to the data path 212 of the second computing channel 210 and to the register R of the third computing channel 210, and so on, until the register R of the Nth computing channel 210 obtains the input data and transmits it to the data path 212 of the Nth computing channel 210. Each data path 212 performs calculations on the input data. After the data path of each of the first N-1 computing channels 210 finishes its calculation, it outputs the result to the corresponding data sorting channel 220. The N-1 data sorting channels 220 delay the data output by the first N-1 computing channels 210, so that the N-1 data sorting channels 220 and the Nth computing channel 210 output data simultaneously. Through the delayed output of the registers R, the neural network hardware accelerator 20 shortens the time span over which it inputs and outputs data. As the time span over which the neural network hardware accelerator 20 outputs data becomes shorter, its operating frequency becomes higher and its peak computing power becomes higher.
For example, in the related art, if the data transmission time between the data path 110 closest to the input 102 of the neural network and the input 102 is 1 ns, and the data transmission time between the data path 110 farthest from the input 102 and the input 102 is 10 ns, the operating frequency of the neural network accelerator 10 is 1/10 ns, i.e., 100 MHz. In the present application, the input 202 of the neural network is connected, through the register R of the first computing channel 210, only to the data path 212 of the first computing channel 210 and to the register R of the second computing channel 210; the output end of the register R of the second computing channel 210 is connected only to the data path 212 of the second computing channel 210 and to the register R of the third computing channel 210, and so on. In the neural network hardware accelerator 20, the register R in each computing channel 210 drives only two downstream inputs, so the maximum fan-out (the number of inputs driven by a single register R carrying the input data) of the neural network hardware accelerator 20 is reduced from N to 2. At the same time, inserting the registers R into the wiring that carries the data reduces the delay that the wiring adds to data transmission. In the present application, if the maximum delay between any two registers R is 5 ns, the operating frequency of the neural network hardware accelerator 20 is 1/5 ns, i.e., 200 MHz.
In the embodiments of the present application, the first data sorting channel 220 of the N-1 data sorting channels 220 has N-1 registers R connected in series; the jth data sorting channel 220 of the N-1 data sorting channels 220 has N-j registers R connected in series. The first data sorting channel 220 is the data sorting channel 220 corresponding to the first computing channel 210, and the jth data sorting channel 220 is the data sorting channel 220 corresponding to the jth computing channel 210. j is an integer greater than or equal to zero and less than or equal to N-1.
Specifically, as described above, the data path 212 of the first computing channel 210 obtains the input data of the neural network delayed by one clock cycle, the data path 212 of the second computing channel 210 obtains it delayed by two clock cycles, and so on, up to the data path 212 of the Nth computing channel 210, which obtains it delayed by N clock cycles. Therefore, to compensate the differences in the times at which the first N-1 computing channels 210 output their data, the N-1 data sorting channels 220 may each have a plurality of registers R connected in series.
As shown in FIG. 4, in this embodiment the (N-1)th data sorting channel 220 has one register R, the (N-2)th data sorting channel 220 has two registers R connected in series, and so on, up to the first data sorting channel 220, which has N-1 registers R connected in series. Therefore, input data of the neural network that enters the first computing channel 210 and is output through the first data sorting channel 220 passes through N registers R connected in series; input data that enters the second computing channel 210 and is output through the second data sorting channel 220 also passes through N registers R connected in series; and so on. Input data that enters the Nth computing channel 210 passes through the N chained registers R before reaching its data path 212, so it too passes through N registers R. Since every path from the input to an output passes through N registers R connected in series, the data sorting channels 220 compensate the differences in the times at which the data paths 212 output their data, as checked in the sketch below.
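A small arithmetic check of this embodiment, with an illustrative value of N (a sketch, not from the patent):

```python
# Check for this embodiment: channel j (j = 1 .. N-1) crosses j chained
# registers plus N - j sorting registers, and channel N crosses the N chained
# registers with no sorting registers, so every path sees N registers in total.
N = 8
totals = [j + (N - j) for j in range(1, N)] + [N]   # last entry: Nth channel
assert set(totals) == {N}
print(f"all {N} channels pass through {N} registers before the output bus")
```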
In the embodiments of the present application, referring to fig. 4, the data transmission time between any two adjacent registers R in each of the N-1 data sorting channels 220 is less than one clock cycle, so that the transmission time does not lower the operating frequency of the neural network hardware accelerator 20.
Specifically, as described above, in the neural network hardware accelerator 20 the register R samples its input and updates its output according to the clock signal; that is, data captured by the register R on one clock edge is output on the next clock edge. The data transmission time between two registers R should therefore be less than one clock cycle, so that it does not lower the operating frequency of the neural network hardware accelerator 20. Accordingly, the wiring length between any two adjacent registers R in each data sorting channel 220 of the N-1 data sorting channels 220 should be such that data transmitted between the two registers R arrives in less than one clock cycle.
In the embodiments of the present application, referring to fig. 4, the sum of the data transmission time between the output end of the data path 212 of the jth computing channel 210 and the input end of the corresponding data sorting channel 220, and the data transmission time between the output end of the register R of the jth computing channel 210 and the input end of the data path 212, is less than one clock cycle, where j takes each value from 2 to N-1. Likewise, the sum of the data transmission time between the output end of the data path 212 of the first computing channel 210 and the input end of the corresponding data sorting channel 220, and the data transmission time between the output end of the register R of the first computing channel 210 and the input end of the data path 212, is less than one clock cycle. In other words, for any computing channel 210 and its corresponding data sorting channel 220, the data transmission time between the register R of the computing channel 210 and the first register R of the data sorting channel 220 is less than one clock cycle.
In the embodiments of the present application, referring to fig. 4, the data transmission time between the input end of the register R of the ith computing channel 210 and the output end of the register R of the (i-1)th computing channel 210 is less than one clock cycle. In other words, in any two connected computing channels 210, the data transmission time between the two connected registers R is less than one clock cycle, which prevents the transmission of data from the output end of the register R of the (i-1)th computing channel 210 to the input end of the register R of the ith computing channel 210 from lowering the operating frequency of the neural network hardware accelerator 20.
In the embodiment of the present application, referring to FIG. 4, the output terminals of the N-1 data sorting channels 220 are connected to the output terminal of the data path 212 of the Nth computing channel 210.
Specifically, the output ends of the N-1 data sorting channels 220 and the output end of the data path 212 of the Nth computing channel 210 may each be connected to the output bus 230, so that they are connected together through the output bus 230. In general, the bit width of the output bus 230 may be equal to or greater than the sum of the bit widths of the data output by the N computing channels 210. For example, if N in the neural network hardware accelerator 20 is 32, the neural network hardware accelerator 20 includes 32 computing channels 210 and 31 data sorting channels 220. If the bit width of the data output by each of the first thirty-one computing channels 210 to its data sorting channel 220 is 16 bits, the bit width of the data output by each data sorting channel 220 to the output bus 230 is also 16 bits, and the thirty-second computing channel 210 likewise outputs 16-bit data to the output bus 230. In that case, the bit width of the output bus 230 is at least 512 bits. When the bit width of the output bus 230 is 512 bits, suppose the bus comprises bit 0 to bit 511. The data output by the data sorting channel 220 corresponding to the first computing channel 210 may occupy bit 0 to bit 15; the data output by the data sorting channel 220 corresponding to the second computing channel 210 may occupy bit 16 to bit 31; and so on, up to the data output by the data sorting channel 220 corresponding to the thirty-first computing channel 210, which may occupy bit 480 to bit 495. The data output by the thirty-second computing channel 210 may occupy bit 496 to bit 511. The output bus 230 thus transmits the data output by the N-1 data sorting channels 220 and the data output by the Nth computing channel 210 simultaneously.
Further, so that the data transmission time does not lower the operating frequency of the neural network hardware accelerator 20, the data transmission time between the output end of the data path 212 of the Nth computing channel 210 and the output bus 230 is less than one clock cycle.
The above embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, some technical features may be replaced by equivalents, or some technical features may be combined; such modifications, substitutions, and combinations do not depart from the spirit and scope of the technical solutions of the embodiments of the present application and should be construed as falling within the scope of the appended claims.

Claims (10)

1. A neural network hardware accelerator, comprising: n computing channels (210) and N data sorting channels (220), wherein N is an integer greater than or equal to 2, and the N computing channels (210) and the N data sorting channels (220) are in one-to-one correspondence;
each compute channel (210) of the N compute channels (210) includes a register R and a data path (212), an output of the register R being connected to an input of the data path (212);
the input end of the register R of the first computing channel (210) in the N computing channels (210) is used for inputting input data of a neural network, the input end of the register R of the ith computing channel (210) in the N computing channels (210) is connected with the output end of the register R of the (i-1) th computing channel (210), and i is an integer which is greater than or equal to 2 and less than or equal to N;
the output end of the data path (212) of each computing channel (210) in the N computing channels (210) is connected with the input end of the corresponding data sorting channel (220), and each data sorting channel (220) in the N data sorting channels (220) is used for delaying the input data for at least one clock cycle and then outputting the delayed data, so that the output ends of the N data sorting channels (220) simultaneously output the data.
2. The neural network hardware accelerator of claim 1, wherein a first data sorting channel (220) of the N data sorting channels (220) has N+X registers R connected in series; the ith data sorting channel (220) of the N data sorting channels (220) has N+1-i+X registers R connected in series; and X is an integer greater than or equal to zero.
3. The neural network hardware accelerator of claim 2, wherein a data transmission time between any two adjacent registers R in each data sorting channel (220) of the N data sorting channels (220) is less than one clock cycle.
4. The neural network hardware accelerator according to any one of claims 1 to 3, wherein the sum of the transmission time of data between the output of the data path (212) of the ith computing channel (210) and the input of the corresponding data sorting channel (220), and the transmission time of data between the output of the register R of the ith computing channel (210) and the input of the data path (212), is less than one clock cycle;
the sum of the transmission time of data between the output of the data path (212) of the first computing channel (210) and the input of the corresponding data sorting channel (220), and the transmission time of data between the output of the register R of the first computing channel (210) and the input of the data path (212), is less than one clock cycle.
5. The neural network hardware accelerator according to any one of claims 1 to 3, wherein the data transmission time between the input of the register R of the ith computing channel (210) and the output of the register R of the (i-1)th computing channel (210) is less than one clock cycle.
6. The neural network hardware accelerator of claim 1, wherein the outputs of the N data sorting channels (220) are connected.
7. A neural network hardware accelerator, comprising: n computing channels (210) and N-1 data sorting channels (220), wherein N is an integer greater than or equal to 2, and the first N-1 computing channels (210) in the N computing channels (210) correspond to the N-1 data sorting channels (220) one by one;
each compute channel (210) of the N compute channels (210) includes a register R and a data path (212), an output of the register R being connected to an input of the data path (212);
the input end of the register R of the first computing channel (210) in the N computing channels (210) is used for inputting input data of a neural network, the input end of the register R of the ith computing channel (210) in the N computing channels (210) is connected with the output end of the register R of the (i-1) th computing channel (210), and i is an integer which is greater than or equal to 2 and less than or equal to N;
the output end of the data path (212) of each computing channel (210) in the first N-1 computing channels (210) is connected with the input end of the corresponding data sorting channel (220), and each data sorting channel (220) in the N-1 data sorting channels (220) is used for delaying the input data for at least one clock cycle and then outputting the delayed data, so that the output ends of the N-1 data sorting channels (220) and the output end of the data path (212) of the Nth computing channel (210) in the N computing channels (210) output data simultaneously.
8. The neural network hardware accelerator of claim 7, wherein a first data sorting channel (220) of the N-1 data sorting channels (220) has N-1 registers R connected in series; the jth data sorting channel (220) of the N-1 data sorting channels (220) has N-j registers R connected in series, and j is an integer greater than or equal to zero and less than or equal to N-1.
9. The neural network hardware accelerator of claim 7 or 8, wherein the outputs of the N-1 data sorting channels (220) are connected to the output of the data path (212) of the Nth computing channel (210).
10. The neural network hardware accelerator of claim 9, further comprising: an output bus (230);
the output bus (230) is connected with the output ends of the N-1 data sorting channels (220), and the output bus (230) is connected with the output end of the data path (212) of the Nth computing channel (210);
the data transfer time between the output of the data path (212) of the nth compute channel (210) and the output bus (230) is less than one clock cycle.
CN202011594118.3A 2020-12-29 2020-12-29 Neural network hardware accelerator Pending CN112734018A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011594118.3A CN112734018A (en) 2020-12-29 2020-12-29 Neural network hardware accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011594118.3A CN112734018A (en) 2020-12-29 2020-12-29 Neural network hardware accelerator

Publications (1)

Publication Number Publication Date
CN112734018A true CN112734018A (en) 2021-04-30

Family

ID=75607530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011594118.3A Pending CN112734018A (en) 2020-12-29 2020-12-29 Neural network hardware accelerator

Country Status (1)

Country Link
CN (1) CN112734018A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103176931A (en) * 2011-12-26 2013-06-26 安凯(广州)微电子技术有限公司 Improved DMA communication method and improved DMA communication device
CN108268683A (en) * 2016-12-31 2018-07-10 炬芯(珠海)科技有限公司 Transmit the method, apparatus and chip of signal
CN110956258A (en) * 2019-12-17 2020-04-03 深圳鲲云信息科技有限公司 Neural network acceleration circuit and method
CN112104435A (en) * 2020-08-28 2020-12-18 新华三技术有限公司 Clock delay compensation method, logic device and network equipment



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210430