CN111967587A - Arithmetic unit array structure for neural network processing - Google Patents

Arithmetic unit array structure for neural network processing

Info

Publication number
CN111967587A
CN111967587A
Authority
CN
China
Prior art keywords
operation unit
unit module
module
local bus
intermediate result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010728621.7A
Other languages
Chinese (zh)
Other versions
CN111967587B (en)
Inventor
韩军 (Han Jun)
张权 (Zhang Quan)
张永亮 (Zhang Yongliang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN202010728621.7A priority Critical patent/CN111967587B/en
Publication of CN111967587A publication Critical patent/CN111967587A/en
Application granted granted Critical
Publication of CN111967587B publication Critical patent/CN111967587B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/3012Organisation of register space, e.g. banked or distributed register file
    • G06F9/30134Register stacks; shift registers

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to an arithmetic unit array structure for neural network processing, composed of arithmetic unit modules and local bus modules. A single arithmetic unit module completes a one-dimensional convolution; the local bus module passes intermediate results upward, where they are accumulated to complete the two-dimensional convolution, which reduces write-back of intermediate results and improves the overall energy efficiency ratio of the system. Each arithmetic unit module contains several register files and can perform the "super-two-dimensional" convolution of several convolution kernels simultaneously, further raising data reuse and reducing intermediate-result write-back. The array is self-organizing: it receives control signals from the top level, each local bus module automatically computes, from the ID configuration of adjacent arithmetic units, the spatial position the current unit occupies in the two-dimensional convolution, and the units then autonomously handle data reception, transmission, and the associated operations. The invention improves computational efficiency in neural network processing.

Description

Arithmetic unit array structure for neural network processing
Technical Field
The invention belongs to the technical field of integrated circuit design, and particularly relates to an arithmetic unit array structure for neural network processing.
Background
Deep convolutional neural networks are widely applied in important fields such as computer vision, speech recognition, and robot control, but these applications also place ever higher demands on the precision and complexity of neural network algorithms, so their implementation faces a series of challenging problems. Traditional processor architectures have made some progress, but they suffer from low data reuse and a poor energy efficiency ratio caused by the way data is communicated directly between arithmetic units. To alleviate these problems, researchers have in recent years designed spatial processor architectures based on array parallelism; combined with a suitable dataflow strategy, such architectures can significantly improve the data reuse and operation speed of neural network algorithms.
Convolution is the most basic operation in a neural network algorithm, and today's deep convolutional neural networks generally require convolution operations with a huge amount of computation. A convolution is a tensor operation described by a mathematical expression of the standard form

O[m][x][y] = Σ_c Σ_i Σ_j I[c][S·x + i][S·y + j] × W[m][c][i][j]

where I is the input feature map, W[m] is the weight tensor of convolution kernel m, S is the stride, and the sums run over the input channels c and the kernel window positions (i, j).
The key to its realization is the multiply-accumulate of the weights of multiple convolution kernels with the values of the input feature map. If the expression is solved directly as written, then as the complexity of neural network algorithms and the amount of computed data grow, the direct method must frequently read and write data from external storage, which greatly lowers the energy efficiency ratio of the system. The alternative is to adopt a suitable dataflow strategy that keeps one data type stationary and reduces the number of data reads and writes. An adapted dataflow can select the appropriate level of the storage hierarchy for each access and thereby minimize the energy consumed by memory accesses. An arithmetic unit array matched to the dataflow strategy is a common hardware implementation; it lends itself to the implementation of input and output buses and so greatly improves data transmission efficiency. Common dataflow strategies include Weight Stationary (WS), Output Stationary (OS), and Row Stationary (RS). A weight-stationary dataflow improves the reuse of weights only, and an output-stationary dataflow reduces the reads and writes of intermediate results only, whereas a row-stationary dataflow both improves the reuse of all three data types and reduces their read and write counts. Moreover, with a row-stationary dataflow, several convolution kernels can be brought into an arithmetic unit for "super-two-dimensional" operation, further improving data reuse and reducing the write-back frequency of intermediate results. This design adopts a row-stationary dataflow strategy and uses a two-dimensional arithmetic unit array structure to implement the neural network algorithm.
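To make the row-stationary mapping concrete, the following Python sketch (illustrative only, not part of the original disclosure; all names are invented) shows how a single arithmetic unit computes a one-dimensional convolution of one filter row over one input row, and how a vertical column of such units sums its partial results into one row of the two-dimensional output without writing intermediate results back to memory.

```python
# Illustrative sketch of the row-stationary (RS) mapping described above.
# Each arithmetic unit holds one filter row and one input row ("row fixed"),
# computes a 1D convolution, and a vertical column of units accumulates the
# partial sums into one output row. All names are illustrative, not taken
# from the patent.

def conv1d_row(input_row, filter_row, stride=1):
    """One arithmetic unit: 1D convolution of a filter row over an input row."""
    k = len(filter_row)
    out_w = (len(input_row) - k) // stride + 1
    return [sum(input_row[x * stride + i] * filter_row[i] for i in range(k))
            for x in range(out_w)]

def conv2d_output_row(input_rows, filter_rows, stride=1):
    """A column of R units (R = kernel height): unit j convolves filter row j
    with input row j; the partial sums are accumulated upward to yield one
    output row, so no intermediate result is written back to memory."""
    partial_sums = [conv1d_row(ir, fr, stride)
                    for ir, fr in zip(input_rows, filter_rows)]
    return [sum(col) for col in zip(*partial_sums)]

# Example: 3x3 kernel, input rows 0..2 produce output row 0.
inp = [[1, 2, 3, 4, 5], [0, 1, 0, 1, 0], [2, 2, 2, 2, 2]]
flt = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]
print(conv2d_output_row(inp, flt))  # one row of the 2D convolution: [-2, -2, -2]
```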
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide an arithmetic unit array structure for neural network processing that adopts a row-stationary dataflow strategy and a two-dimensional arithmetic unit array arrangement and can maximize the degree of data reuse.
the invention provides an arithmetic unit array structure facing neural network processing, wherein: the system comprises an arithmetic unit module and a local bus module, wherein the input end of the local bus module is connected with an ID value;
the operation unit module is divided into a top operation unit module, a middle operation unit module and a bottom operation unit module, and the local bus module is positioned in the vertical direction of the operation unit module; the operation unit module consists of a state machine, a register file and a multiply-accumulate unit module, wherein the register file comprises an input excitation register file, a weight register file and an intermediate result register file, the data request input end of the state machine is connected with the input end of the register file, and the register file and the multiply-accumulate unit are in bidirectional interaction;
the operation unit module is the most basic unit for completing convolution operation, is responsible for receiving input data from the input local bus module, completes one-dimensional convolution operation, sends an intermediate result to the local bus module for upward transmission and completes accumulation of the intermediate result according to different array positions of the operation unit module, and finally obtains output excitation;
the local bus module is responsible for unidirectional interaction of intermediate results between adjacent operation units in the vertical direction and calculates the spatial position of the operation units according to the ID values of the adjacent operation unit modules;
the local bus module obtains a group of enabling signals according to the ID value of the operation unit module in the vertical direction, and feeds the enabling signals back to the operation unit module connected with the local bus module, so that the spatial position of each operation unit module is calculated. According to the enabling signal, the operation unit module reads input data and carries out a convolution operation unit to obtain an intermediate result of one-dimensional convolution operation; then the operation unit module at the bottom transmits the middle result to the middle operation unit module along the local bus, the middle operation unit module accumulates the middle result and the input middle result, and transmits the accumulated middle result to the operation unit module at the top; the top arithmetic unit module accumulates the intermediate result of the top arithmetic unit module and the intermediate result from the storage system and the bottom arithmetic unit module to finally obtain the output excitation.
In the invention, each arithmetic unit module contains register files for the three data types, forming the lowest storage level of the processor architecture. Input data is held stationary in these registers, which raises data reuse and reduces the number of input data reads and writes, thereby lowering the power consumed by memory accesses. The intermediate results of the multiply-accumulate operations are also buffered there, reducing their read and write counts and further improving the energy efficiency ratio of the system.
The beneficial effects of the invention are: the design adopts a row-stationary dataflow strategy, which avoids write-back of intermediate results during the two-dimensional convolution, reduces the number of intermediate-result reads and writes, and achieves a high energy efficiency ratio. At the same time, the register file storage hierarchy introduced in the arithmetic unit implementation not only keeps the function correct but further raises data reuse and lowers the write-back frequency of intermediate results, improving the system's energy efficiency ratio even more.
Drawings
Fig. 1 is a basic block diagram of an arithmetic unit array structure.
Fig. 2 is a block diagram of the arithmetic unit.
Fig. 3 is a layout diagram of the spatial positions of the arithmetic units.
Fig. 4 is a schematic diagram of the window-sliding data rearrangement inside an arithmetic unit, where: (a) is the input excitation register of the arithmetic unit, and (b) is the shift operation on the input excitation register.
Reference numbers in the figures: 1 is the arithmetic unit module, 2 is the local bus module, 3 is the register file, 4 is the input excitation register file, 5 is the weight register file, 6 is the intermediate result register file, 7 is the state machine, 8 is the multiply-accumulate unit module, 9 is the top arithmetic unit module, 10 is the middle arithmetic unit module, and 11 is the bottom arithmetic unit module.
Detailed Description
The invention is further illustrated by the following examples in conjunction with the accompanying drawings.
Example 1: the basic block diagram of the arithmetic unit array structure is shown in Fig. 1. The working process of the design is as follows. First, the local bus module 2 derives a group of enable signals from the ID values of the arithmetic unit modules 1 in the vertical direction and feeds them back to the arithmetic unit modules 1 connected to it, thereby computing the spatial position of each arithmetic unit module 1. Driven by the enable signals, each arithmetic unit module 1 reads its input data and performs the convolution, obtaining the intermediate result of a one-dimensional convolution; the bottom arithmetic unit module 11 then passes its intermediate result along the local bus to the middle arithmetic unit module 10, which accumulates it with its own intermediate result and passes the sum on to the top arithmetic unit module 9; the top arithmetic unit module 9 accumulates its own intermediate result with those from the storage system and from the units below, finally obtaining the output excitation.
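The partial-sum flow of Example 1 can be summarized in a short sketch (illustrative only; names invented): the bottom unit starts the accumulation chain, middle units add and forward, and the top unit folds in the channel-direction intermediate result from the storage system before emitting the output excitation.

```python
# Sketch of the vertical partial-sum flow of Example 1 (names illustrative):
# the bottom unit sends its 1D-convolution result up the local bus, middle
# units add their own result and forward, and the top unit additionally adds
# the intermediate result fetched from the storage system (channel direction)
# before producing the output excitation.

def column_accumulate(unit_psums, psum_from_memory):
    """unit_psums[0] is the bottom unit's 1D result, unit_psums[-1] the top's.
    Each entry is a list of partial sums for one output row."""
    acc = unit_psums[0]                          # bottom: starts the chain
    for psum in unit_psums[1:]:                  # middle/top: accumulate
        acc = [a + p for a, p in zip(acc, psum)]
    # the top unit also folds in the channel-direction psum from memory
    return [a + m for a, m in zip(acc, psum_from_memory)]

out = column_accumulate([[1, 2], [3, 4], [5, 6]], psum_from_memory=[10, 10])
print(out)  # [19, 22] -> output excitations for one row
```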
The structure of the arithmetic unit module 1 is shown in Fig. 2. The arithmetic unit module 1 has six FIFO interfaces: the read FIFOs correspond to the input excitation, the weights, the intermediate result in the channel direction, and the intermediate result from the bottom arithmetic unit module 11, while the write FIFOs correspond to the intermediate result transmitted upward and the intermediate result written back to the storage system. Together these FIFOs form the data transmission channel between the arithmetic unit module 1 and the local bus module 2. The enable signals from the local bus module 2 control the reading and writing of the intermediate result FIFOs by the arithmetic unit module 1, and different combinations of enable signals realize the relative position of the arithmetic unit module 1 within the spatial array of the software model; Fig. 3 lists the combinations of enable signals and the corresponding spatial positions of the arithmetic units. Note that any enable combination other than those listed in Fig. 3 is illegal. Register files 3 for the three data types are introduced into the arithmetic unit module 1 to form the lowest level of the storage system; they raise data reuse, reduce the read and write counts of higher storage levels, and lower the power overhead of data accesses. Table 1 lists the data bit widths and depths of the three register files inside the convolution arithmetic unit. The depth of the input excitation register file 4 is 12, equal to the height of the convolution arithmetic array, that is, the maximum convolution kernel size and the maximum window width of each convolution; it can be enlarged according to design requirements. The depth of the weight register file 5 is 224; this comparatively large value is needed because several convolution kernels must be stored simultaneously to support the super-two-dimensional operation inside the arithmetic unit. The depth of the intermediate result register file 6 is 24, each entry representing an output excitation in a different channel direction; this value limits the maximum number of convolution kernels that can be held stationary inside the one-dimensional convolution unit. The data transmission paths and the internal storage hierarchy of the convolution arithmetic unit provide the data guarantees for the operation. The internal workflow is driven by the state machine 7, whose transitions complete the one-dimensional convolution and the data reads in sequence, where: IDLE is the idle state; SET_CONFIG reads the configuration parameters, including the line width of the input excitation, the line width of the convolution kernel, the horizontal stride of the convolution, and the number of convolution kernels for super-two-dimensional operation; READ_FILTER reads the corresponding number of weights from the FIFO into the internal register file according to the configuration parameters; READ_WINDOW_DATA is responsible for reading input excitations: on the first entry into this state, a full convolution-kernel line width of input excitations is read, and on subsequent entries a number of input excitations determined by the stride parameter is read and placed into the internal register file. Note that the data read as the window slides by the stride overwrites the data discarded by the slide, increasing the reuse of input excitations; but the slide changes the relative positions of the input excitations in the register file, so a shifter is introduced inside the register to rearrange the window data back to the correct positions. The rearrangement is shown in Fig. 4 (a small sketch of it follows this paragraph). The DATA & ALU state sends data from the internal registers into the arithmetic unit for the multiply-accumulate operations. According to the classification of the arithmetic unit modules 1, the bottom arithmetic unit module 11 writes its intermediate result back to the FIFO for upward transmission, while the middle arithmetic unit modules 10 and the top arithmetic unit modules 9 write the result temporarily back to the register file 3. PSUM_READ_MEM & ACCUM_PSUM_MEM accumulates the intermediate results along the channel direction inside the arithmetic unit; this state is entered only when the unit is at the top, and the jump is illegal for the other unit types.
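The window-sliding rearrangement of Fig. 4 can be pictured with the following sketch (an assumed rendering of the described behavior, not code from the disclosure): with stride S, the S oldest entries of the input excitation register file are discarded, the survivors are shifted toward the front so their relative positions remain correct, and S freshly read excitations fill the tail.

```python
# Sketch of the window-sliding rearrangement of Fig. 4 (assumed behavior):
# with stride S, the S oldest entries of the input excitation register file
# are dropped, surviving entries are shifted to the front so their relative
# positions stay correct, and S newly read excitations fill the tail.

def slide_window(reg_file, new_inputs, stride):
    assert len(new_inputs) == stride
    shifted = reg_file[stride:]          # shifter: move survivors forward
    return shifted + new_inputs          # fresh data appended at the tail

window = [10, 11, 12]                    # one kernel width of excitations
window = slide_window(window, [13], stride=1)
print(window)                            # [11, 12, 13]
window = slide_window(window, [14, 15], stride=2)
print(window)                            # [13, 14, 15]
```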
TABLE 1
Register file                          Data bit width   Depth
Input excitation register file (4)     INT8             12
Weight register file (5)               INT8             224
Intermediate result register file (6)  INT32            24
PSUM_READ_LOCAL & ACCUM_PSUM_LOCAL reads an intermediate result from the FIFO and accumulates it; this state is entered only on the top and middle arithmetic units. FINISH marks the end of the current operation and resets the relevant registers.
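Putting the states together, the sketch below models the state machine flow described above, including the position-dependent jumps (PSUM_READ_LOCAL only on top and middle units, PSUM_READ_MEM only at the top). Transition details beyond those stated in the description, such as when the loop back to READ_WINDOW_DATA ends, are assumptions made for illustration.

```python
# Hedged sketch of the arithmetic unit's state machine flow described above.
# State names follow the description; transitions beyond those stated
# (e.g., loop termination) are assumptions for illustration.

from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    SET_CONFIG = auto()          # read line widths, stride, kernel count
    READ_FILTER = auto()         # load weights into the weight register file
    READ_WINDOW_DATA = auto()    # load/slide the input excitation window
    DATA_ALU = auto()            # multiply-accumulate in the MAC unit
    PSUM_READ_LOCAL = auto()     # top/middle only: take psum from the FIFO
    PSUM_READ_MEM = auto()       # top only: accumulate channel-direction psum
    FINISH = auto()              # reset the relevant registers

def next_state(state, position, window_done):
    if state is State.IDLE:
        return State.SET_CONFIG
    if state is State.SET_CONFIG:
        return State.READ_FILTER
    if state is State.READ_FILTER:
        return State.READ_WINDOW_DATA
    if state is State.READ_WINDOW_DATA:
        return State.DATA_ALU
    if state is State.DATA_ALU:
        if position in ("TOP", "MIDDLE"):
            return State.PSUM_READ_LOCAL         # illegal for BOTTOM units
        return State.FINISH if window_done else State.READ_WINDOW_DATA
    if state is State.PSUM_READ_LOCAL:
        if position == "TOP":
            return State.PSUM_READ_MEM           # illegal for MIDDLE units
        return State.FINISH if window_done else State.READ_WINDOW_DATA
    if state is State.PSUM_READ_MEM:
        return State.FINISH if window_done else State.READ_WINDOW_DATA
    return State.IDLE

print(next_state(State.DATA_ALU, "BOTTOM", window_done=True))  # State.FINISH
```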

Claims (1)

1. An arithmetic unit array structure for neural network processing, characterized in that: the structure comprises arithmetic unit modules and local bus modules, and the input end of each local bus module is connected to an ID value;
the arithmetic unit modules are divided into top, middle, and bottom arithmetic unit modules, and the local bus modules run in the vertical direction of the arithmetic unit modules; each arithmetic unit module consists of a state machine, a register file, and a multiply-accumulate unit module, where the register file comprises an input excitation register file, a weight register file, and an intermediate result register file; the data request input of the state machine is connected to the input of the register file, and the register file and the multiply-accumulate unit interact bidirectionally;
the arithmetic unit module is the most basic unit for completing the convolution operation; it receives input data from the input local bus module, completes a one-dimensional convolution, and, according to its position in the array, sends its intermediate result to the local bus module for upward transmission or completes the accumulation of intermediate results, finally obtaining the output excitation;
the local bus module is responsible for the unidirectional exchange of intermediate results between vertically adjacent arithmetic units and computes the spatial position of each arithmetic unit from the ID values of the adjacent arithmetic unit modules;
the local bus module derives a group of enable signals from the ID values of the arithmetic unit modules in the vertical direction and feeds them back to the arithmetic unit modules connected to it, thereby computing the spatial position of each arithmetic unit module; driven by the enable signals, each arithmetic unit module reads its input data and performs the convolution, obtaining the intermediate result of a one-dimensional convolution; the bottom arithmetic unit module then passes its intermediate result along the local bus to the middle arithmetic unit module, which accumulates it with its own intermediate result and passes the sum on to the top arithmetic unit module; the top arithmetic unit module accumulates its own intermediate result with those from the storage system and from the units below, finally obtaining the output excitation.
CN202010728621.7A 2020-07-27 2020-07-27 Method for constructing operation unit array structure facing neural network processing Active CN111967587B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010728621.7A CN111967587B (en) 2020-07-27 2020-07-27 Method for constructing operation unit array structure facing neural network processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010728621.7A CN111967587B (en) 2020-07-27 2020-07-27 Method for constructing operation unit array structure facing neural network processing

Publications (2)

Publication Number Publication Date
CN111967587A true CN111967587A (en) 2020-11-20
CN111967587B CN111967587B (en) 2024-03-29

Family

ID=73362989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010728621.7A Active CN111967587B (en) 2020-07-27 2020-07-27 Method for constructing operation unit array structure facing neural network processing

Country Status (1)

Country Link
CN (1) CN111967587B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113422786A (en) * 2021-08-24 2021-09-21 机械科学研究总院江苏分院有限公司 Communication system and communication method based on Internet of things equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341544A (en) * 2017-06-30 2017-11-10 清华大学 A kind of reconfigurable accelerator and its implementation based on divisible array
US20180307976A1 (en) * 2017-04-19 2018-10-25 Beijing Deephi Intelligence Technology Co., Ltd. Device for implementing artificial neural network with separate computation units
CN109993297A (en) * 2019-04-02 2019-07-09 南京吉相传感成像技术研究院有限公司 A kind of the sparse convolution neural network accelerator and its accelerated method of load balancing
CN110390384A (en) * 2019-06-25 2019-10-29 东南大学 A kind of configurable general convolutional neural networks accelerator
CN110751280A (en) * 2019-09-19 2020-02-04 华中科技大学 Configurable convolution accelerator applied to convolutional neural network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180307976A1 (en) * 2017-04-19 2018-10-25 Beijing Deephi Intelligence Technology Co., Ltd. Device for implementing artificial neural network with separate computation units
CN107341544A (en) * 2017-06-30 2017-11-10 清华大学 A kind of reconfigurable accelerator and its implementation based on divisible array
CN109993297A (en) * 2019-04-02 2019-07-09 南京吉相传感成像技术研究院有限公司 A kind of the sparse convolution neural network accelerator and its accelerated method of load balancing
CN110390384A (en) * 2019-06-25 2019-10-29 东南大学 A kind of configurable general convolutional neural networks accelerator
CN110751280A (en) * 2019-09-19 2020-02-04 华中科技大学 Configurable convolution accelerator applied to convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN Xiaodong et al.: "FPGA-based neural network controller and its applications", Proceedings of the 6th National Conference on Information Acquisition and Processing (3), 31 August 2008 (2008-08-31) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113422786A (en) * 2021-08-24 2021-09-21 机械科学研究总院江苏分院有限公司 Communication system and communication method based on Internet of things equipment

Also Published As

Publication number Publication date
CN111967587B (en) 2024-03-29

Similar Documents

Publication Publication Date Title
Ma et al. Optimizing the convolution operation to accelerate deep neural networks on FPGA
CN110097174B (en) Method, system and device for realizing convolutional neural network based on FPGA and row output priority
JP7329533B2 (en) Method and accelerator apparatus for accelerating operations
Demmel et al. Avoiding communication in sparse matrix computations
JP7358382B2 (en) Accelerators and systems for accelerating calculations
Li et al. A high performance FPGA-based accelerator for large-scale convolutional neural networks
CN106940815A (en) A kind of programmable convolutional neural networks Crypto Coprocessor IP Core
CN107657581A (en) Convolutional neural network CNN hardware accelerator and acceleration method
CN109977347B (en) Reconfigurable FFT processor supporting multimode configuration
WO2020156508A1 (en) Method and device for operating on basis of chip with operation array, and chip
CN111898733A (en) Deep separable convolutional neural network accelerator architecture
US20230135185A1 (en) Pooling unit for deep learning acceleration
Liu et al. Toward full-stack acceleration of deep convolutional neural networks on FPGAs
Lu et al. CHIP-KNN: A configurable and high-performance k-nearest neighbors accelerator on cloud FPGAs
CN111144556B (en) Hardware circuit of range batch normalization algorithm for deep neural network training and reasoning
CN111967587B (en) Method for constructing operation unit array structure facing neural network processing
Andri et al. Going further with winograd convolutions: Tap-wise quantization for efficient inference on 4x4 tiles
CN113157638A (en) Low-power-consumption in-memory calculation processor and processing operation method
CN104869284A (en) High-efficiency FPGA implementation method and device for bilinear interpolation amplification algorithm
Sun et al. Efficient tensor cores support in tvm for low-latency deep learning
CN113312285B (en) Convolutional neural network accelerator and working method thereof
Zhao et al. A deep residual networks accelerator on FPGA
CN115170381A (en) Visual SLAM acceleration system and method based on deep learning
CN113642722A (en) Chip for convolution calculation, control method thereof and electronic device
Brown et al. Nemo-cnn: An efficient near-memory accelerator for convolutional neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant