CN111967587B - Method for constructing operation unit array structure facing neural network processing - Google Patents

Method for constructing operation unit array structure facing neural network processing

Info

Publication number
CN111967587B
CN111967587B (application CN202010728621.7A)
Authority
CN
China
Prior art keywords
operation unit
module
unit module
local bus
register file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010728621.7A
Other languages
Chinese (zh)
Other versions
CN111967587A (en)
Inventor
韩军
张权
张永亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University
Priority to CN202010728621.7A
Publication of CN111967587A
Application granted
Publication of CN111967587B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30098 - Register arrangements
    • G06F 9/3012 - Organisation of register space, e.g. banked or distributed register file
    • G06F 9/30134 - Register stacks; shift registers

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to a method for constructing an operation unit array structure facing neural network processing. The array consists of operation unit modules and local bus modules. A single operation unit module is responsible for completing a one-dimensional convolution; the local bus module transmits the intermediate results upward and accumulates them so that the two-dimensional convolution is completed, which reduces the write-back of intermediate results and improves the overall energy-efficiency ratio of the system. Several register files are provided inside each operation unit module, so that the super-two-dimensional convolution of several convolution kernels can be performed at the same time, further raising data reuse and reducing intermediate-result write-backs. The operation unit array is self-organizing: it receives control signals from the top level, the local bus module automatically computes from the IDs of adjacent operation units the spatial position each operation unit needs for the two-dimensional convolution, and the units then complete the reception, transmission and processing of data on their own, giving the array a degree of autonomy. The invention improves computational efficiency in neural network processing.

Description

Method for constructing operation unit array structure facing neural network processing
Technical Field
The invention belongs to the technical field of integrated circuit design, and particularly relates to a method for constructing an operation unit array structure facing neural network processing.
Background
Deep convolutional neural networks are applied successfully in important fields such as computer vision, speech recognition and robot control, but these applications also place ever higher demands on the accuracy and complexity of neural network algorithms, so implementing the algorithms faces a series of challenging problems. Although conventional processor architectures have advanced, the direct exchange of data between operation units still suffers from low data reuse and poor energy efficiency. To alleviate these problems, researchers have in recent years designed spatial processor architectures based on array parallelism; combined with a suitable dataflow strategy, they can markedly raise the data reuse and operation speed of neural network algorithms.
Convolution is the most basic operation in neural network algorithms, and for current deep convolutional neural networks it accounts for an enormous amount of computation. Convolution is a tensor operation described by a mathematical expression whose key step is multiplying the weights of several convolution kernels with the values of the input feature map and accumulating the products; the standard form of this expression is given below. If the expression is solved directly, then as the complexity of neural network algorithms and the amount of computed data grow, the direct method reads and writes data from external storage frequently and greatly lowers the energy-efficiency ratio of the system. The alternative is to adopt a suitable dataflow strategy that keeps one data type stationary and so reduces the number of data reads and writes. A well-chosen dataflow selects the appropriate storage level for each data access, minimizing the energy consumed by memory accesses. An operation unit array matched to the dataflow strategy is a common hardware implementation; it suits the realization of input and output buses and thus greatly improves data-transfer efficiency. The common dataflow strategies are weight stationary (WS), output stationary (OS) and row stationary (RS). Comparing the three: the weight-stationary strategy only improves the reuse of weights, and the output-stationary strategy only reduces the reads and writes of intermediate results, whereas the row-stationary strategy improves the reuse of all three data types while also reducing the number of data reads and writes. Moreover, a row-stationary dataflow can bring several convolution kernels into one operation unit for super-two-dimensional operation, further increasing data reuse and reducing intermediate-result write-backs. This design therefore proposes a two-dimensional operation unit array structure based on a row-stationary dataflow strategy to implement neural network algorithms.
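For reference, the multiply-accumulate expression referred to above can be written in its standard form (the formula itself is omitted in this text; the notation below is the conventional one, with input feature map $I$, kernel weights $W$, bias $B$, stride $S$, output excitation $O$, $C$ input channels and an $R \times R$ kernel):

$$O[m][x][y] = B[m] + \sum_{c=0}^{C-1}\sum_{i=0}^{R-1}\sum_{j=0}^{R-1} I[c][Sx+i][Sy+j]\cdot W[m][c][i][j]$$

Each one-dimensional convolution computed by a single operation unit corresponds to fixing one kernel row, i.e. one value of $i$ in the inner sums, with the column of units accumulating over $i$ to form the two-dimensional result.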
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a method for constructing an operation unit array structure facing neural network processing which adopts a row-stationary dataflow strategy and a two-dimensional arrangement of operation units, thereby maximizing data reuse and reducing the write-back of intermediate results.
the invention provides a method for constructing an operation unit array structure facing neural network processing, wherein: the device comprises an operation unit module and a local bus module, wherein the input end of the local bus module is connected with an ID value;
the operation unit module is divided into a top operation unit module, a middle operation unit module and a bottom operation unit module, and the local bus module is positioned in the vertical direction of the operation unit module; the operation unit module consists of a state machine, a register file and a multiplication and accumulation unit module, wherein the register file comprises an input excitation register file, a weight register file and an intermediate result register file, the data request input end of the state machine is connected with the input end of the register file, and the register file and the multiplication and accumulation unit are in bidirectional interaction;
the operation unit module is used for completing the most basic unit of convolution operation, is responsible for receiving input data from the input local bus module, completing one-dimensional convolution operation, sending an intermediate result to the local bus module for upward transmission and completing accumulation of the intermediate result according to different positions of the operation unit module array, and finally obtaining output excitation;
the local bus module is responsible for unidirectional interaction of intermediate results between adjacent operation units in the vertical direction and calculates the spatial position of the operation units according to the ID numerical values of the adjacent operation unit modules;
the local bus module obtains a group of enabling signals according to the ID values of the operation unit modules in the vertical direction and feeds back the enabling signals to the operation unit modules connected with the local bus module, so that the space position of each operation unit module is calculated. According to the enabling signal, the operation unit module reads the input data and carries out convolution operation to obtain an intermediate result of one-dimensional convolution operation; then the middle result is upwards transmitted to the middle operation unit module along the local bus by the bottom operation unit module, the middle operation unit module accumulates the middle result of the middle operation unit module and the input middle result, and the accumulated middle result is upwards transmitted to the top operation unit module; the top operation unit module accumulates the intermediate result with the intermediate result from the storage system and the bottom operation unit module to finally obtain output excitation.
In the invention, the operation unit module contains register files for the three data types, forming the lowest storage level of the processor architecture. Keeping the input data stationary in these registers raises data reuse and reduces the number of input-data reads and writes, thereby lowering the power consumed by memory access. The intermediate results of the multiply-accumulate operations are also buffered there, reducing intermediate-result reads and writes and further improving the energy-efficiency ratio of the system.
The beneficial effects of the invention are: the row-stationary dataflow strategy adopted in this design avoids the write-back of intermediate results during two-dimensional convolution, reducing the number of intermediate-result reads and writes and achieving a high energy-efficiency ratio. At the same time, the register-file storage level introduced in the implementation of the operation unit preserves functional correctness while further raising data reuse and reducing intermediate-result write-backs, further improving the energy-efficiency ratio of the system.
Drawings
Fig. 1 is a basic block diagram of an arithmetic unit array structure.
Fig. 2 is a block diagram of an arithmetic unit module.
FIG. 3 is a layout of the spatial locations of the arithmetic units.
Fig. 4 is a schematic diagram of the rearrangement of sliding-window data inside the operation unit. Wherein: (a) shows the input excitation register of the operation unit, and (b) shows the shift operation of the input excitation register.
Reference numerals in the drawings: 1 is an operation unit module, 2 is a local bus module, 3 is a register file, 4 is an input excitation register file, 5 is a weight register file, 6 is an intermediate result register file, 7 is a state machine, 8 is a multiplication accumulation unit module, 9 is a top operation unit module, 10 is an intermediate operation unit module, and 11 is a bottom operation unit module.
Description of the embodiments
The invention is further illustrated by the following examples in connection with the accompanying drawings.
Example 1: the basic block diagram of the operation unit array structure is shown in fig. 1. The design works as follows. First, the local bus module 2 derives a set of enable signals from the ID values of the operation unit modules 1 in the vertical direction and feeds them back to the operation unit modules 1 connected to it, thereby determining the spatial position of each operation unit module 1. Driven by the enable signals, the operation unit module 1 reads its input data and performs the convolution, obtaining the intermediate result of a one-dimensional convolution. The bottom operation unit module 11 then passes this intermediate result upward along the local bus to the middle operation unit module 10; the middle operation unit module 10 accumulates its own intermediate result with the incoming one and passes the sum upward to the top operation unit module 9; the top operation unit module 9 accumulates its own intermediate results with those from the storage system and the bottom operation unit module 11, finally obtaining the output excitation.
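The enable mechanism can be made concrete with a small model. The encoding here is our assumption only: the patent states that the enables control the reading and writing of the intermediate-result FIFOs and that only the combinations listed in fig. 3 (discussed next) are legal, but it does not disclose the bit patterns.

```python
# Hypothetical enable encoding, ours: (read_psum_from_below,
# send_psum_upward, write_back_to_memory). Per fig. 3, exactly one
# combination is legal for each spatial position.
LEGAL_ENABLES = {
    "bottom": (False, True, False),  # produces psums and sends them up
    "middle": (True, True, False),   # accumulates and forwards upward
    "top": (True, False, True),      # accumulates and writes back to memory
}

def position_from_enables(combo):
    """Return the spatial position encoded by an enable combination;
    any pattern outside the fig. 3 set is illegal and rejected."""
    for position, legal in LEGAL_ENABLES.items():
        if combo == legal:
            return position
    raise ValueError(f"illegal enable combination: {combo}")

print(position_from_enables((True, True, False)))  # -> middle
```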
The structure of the operation unit module 1 is shown in fig. 2 and comprises six FIFO interfaces: the read FIFOs correspond to the input excitation, the weights, the intermediate result in the channel direction and the intermediate result from the bottom operation unit module 11, while the write FIFOs correspond to the intermediate result transmitted upward and the intermediate result written back to the storage system. Together these FIFOs form the data-transfer channels between the operation unit module 1 and the local bus module 2. The enable signals of the local bus module 2 control the operation unit module 1's reading and writing of the intermediate-result FIFOs; by mixing and matching the enable signals, the relative position of the operation unit module 1 in the spatial array of the software model is realized, and fig. 3 lists the enable-signal combinations corresponding to the spatial positions of the operation units (cf. the sketch above). It should be noted that any enable combination other than the patterns of fig. 3 is illegal.
The register files 3 of the three data types introduced inside the operation unit module 1 form the bottommost level of the storage system; they raise data reuse, reduce the number of reads and writes to higher storage levels, and lower the power overhead of data access. Table 1 lists the data bit widths and depths of the three register files. The data depth of the input excitation register file 4 is 12, equal to the height of the convolution operation array, i.e. the maximum convolution kernel size; the maximum window width of each convolution can be extended according to design requirements. The depth of the weight register file 5 is 224; this larger value is needed because several convolution kernels must be stored simultaneously to support super-two-dimensional operation inside the operation unit. The data depth of the intermediate result register file 6 is 24, one entry per output excitation in a different channel direction; this value bounds the maximum number of convolution kernels that can be kept stationary inside the one-dimensional convolution operation unit. The data-transfer channels and the internal storage hierarchy of the convolution operation unit together guarantee the data supply for operation.
The internal workflow is governed by a state machine 7, whose transitions complete the one-dimensional convolution and the data reads in order, wherein: IDLE represents the idle state; SET_CONFIG reads the configuration parameters, namely the line width of the input excitation, the line width of the convolution kernel, the horizontal stride of the convolution and the number of convolution kernels used for super-two-dimensional operation; READ_FILTER reads the corresponding number of weights from the FIFO according to the configuration parameters and places them in the internal register file; READ_WINDOW_DATA is responsible for reading the input excitation: on the first entry into this state, input excitations amounting to the convolution-kernel line width must be read, while on each subsequent entry only the number given by the stride parameter is read, and the newly read input excitations are placed in the internal register file.
It should be noted that the data read as the window slides by the stride parameter overwrite the data discarded by the slide, which raises the reuse of the input excitation; the slide, however, changes the relative positions of the input excitations within the register file, so a shifter is introduced at the registers to rearrange the window data into the correct positions. The rearrangement process is shown in fig. 4 and sketched below. DISTRIBUTE_DATA & ALU is responsible for feeding the data in the internal registers to the multiply-accumulate unit. Depending on the classification of the operation unit module 1, the bottom operation unit module 11 writes the computed intermediate results back into the FIFO for upward transmission, while the middle and top operation unit modules 10 and 9 temporarily write their results back into the register file 3. PSUM_READ_MEM & ACCUM_PSUM_MEM accumulates intermediate results in the channel direction inside the operation unit; this transition is taken only when the operation unit is located at the top, and for the other unit types the transition is illegal.
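The window update can be sketched in a few lines. This is our behavioral model, assuming the register file behaves as a shift register of kernel-line-width depth (consistent with fig. 4 and Table 1):

```python
def slide_window(window, new_values, stride):
    """window holds the current contents of the input excitation register
    file, oldest entry first; new_values are the `stride` freshly read
    excitations. The shifter drops the `stride` oldest entries, shifts the
    surviving data forward, and appends the new data at the rear, which
    restores every excitation to its correct relative position."""
    assert len(new_values) == stride
    return window[stride:] + new_values

# A kernel line width of 4 with horizontal stride 2: the first entry into
# READ_WINDOW_DATA fills the whole window; later entries reuse two old
# values and read only two new ones.
window = [1, 2, 3, 4]
window = slide_window(window, [5, 6], stride=2)
print(window)  # -> [3, 4, 5, 6]
```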
TABLE 1
Register file                          Data bit width    Depth
Input excitation register file (4)     INT8              12
Weight register file (5)               INT8              224
Intermediate result register file (6)  INT32             24
PSUM_READ_LOCAL & ACCUM_PSUM_LOCAL reads intermediate results from the FIFO for accumulation; this transition is taken only at the top and middle operation units. FINISH marks the end of the current operation and resets the relevant registers. The full state sequence is sketched below.
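To summarize the workflow of state machine 7, the following Python sketch (an illustrative model of ours, not the patent's RTL; state names are condensed from the text) steps through the states in the order described, with the position-dependent transitions guarded:

```python
from enum import Enum, auto

class S(Enum):
    IDLE = auto()
    SET_CONFIG = auto()
    READ_FILTER = auto()
    READ_WINDOW_DATA = auto()
    DISTRIBUTE_DATA_ALU = auto()
    PSUM_READ_MEM_ACCUM = auto()    # legal only for top units
    PSUM_READ_LOCAL_ACCUM = auto()  # legal only for top and middle units
    FINISH = auto()

def next_state(state: S, position: str) -> S:
    """position is 'top', 'middle' or 'bottom', as derived from the
    local bus enables; transitions follow the order in the text."""
    if state is S.IDLE:
        return S.SET_CONFIG                  # read configuration parameters
    if state is S.SET_CONFIG:
        return S.READ_FILTER                 # load weights into register file
    if state is S.READ_FILTER:
        return S.READ_WINDOW_DATA            # load input excitation window
    if state is S.READ_WINDOW_DATA:
        return S.DISTRIBUTE_DATA_ALU         # multiply-accumulate
    if state is S.DISTRIBUTE_DATA_ALU:
        if position == "top":
            return S.PSUM_READ_MEM_ACCUM     # add channel-direction psums
        if position == "middle":
            return S.PSUM_READ_LOCAL_ACCUM   # add psums from the unit below
        return S.FINISH                      # bottom: psum leaves via FIFO
    if state is S.PSUM_READ_MEM_ACCUM:
        return S.PSUM_READ_LOCAL_ACCUM       # top also adds the local psum
    return S.FINISH                          # reset the relevant registers

# Walk a top unit through one complete pass:
s = S.IDLE
while s is not S.FINISH:
    s = next_state(s, "top")
    print(s.name)
```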

Claims (1)

1. A method for constructing an operation unit array structure facing neural network processing, characterized in that: the structure comprises operation unit modules and local bus modules, and the input end of each local bus module is connected with an ID value;
the operation unit modules are divided into top, middle and bottom operation unit modules, and the local bus modules lie along the vertical direction of the operation unit modules; each operation unit module consists of a state machine, a register file and a multiply-accumulate unit module, the register file comprising an input excitation register file, a weight register file and an intermediate result register file; the data-request end of the state machine is connected to the input end of the register file, and the register file and the multiply-accumulate unit interact bidirectionally;
the operation unit module is the most basic unit for performing convolution: it receives input data from the input local bus module, completes a one-dimensional convolution, sends its intermediate result to the local bus module for upward transmission and, according to its position in the operation unit module array, accumulates intermediate results until the output excitation is finally obtained;
the local bus module is responsible for the unidirectional transfer of intermediate results between vertically adjacent operation units and computes the spatial position of each operation unit from the ID values of the adjacent operation unit modules;
the local bus module derives a group of enable signals from the ID values of the operation unit modules in the vertical direction and feeds them back to the operation unit modules connected to it, thereby determining the spatial position of each operation unit module; driven by the enable signals, the operation unit module reads its input data and performs the convolution, obtaining the intermediate result of a one-dimensional convolution; the bottom operation unit module then passes this intermediate result upward along the local bus to the middle operation unit module; the middle operation unit module accumulates its own intermediate result with the incoming one and passes the sum upward to the top operation unit module; the top operation unit module accumulates its own intermediate result with those from the storage system and from the units below, finally obtaining the output excitation.
CN202010728621.7A 2020-07-27 2020-07-27 Method for constructing operation unit array structure facing neural network processing Active CN111967587B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010728621.7A CN111967587B (en) 2020-07-27 2020-07-27 Method for constructing operation unit array structure facing neural network processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010728621.7A CN111967587B (en) 2020-07-27 2020-07-27 Method for constructing operation unit array structure facing neural network processing

Publications (2)

Publication Number Publication Date
CN111967587A (en) 2020-11-20
CN111967587B (en) 2024-03-29

Family

ID=73362989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010728621.7A Active CN111967587B (en) 2020-07-27 2020-07-27 Method for constructing operation unit array structure facing neural network processing

Country Status (1)

Country Link
CN (1) CN111967587B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113422786B (en) * 2021-08-24 2021-11-30 机械科学研究总院江苏分院有限公司 Communication system and communication method based on Internet of things equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679621B (en) * 2017-04-19 2020-12-08 赛灵思公司 Artificial neural network processing device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341544A * 2017-06-30 2017-11-10 清华大学 Reconfigurable accelerator based on a divisible array and implementation method thereof
CN109993297A * 2019-04-02 2019-07-09 南京吉相传感成像技术研究院有限公司 Load-balanced sparse convolutional neural network accelerator and acceleration method thereof
CN110390384A * 2019-06-25 2019-10-29 东南大学 Configurable general convolutional neural network accelerator
CN110751280A * 2019-09-19 2020-02-04 华中科技大学 Configurable convolution accelerator applied to convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FPGA-based neural network controller and its application; Chen Xiaodong et al.; Proceedings of the 6th National Conference on Information Acquisition and Processing (3); 2008-08-31; full text *

Also Published As

Publication number Publication date
CN111967587A (en) 2020-11-20

Similar Documents

Publication Publication Date Title
CN110097174B (en) Method, system and device for realizing convolutional neural network based on FPGA and row output priority
CN106940815A Programmable convolutional neural network crypto coprocessor IP core
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN109086244A (en) Matrix convolution vectorization implementation method based on vector processor
CN109977347B (en) Reconfigurable FFT processor supporting multimode configuration
CN104391820A (en) Universal floating point matrix processor hardware structure based on FPGA (field programmable gate array)
US20230135185A1 (en) Pooling unit for deep learning acceleration
CN111967587B (en) Method for constructing operation unit array structure facing neural network processing
Wang et al. A low-latency sparse-winograd accelerator for convolutional neural networks
CN111144556B (en) Hardware circuit of range batch normalization algorithm for deep neural network training and reasoning
Liu et al. WinoCNN: Kernel sharing Winograd systolic array for efficient convolutional neural network acceleration on FPGAs
Liu et al. Toward full-stack acceleration of deep convolutional neural networks on FPGAs
Liu et al. Memory-efficient architecture for accelerating generative networks on FPGA
CN113220630A (en) Reconfigurable array optimization method and automatic tuning method of hardware accelerator
US8214818B2 (en) Method and apparatus to achieve maximum outer level parallelism of a loop
CN114356836A (en) RISC-V based three-dimensional interconnected many-core processor architecture and working method thereof
Andri et al. Going further with winograd convolutions: Tap-wise quantization for efficient inference on 4x4 tiles
CN116451752A (en) Deep neural network hardware accelerator device
CN109447257B (en) Operation device of deep neural network acceleration chip with self-organized channels
CN113157638A (en) Low-power-consumption in-memory calculation processor and processing operation method
CN110555793B (en) Efficient deep convolution implementation method and visual processing method comprising same
CN113312285B (en) Convolutional neural network accelerator and working method thereof
CN115170381A (en) Visual SLAM acceleration system and method based on deep learning
CN113642722A (en) Chip for convolution calculation, control method thereof and electronic device
Zhao et al. A deep residual networks accelerator on FPGA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant