CN109948774B - Neural network accelerator based on network layer binding operation and implementation method thereof - Google Patents

Info

Publication number
CN109948774B
Authority
CN
China
Prior art keywords
module
characteristic value
layer
convolution
pooling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910070755.1A
Other languages
Chinese (zh)
Other versions
CN109948774A (en)
Inventor
黄立文
谭展宏
陈小柏
虞志益
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201910070755.1A
Publication of CN109948774A
Application granted
Publication of CN109948774B

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a neural network accelerator based on network layer binding operation, and an implementation method thereof. The invention merges and packs intermediate layers with large data volumes and selects layers with relatively small data volumes for output, reducing the amount of output data. Off-chip memory needs to be read only once, during the first convolution operation: a single data read suffices to complete the multi-layer bound computation. This effectively reduces off-chip memory usage, reduces data-access operations, improves working efficiency, and reduces redundant state toggling in the circuit, thereby lowering power consumption and cost. The invention can be widely applied in the technical field of deep neural networks.

Description

Neural network accelerator based on network layer binding operation and implementation method thereof
Technical Field
The invention relates to the technical field of deep neural networks, in particular to a neural network accelerator based on network layer binding operation and an implementation method thereof.
Background
Convolutional neural networks are widely applied in the fields of computer vision, speech recognition, natural language processing and the like, owing to their high accuracy and high performance.
Convolutional neural network operation is both computation-intensive and storage-intensive; the large number of computations and data-read operations places a heavy load on the processor. For this reason, convolutional neural network computation has been ported to GPUs, FPGAs and even ASICs. A GPU offers high parallelism but consumes a great deal of power. An ASIC is low-power and high-performance, but its development cycle is long, its cost is high, and its hardware is not easy to reconfigure. An FPGA is a compromise between the GPU and the ASIC: a low-cost, short-development-cycle design platform on which neural network operations can also be specially optimized, achieving high-performance computation.
In a traditional neural network accelerator, each layer is processed iteratively, and the next layer begins only after the current layer has been processed completely. The intermediate-layer data are therefore very large, and off-chip storage must be accessed many times. Such accesses, however, are very costly.
In summary, current neural network accelerators have the following disadvantages:
First, the power consumption overhead is large: the power consumed by off-chip memory accesses is very high.
Second, the time overhead is large: an off-chip access is far from instantaneous, and many cycles pass between issuing the access signal and finally receiving the data.
Third, the economic overhead is large: off-chip memory access brings in DRAM (e.g., DDR). DDR is relatively expensive, and a design that reduces the DDR requirement can greatly reduce the product's cost.
Disclosure of Invention
To solve the above technical problems, the present invention aims to provide a neural network accelerator based on network layer binding operation, and an implementation method thereof, that are low in cost, high in efficiency and low in power consumption.
The technical scheme adopted by one aspect of the invention is as follows:
the neural network accelerator based on the network layer binding operation comprises:
the off-chip memory module is used for storing the picture data acquired from the camera and preset weight parameters;
the weight parameter caching module is used for storing preset weight parameters read from the off-chip storage module;
the characteristic value cache module is used for storing input picture data of the first convolution layer, an output characteristic value of the middle layer and an input characteristic value of the next convolution layer;
the calculation unit array module is used for executing convolution operation according to the input characteristic value and the weight parameter;
the weight parameter register module is used for storing the weight parameters read from the weight parameter cache module;
the characteristic value register module is used for storing the output characteristic value of the middle layer obtained by convolution operation;
the local addition tree module is used for accumulating the partial sums of all the output channels;
the pooling calculation module is used for performing pooling calculation on the result of the convolution operation;
and the output channel addition tree module is used for accumulating the output channels of the computing unit array.
Further, the accelerator also comprises:
a cache control module, used for controlling the weight parameters to be read into the weight parameter register module.
Further, the cache control module is also used for controlling the characteristic value to be read into the characteristic value register module.
Further, the number of the computing unit arrays is 8.
Further, the convolutional neural network comprises a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a third convolutional layer, a fourth convolutional layer, a fifth convolutional layer, a third pooling layer, a first full-connection layer, a second full-connection layer and a third full-connection layer;
wherein the dimension of the input characteristic value of the second convolution layer is 27 × 27 × 96, and the dimension of the output characteristic value of the second convolution layer is 27 × 27 × 256;
the dimension of the input characteristic value of the third convolutional layer is 13 × 13 × 256, and the dimension of the output characteristic value of the third convolutional layer is 13 × 13 × 384;
the input characteristic value dimension of the fourth convolution layer is 15 × 15 × 384, and the output characteristic value dimension of the fourth convolution layer is 13 × 13 × 384;
the input characteristic value dimension of the fifth convolutional layer is 15 × 15 × 384, and the output characteristic value dimension of the fifth convolutional layer is 13 × 13 × 384.
The technical scheme adopted by the other aspect of the invention is as follows:
the implementation method of the neural network accelerator based on the network layer binding operation comprises the following steps:
storing picture data acquired from a camera and preset weight parameters through an off-chip memory module;
storing preset weight parameters read from the off-chip storage module through a weight parameter cache module;
storing the input image data of the first convolution layer, the output characteristic value of the middle layer and the input characteristic value of the next convolution layer through a characteristic value cache module;
performing convolution operation according to the input characteristic value and the weight parameter through the computing unit array;
storing the weight parameters read from the weight parameter cache module through a weight parameter register module;
storing, through a characteristic value register module, the output characteristic value of the middle layer obtained by convolution operation;
accumulating the partial sums of each output channel through a local addition tree module;
performing pooling calculation on the result of the convolution operation through a pooling calculation module;
and accumulating the output channels of the computing unit array through an output channel addition tree module.
Further, the method also comprises the following steps:
controlling the weight parameters to be read into the weight parameter register module through the cache control module.
Further, the method also comprises the following steps:
and the characteristic value is controlled to be read into the characteristic value register module through the cache control module.
Further, the method also comprises the following steps:
splicing four computing unit arrays into a large computing unit array;
performing convolution calculation on the input characteristic values through the large computing unit array obtained by splicing;
performing pooling processing on the result of the convolution calculation to obtain the input feature map of the next convolution layer.
Further, the method also comprises the following steps:
reading input data of the first convolution layer from an off-chip memory;
executing a first convolution operation of the first convolution layer through the computing unit array;
performing first pooling operation on the result of the first convolution operation through a pooling calculation module, and storing the result of the first pooling operation in a characteristic value cache module;
reading input data of the second convolution layer from the characteristic value cache module;
performing first bundling operation on the second convolution operation, the third convolution operation and the second pooling operation to obtain an output result of the second pooling operation, and storing the output result of the second pooling operation in a characteristic value cache module;
reading input data of the fourth convolution layer from the characteristic value cache module;
and performing second bundling operation on the fourth convolution operation, the fifth convolution operation and the third pooling operation to obtain an output result of the third pooling operation.
The invention has the following beneficial effects. The invention merges and packs intermediate layers with large data volumes and selects layers with relatively small data volumes for output, reducing the amount of output data. The characteristic values produced by convolution are stored in the characteristic value register module, so they need not be read back from an off-chip memory; off-chip memory is read only once, during the first convolution operation, and that single read suffices to complete the multi-layer bound computation. This effectively reduces off-chip memory usage, reduces data-access operations, improves working efficiency, and reduces redundant state toggling in the circuit, thereby lowering power consumption and cost.
Drawings
FIG. 1 is a schematic diagram of a neural network accelerator based on network layer binding according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a first implementation step of the embodiment of the invention;
fig. 3 is a schematic diagram of a second implementation procedure of the embodiment of the invention.
Detailed Description
The invention will be further explained and illustrated below with reference to the drawings and the embodiments in the description. The step numbers in the embodiments of the present invention are set for convenience of illustration only; the order of the steps is not limited in any way, and the execution order of the steps in an embodiment can be adjusted adaptively according to the understanding of those skilled in the art.
Referring to fig. 1, an embodiment of the present invention provides a neural network accelerator based on network layer binding operation, including:
the off-chip memory module is used for storing the picture data acquired from the camera and preset weight parameters;
the weight parameter caching module is used for storing preset weight parameters read from the off-chip storage module;
the characteristic value cache module is used for storing input picture data of the first convolution layer, an output characteristic value of the middle layer and an input characteristic value of the next convolution layer;
the calculation unit array module is used for executing convolution operation according to the input characteristic value and the weight parameter;
the weight parameter register module is used for storing the weight parameters read from the weight parameter caching module;
the characteristic value register module is used for storing the output characteristic value of the middle layer obtained by convolution operation;
the local addition tree module is used for accumulating the partial sums of all the output channels;
the pooling calculation module is used for performing pooling calculation on the result of the convolution operation;
and the output channel addition tree module is used for accumulating the output channels of the computing unit array.
Referring to fig. 1, a further preferred embodiment further includes:
and the cache control module is used for controlling the weight parameter to be read into the weight parameter register module.
Referring to fig. 1, as a further preferred embodiment, the cache control module is further configured to control reading of the characteristic value into the characteristic value register module.
Referring to fig. 1, in a further preferred embodiment, the number of the calculation unit arrays is 8.
Further preferably, the convolutional neural network comprises a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a third convolutional layer, a fourth convolutional layer, a fifth convolutional layer, a third pooling layer, a first fully-connected layer, a second fully-connected layer and a third fully-connected layer;
wherein the dimension of the input characteristic value of the second convolution layer is 27 × 27 × 96, and the dimension of the output characteristic value of the second convolution layer is 27 × 27 × 256;
the dimension of the input characteristic value of the third convolutional layer is 13 × 13 × 256, and the dimension of the output characteristic value of the third convolutional layer is 13 × 13 × 384;
the input characteristic value dimension of the fourth convolution layer is 15 × 15 × 384, and the output characteristic value dimension of the fourth convolution layer is 13 × 13 × 384;
the input characteristic value dimension of the fifth convolutional layer is 15 × 15 × 384, and the output characteristic value dimension of the fifth convolutional layer is 13 × 13 × 384.
Based on the neural network accelerator shown in fig. 1, an embodiment of the present invention further provides an implementation method of a neural network accelerator based on network layer binding operation, including the following steps:
storing picture data acquired from a camera and preset weight parameters through an off-chip memory module;
storing preset weight parameters read from the off-chip storage module through a weight parameter cache module;
storing the input image data of the first convolution layer, the output characteristic value of the middle layer and the input characteristic value of the next convolution layer through a characteristic value cache module;
performing convolution operation according to the input characteristic value and the weight parameter through the computing unit array;
storing the weight parameters read from the weight parameter cache module through a weight parameter register module;
storing, through a characteristic value register module, the output characteristic value of the middle layer obtained by convolution operation;
accumulating the partial sums of each output channel through a local addition tree module;
performing pooling calculation on the result of the convolution operation through a pooling calculation module;
and accumulating the output channels of the computing unit array through an output channel addition tree module.
Further as a preferred embodiment, the method further comprises the following steps:
controlling the weight parameters to be read into the weight parameter register module through the cache control module.
Further as a preferred embodiment, the method further comprises the following steps:
controlling the characteristic value to be read into the characteristic value register module through the cache control module.
Further as a preferred embodiment, the method further comprises the following steps:
splicing four computing unit arrays into a large computing unit array;
performing convolution calculation on the input characteristic values through the large computing unit array obtained by splicing;
performing pooling processing on the result of the convolution calculation to obtain the input feature map of the next convolution layer.
Further as a preferred embodiment, the method further comprises the following steps:
reading input data of the first convolution layer from an off-chip memory;
executing a first convolution operation of the first convolution layer through the computing unit array;
performing first pooling operation on the result of the first convolution operation through a pooling calculation module, and storing the result of the first pooling operation in a characteristic value cache module;
reading input data of the second convolution layer from the characteristic value cache module;
performing first bundling operation on the second convolution operation, the third convolution operation and the second pooling operation to obtain an output result of the second pooling operation, and storing the output result of the second pooling operation in a characteristic value cache module;
reading input data of the fourth convolution layer from the characteristic value cache module;
and performing second bundling operation on the fourth convolution operation, the fifth convolution operation and the third pooling operation to obtain an output result of the third pooling operation.
Aiming at the low-power and low-cost requirements of mobile-terminal accelerators, the invention designs an accelerator that performs multi-layer binding operation on network layers, taking AlexNet as an implementation case. The embodiment is realized on the Xilinx ZYNQ UltraScale+ ZCU102 development platform, which has 2500 DSP resources and suits the bundled network computing mode of the invention.
Specifically, taking AlexNet as an example, the basic network structure of this embodiment is, in order: a first convolution layer Conv1, a first pooling layer Pooling1, a second convolution layer Conv2, a second pooling layer Pooling2, a third convolution layer Conv3, a fourth convolution layer Conv4, a fifth convolution layer Conv5, a third pooling layer Pooling3, and three final fully-connected layers, where Conv denotes a convolution layer and Pooling denotes a pooling layer.
Firstly, the Conv1 input is read from the off-chip memory, the Conv1 convolution operation is executed in the computing unit arrays, the Pooling1 pooling operation is executed in the pooling calculation module, and the result is stored in the characteristic value cache module.
Next, the Conv2 input is read from the characteristic value cache module; Conv2, Pooling2 and Conv3 are then bound and computed together, directly yielding the Conv3 output, which is stored in the characteristic value cache module.
Then, the Conv4 input is read from the characteristic value cache module, and Conv4, Conv5 and Pooling3 are bound and computed together, directly yielding the Pooling3 output. A functional sketch of this schedule follows.
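For orientation, here is a minimal NumPy sketch of this bound schedule. It is a functional model only, not the hardware: the 227 × 227 × 3 input, kernel sizes and strides follow standard AlexNet and are assumptions here (the description quotes only the feature-map dimensions), and bias, ReLU and AlexNet's channel grouping are omitted. What it illustrates is which results are written back to the characteristic value cache at each step.

```python
import numpy as np

def conv2d(x, w, stride=1, pad=0):
    """Plain direct convolution: x is (H, W, Cin), w is (kh, kw, Cin, Cout)."""
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    kh, kw, _, cout = w.shape
    H = (xp.shape[0] - kh) // stride + 1
    W = (xp.shape[1] - kw) // stride + 1
    wf = w.reshape(-1, cout)
    y = np.empty((H, W, cout))
    for i in range(H):
        for j in range(W):
            patch = xp[i*stride:i*stride+kh, j*stride:j*stride+kw]
            y[i, j] = patch.reshape(-1) @ wf
    return y

def maxpool(x, k=3, stride=2):
    """3x3 stride-2 max pooling over x (H, W, C), as used by AlexNet."""
    H = (x.shape[0] - k) // stride + 1
    W = (x.shape[1] - k) // stride + 1
    y = np.empty((H, W, x.shape[2]))
    for i in range(H):
        for j in range(W):
            y[i, j] = x[i*stride:i*stride+k, j*stride:j*stride+k].max(axis=(0, 1))
    return y

rng = np.random.default_rng(0)
image = rng.standard_normal((227, 227, 3))        # the single off-chip read
w1 = rng.standard_normal((11, 11, 3, 96)) * 0.01  # kernel shapes are assumed
w2 = rng.standard_normal((5, 5, 96, 256)) * 0.01  # from standard AlexNet;
w3 = rng.standard_normal((3, 3, 256, 384)) * 0.01 # the feature-map dimensions
w4 = rng.standard_normal((3, 3, 384, 384)) * 0.01 # match the ones quoted in
w5 = rng.standard_normal((3, 3, 384, 384)) * 0.01 # this description

# Step 1: Conv1 + Pooling1; only the pooled 27x27x96 map enters the buffer.
fv = maxpool(conv2d(image, w1, stride=4))
# Step 2: Conv2 -> Pooling2 -> Conv3 bound together; the intermediates never
# leave the chip, only the Conv3 output (13x13x384) is written to the buffer.
fv = conv2d(maxpool(conv2d(fv, w2, pad=2)), w3, pad=1)
# Step 3: Conv4 -> Conv5 -> Pooling3 bound; only the pooled result is kept.
out = maxpool(conv2d(conv2d(fv, w4, pad=1), w5, pad=1))
print(out.shape)  # (6, 6, 384)
```

Only `fv` and `out` cross a step boundary in this model; everything else stays inside one bound pass.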
The invention selects the layer with a relatively small data volume for output; compared with the prior approach of writing every layer's data out to the off-chip memory, this reduces the amount of output data.
Because different convolution and pooling layers output different amounts of data, the invention selects the layer whose output is relatively small and stores it in the characteristic value cache module. A layer with a larger data volume is not written out directly; instead, the next layer is computed immediately (assuming the next layer's data volume is relatively small), and that layer's output is what gets stored in the characteristic value cache module. For example, in this embodiment, since the output data volume of Conv2 is larger than that of Conv3, the Conv2 result is not written out; Conv2 is bundled with Pooling2 and Conv3, and the Conv3 output is written out instead.
In addition, Conv4 and Conv5 output the same amount of data; this embodiment still merges and binds them. In that case the effect of 'selecting the layer with a relatively small data volume to output' is not obtained, but the effect of 'reducing the amount of data written out (rather than storing every layer in the off-chip memory)' still is. The short tally below makes the comparison concrete.
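As a quick tally, using only the dimensions quoted in this description (the arithmetic below is illustrative, not part of the claimed design):

```python
# Output sizes (number of characteristic values) of the layers discussed above.
volumes = {
    "Conv2":    27 * 27 * 256,   # 186,624
    "Pooling2": 13 * 13 * 256,   #  43,264
    "Conv3":    13 * 13 * 384,   #  64,896
    "Conv4":    13 * 13 * 384,   #  64,896
    "Conv5":    13 * 13 * 384,   #  64,896
}
for layer, n in volumes.items():
    print(f"{layer:9s} {n:7,d} values")

# Conv2's output is roughly 2.9x Conv3's, so the Conv2/Pooling2/Conv3 binding
# writes back only the smaller Conv3 result; Conv4 and Conv5 are equal in
# size, so binding them simply removes one full write-back.
```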
As shown in fig. 1, the neural network accelerator of this embodiment includes an off-chip memory module, a weight parameter cache module, a cache control module, a feature value cache module, a computing unit array module, a weight parameter register module, a feature value register module, a local addition tree module, a pooling computing module, and an output channel addition tree module.
The off-chip memory module is used for storing the image data acquired from the camera and the trained weight parameters;
the weight parameter caching module stores the weight parameters read from the off-chip memory module;
the cache control module is used for controlling the weight parameter to be read into the weight parameter register module;
the characteristic value cache module is used for storing input picture data of a first layer, intermediate layer output characteristic values returned from the computing unit array and input characteristic values of a later layer;
the cache control module is used for controlling the characteristic value to be read into the characteristic value register module;
the computing unit array module is used for performing convolution operations on the input characteristic values and the weights;
the weight parameter register module is used for storing read weight parameters;
the characteristic value register module is used for storing the output characteristic value of the middle layer;
the local addition tree module is used for accumulating the partial sum of each output channel;
the pooling computing module is used for pooling computing;
and the output channel addition tree module is used for adding the output channels of the 8 calculation unit arrays.
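Where a concrete picture helps, the accumulation performed by the output channel addition tree can be sketched as follows; the pairwise tree structure is an assumption, since the description says only that the module adds the 8 arrays' outputs:

```python
import numpy as np

def adder_tree(psums):
    """Reduce a list of partial-sum maps pairwise, log2(n) levels deep,
    modelling an addition tree over the computing unit arrays' outputs."""
    level = list(psums)
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

rng = np.random.default_rng(0)
parts = [rng.standard_normal((13, 13)) for _ in range(8)]  # 8 PE array outputs
assert np.allclose(adder_tree(parts), sum(parts))          # same result, fewer
                                                           # sequential stages
```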
As shown in fig. 2, the Conv4, Conv5 and Pooling3 network layer binding operation mode is implemented as follows:
the input eigenvalue dimension of the two-layer convolution operations of Conv4 and Conv5 is 15 × 15 × 384, and the output eigenvalue dimension is 13 × 13 × 384.
1) For Conv4, the present embodiment adopts a data multiplexing strategy with stable output:
firstly, all input channels CH X1, CH X2, … …, CH Xk of Conv4 and convolution kernels corresponding to a Conv4 layer are read in one time, and multiplication and addition operation is carried out in the same computing unit Array 15 × 15PE Array, so that a complete output channel of Conv4, such as CH Y1, can be obtained;
then, repeating the operation Yn times to obtain all output channels of Conv4, namely CH Y1, CHY2, … …, CH Yn;
the output channels of Conv4 are all stored in the characteristic value register module of the middle layer on the chip, and do not need to be stored in an additional off-chip memory.
2) For Conv5, the present embodiment adopts a data multiplexing strategy with stable input:
firstly, taking an output channel obtained by Conv4 as an input channel of Conv5, such as CH Y1, obtaining a convolution kernel corresponding to a Conv5 layer, and performing multiplication and addition operation in the same computing unit Array 15 × 15PE Array to obtain a group of partial sums of all output channels of Conv5, namely Psum CH Z1, psum CH Z2, … … and Psum CH Zm;
then, after the 8 calculation unit arrays are calculated, 8 groups of Psum CH Z1, psum CH Z2, … … and Psum CH Zm can be obtained, and all the parts are accumulated to obtain the output of the final M channel;
since there are 384 groups of input channels in the Conv5, the operation can be completed by repeating 384/8=48 times, and all the output channels of the Conv5 are obtained.
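The two dataflows can be modelled functionally with the NumPy sketch below. It assumes 3 × 3 same-padding convolutions for Conv4 and Conv5 (consistent with the 15 × 15 padded inputs and 13 × 13 outputs quoted above); the array `mid` stands in for the on-chip characteristic value registers, and the accumulation into `out` stands in for the output channel addition tree. It models the order of data movement, not the circuit.

```python
import numpy as np

def conv_output_stationary(x, w_n):
    """One output-stationary pass: consume every input channel of x (H, W, C)
    against a single 3x3xC kernel w_n, producing one complete output channel."""
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))   # pad 13x13 up to 15x15
    y = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            y[i, j] = np.sum(xp[i:i+3, j:j+3, :] * w_n)
    return y

def psums_input_stationary(x_ch, w_ch):
    """One input-stationary pass: hold one input channel x_ch (H, W) and emit
    its partial-sum contribution to all M output channels; w_ch is (3, 3, M)."""
    H, W = x_ch.shape
    xp = np.pad(x_ch, 1)
    M = w_ch.shape[-1]
    wf = w_ch.reshape(-1, M)                   # (9, M)
    y = np.empty((H, W, M))
    for i in range(H):
        for j in range(W):
            y[i, j] = xp[i:i+3, j:j+3].reshape(-1) @ wf
    return y

rng = np.random.default_rng(0)
conv4_in = rng.standard_normal((13, 13, 384))       # one read from the buffer
w4 = rng.standard_normal((3, 3, 384, 384)) * 0.01   # Conv4 kernels
w5 = rng.standard_normal((3, 3, 384, 384)) * 0.01   # Conv5 kernels

# 1) Conv4, output-stationary: Yn passes, each producing one full channel
#    CH Yn; the result stays in the modelled on-chip registers ('mid').
mid = np.stack([conv_output_stationary(conv4_in, w4[..., n])
                for n in range(384)], axis=-1)      # 13x13x384, never off-chip

# 2) Conv5, input-stationary: per pass, each of the 8 PE arrays takes one of
#    Conv4's output channels and produces partial sums Psum CH Z1..Zm for all
#    output channels; the addition tree accumulates them. 384/8 = 48 passes.
out = np.zeros((13, 13, 384))
for g in range(384 // 8):                 # 48 passes
    for pe in range(8):                   # the 8 PE arrays work in parallel
        ch = g * 8 + pe
        out += psums_input_stationary(mid[:, :, ch], w5[:, :, ch, :])
print(out.shape)                          # (13, 13, 384): Conv5 output
```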
The invention designs a bound network-layer operation mode based on output-stationary and input-stationary dataflows, which enables two consecutive convolution layers to be computed back to back: after the former layer finishes an entire intermediate output channel, the latter layer can continue operating directly on the compute array without the data being stored in off-chip memory, obtaining the partial sums of several groups of the latter layer's output channels and accumulating them continuously. This saves one pass of writing data to the off-chip memory, removes unnecessary operations, improves computational efficiency, and reduces power consumption during computation.
As shown in fig. 3, the Conv2, Pooling2 and Conv3 network layer bundling operation mode is implemented as follows:
the input characteristic value dimension of Conv2 is 27 × 27 × 96, and its output characteristic value dimension is 27 × 27 × 256;
the input characteristic value dimension of Conv3 is 13 × 13 × 256, and its output characteristic value dimension is 13 × 13 × 384;
since the output feature map of Conv2 is larger than a single computing unit array, four arrays must be spliced into one large computing unit array, i.e. four 15 × 15 PE Arrays are tiled together, so that the Conv2 convolution can be completed with the output-stationary dataflow as intended.
After the Conv2 operation completes, the pooling module outputs the input feature map of Conv3, whose size is 13 × 13; this embodiment can then map the Conv3 input feature map entirely onto one of the computing unit arrays, i.e. a single 15 × 15 PE Array, and execute the Conv3 operation on that array.
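As a concrete illustration of the splicing, the short sketch below partitions a 27 × 27 output map across four 15 × 15 arrays. The exact partition (one full tile plus three edge tiles) is an assumption; the description states only that four arrays are spliced into one large array.

```python
ARRAY = 15   # one PE array computes up to a 15x15 output tile
OUT = 27     # Conv2's output feature map is 27x27

tiles = []
for r0 in range(0, OUT, ARRAY):
    for c0 in range(0, OUT, ARRAY):
        r1, c1 = min(r0 + ARRAY, OUT), min(c0 + ARRAY, OUT)
        tiles.append(((r0, r1), (c0, c1)))

print(tiles)
# [((0, 15), (0, 15)), ((0, 15), (15, 27)), ((15, 27), (0, 15)), ((15, 27), (15, 27))]
# Four tiles, one per PE array: 15x15, 15x12, 12x15 and 12x12; together they
# cover the whole 27x27 map in one output-stationary pass. Conv3's 13x13 map,
# by contrast, fits entirely inside a single 15x15 array.
```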
In summary, the accelerator designed by the invention comprises an off-chip memory module, a weight parameter cache module, a cache control module, a characteristic value cache module, a computing unit array module, a weight parameter register module, a characteristic value register module, a local addition tree module, a pooling calculation module and an output channel addition tree module, and a network layer binding operation mode based on the Conv2, Pooling2 and Conv3 binding operation and the Conv4, Conv5 and Pooling3 binding operation is designed in detail.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (4)

1. The method for realizing the neural network accelerator based on the network layer binding operation is characterized by comprising the following steps:
storing picture data acquired from a camera and preset weight parameters through an off-chip memory module;
storing preset weight parameters read from the off-chip storage module through a weight parameter cache module;
storing the input image data of the first convolution layer, the output characteristic value of the middle layer and the input characteristic value of the next convolution layer through a characteristic value cache module;
performing convolution operation according to the input characteristic value and the weight parameter through the computing unit array;
storing the weight parameters read from the weight parameter cache module through a weight parameter register module;
storing, through a characteristic value register module, the output characteristic value of the middle layer obtained by convolution operation;
accumulating the partial sums of each output channel through a local addition tree module;
performing pooling calculation on the result of the convolution operation through a pooling calculation module;
accumulating the output channels of the computing unit array through an output channel addition tree module;
further comprising the steps of:
reading input data of the first convolution layer from an off-chip memory;
executing a first convolution operation of the first convolution layer through the computing unit array;
performing first pooling operation on the result of the first convolution operation through a pooling calculation module, and storing the result of the first pooling operation in a characteristic value cache module;
reading input data of the second convolution layer from the characteristic value cache module;
performing first bundling operation on the second convolution operation, the third convolution operation and the second pooling operation to obtain an output result of the second pooling operation, and storing the output result of the second pooling operation in a characteristic value cache module;
reading input data of the fourth convolution layer from the characteristic value cache module;
and performing second bundling operation on the fourth convolution operation, the fifth convolution operation and the third pooling operation to obtain an output result of the third pooling operation.
2. The method for implementing a neural network accelerator based on network layer binding operation as claimed in claim 1, wherein the method further comprises the following steps:
controlling the weight parameters to be read into the weight parameter register module through the cache control module.
3. The method for implementing a neural network accelerator based on network layer binding operation as claimed in claim 2, wherein the method further comprises the following steps:
controlling the characteristic value to be read into the characteristic value register module through the cache control module.
4. The method for implementing a neural network accelerator based on network layer binding operation as claimed in claim 1, wherein the method further comprises the following steps:
splicing four computing unit arrays into a large computing unit array;
performing convolution calculation on the input characteristic values through the large computing unit array obtained by splicing;
performing pooling processing on the result of the convolution calculation to obtain the input feature map of the next convolution layer.
CN201910070755.1A 2019-01-25 2019-01-25 Neural network accelerator based on network layer binding operation and implementation method thereof Active CN109948774B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910070755.1A CN109948774B (en) 2019-01-25 2019-01-25 Neural network accelerator based on network layer binding operation and implementation method thereof

Publications (2)

Publication Number Publication Date
CN109948774A CN109948774A (en) 2019-06-28
CN109948774B true CN109948774B (en) 2022-12-13

Family

ID=67007223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910070755.1A Active CN109948774B (en) 2019-01-25 2019-01-25 Neural network accelerator based on network layer binding operation and implementation method thereof

Country Status (1)

Country Link
CN (1) CN109948774B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516801B (en) * 2019-08-05 2022-04-22 西安交通大学 High-throughput-rate dynamic reconfigurable convolutional neural network accelerator
CN110490302B (en) * 2019-08-12 2022-06-07 中科寒武纪科技股份有限公司 Neural network compiling and optimizing method and device and related products
CN112396072B (en) * 2019-08-14 2022-11-25 上海大学 Image classification acceleration method and device based on ASIC (application specific integrated circuit) and VGG16
CN110780923B (en) * 2019-10-31 2021-09-14 合肥工业大学 Hardware accelerator applied to binary convolution neural network and data processing method thereof
CN110991630A (en) * 2019-11-10 2020-04-10 天津大学 Convolutional neural network processor for edge calculation
CN112819022B (en) * 2019-11-18 2023-11-07 同方威视技术股份有限公司 Image recognition device and image recognition method based on neural network
CN111078189B (en) * 2019-11-23 2023-05-02 复旦大学 Sparse matrix multiplication accelerator for cyclic neural network natural language processing
CN111062471B (en) * 2019-11-23 2023-05-02 复旦大学 Deep learning accelerator for accelerating BERT neural network operation
CN111126589B (en) 2019-12-31 2022-05-20 昆仑芯(北京)科技有限公司 Neural network data processing device and method and electronic equipment
CN111340198B (en) * 2020-03-26 2023-05-05 上海大学 Neural network accelerator for data high multiplexing based on FPGA
CN111783967B (en) * 2020-05-27 2023-08-01 上海赛昉科技有限公司 Data double-layer caching method suitable for special neural network accelerator
CN112766479B (en) * 2021-01-26 2022-11-11 东南大学 Neural network accelerator supporting channel separation convolution based on FPGA
CN113159309B (en) * 2021-03-31 2023-03-21 华南理工大学 NAND flash memory-based low-power-consumption neural network accelerator storage architecture
CN113191493A (en) * 2021-04-27 2021-07-30 北京工业大学 Convolutional neural network accelerator based on FPGA parallelism self-adaptation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105184366A (en) * 2015-09-15 2015-12-23 中国科学院计算技术研究所 Time-division-multiplexing general neural network processor
EP3154001A2 (en) * 2015-10-08 2017-04-12 VIA Alliance Semiconductor Co., Ltd. Neural network unit with neural memory and array of neural processing units that collectively shift row of data received from neural memory
CN107918794A (en) * 2017-11-15 2018-04-17 中国科学院计算技术研究所 Neural network processor based on computing array
CN109032781A (en) * 2018-07-13 2018-12-18 重庆邮电大学 A kind of FPGA parallel system of convolutional neural networks algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A hybrid pruning method for convolutional neural network compression; 靳丽蕾 (Jin Lilei) et al.; 《小型微型计算机系统》 (Journal of Chinese Computer Systems); 31 December 2018 (No. 12); pp. 2596-2601 *

Also Published As

Publication number Publication date
CN109948774A (en) 2019-06-28

Similar Documents

Publication Publication Date Title
CN109948774B (en) Neural network accelerator based on network layer binding operation and implementation method thereof
CN108805266B (en) Reconfigurable CNN high-concurrency convolution accelerator
CN106875011B (en) Hardware architecture of binary weight convolution neural network accelerator and calculation flow thereof
CN110097174B (en) Method, system and device for realizing convolutional neural network based on FPGA and row output priority
CN109447241B (en) Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things
US20190188237A1 (en) Method and electronic device for convolution calculation in neutral network
CN110852428B (en) Neural network acceleration method and accelerator based on FPGA
CN110222818B (en) Multi-bank row-column interleaving read-write method for convolutional neural network data storage
CN111242277B (en) Convolutional neural network accelerator supporting sparse pruning based on FPGA design
KR20200098684A (en) Matrix multiplier
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN112465110A (en) Hardware accelerator for convolution neural network calculation optimization
CN110705702A (en) Dynamic extensible convolutional neural network accelerator
CN112633490B (en) Data processing device, method and related product for executing neural network model
CN110991630A (en) Convolutional neural network processor for edge calculation
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
CN110580519A (en) Convolution operation structure and method thereof
CN115423081A (en) Neural network accelerator based on CNN _ LSTM algorithm of FPGA
CN111582465A (en) Convolutional neural network acceleration processing system and method based on FPGA and terminal
CN110222835A (en) A kind of convolutional neural networks hardware system and operation method based on zero value detection
CN111222090B (en) Convolution calculation module, neural network processor, chip and electronic equipment
CN109948787B (en) Arithmetic device, chip and method for neural network convolution layer
CN113128688B (en) General AI parallel reasoning acceleration structure and reasoning equipment
CN114912596A (en) Sparse convolution neural network-oriented multi-chip system and method thereof
CN112346704B (en) Full-streamline type multiply-add unit array circuit for convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant