CN109635944B - Sparse convolution neural network accelerator and implementation method - Google Patents

Sparse convolution neural network accelerator and implementation method

Info

Publication number
CN109635944B
Authority
CN
China
Prior art keywords
neuron
weight
connection weight
connection
unit
Prior art date
Legal status
Active
Application number
CN201811582530.6A
Other languages
Chinese (zh)
Other versions
CN109635944A (en)
Inventor
刘龙军
李宝婷
孙宏滨
梁家华
任鹏举
郑南宁
Current Assignee
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN201811582530.6A
Publication of CN109635944A
Application granted
Publication of CN109635944B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks


Abstract

A sparse convolutional neural network accelerator and an implementation method thereof are disclosed. The connection weights of the sparse network stored in an off-chip DRAM are read into a weight input buffer area, decoded by a weight decoding unit and stored in a weight on-chip global buffer area; the neurons are read into a neuron input buffer area, decoded by a neuron decoding unit and stored in the global buffer area on the neuron sheet; the calculation mode of the PE calculation unit array is determined according to the configuration parameters of the current layer of the neural network, and the decoded and arranged neurons and connection weights are sent to the PE calculation units, which calculate the products of the neurons and the connection weights. In the accelerator, all multipliers in the PE units are replaced by shifters, and all basic modules can be configured according to the network calculation and the hardware resources, so that the accelerator has the advantages of high speed, low power consumption, small resource occupation and high data utilization rate.

Description

Sparse convolution neural network accelerator and implementation method
Technical Field
The invention belongs to the technical field of deep neural network acceleration calculation, and relates to a sparse convolution neural network accelerator and an implementation method thereof.
Background
The superior performance of Deep Neural Networks (DNNs) comes from their ability to obtain efficient representations of the input space from a large amount of data using statistical learning and to extract high-level features from the raw data. However, these algorithms are computationally intensive, and as the number of network layers increases, DNNs demand more and more storage and computing resources, which makes it difficult to deploy them on embedded devices with limited hardware resources and tight power consumption budgets. DNN model compression technology helps to reduce the memory occupation of a network model, reduce the computational complexity and reduce the system power consumption; on the one hand it can improve the operation efficiency of the model, which benefits platforms with higher requirements on speed, such as cloud computing, and on the other hand it facilitates deployment of the model to embedded systems.
Compared with the von Neumann computer architecture, a neural network computer takes neurons as basic computing and storage units to construct a novel distributed storage and computing architecture. The Field Programmable Gate Array (FPGA) not only has the programmability and flexibility of software, but also has the high throughput and low latency characteristics of an Application Specific Integrated Circuit (ASIC), and has become a popular carrier for current DNN hardware accelerator design and application. Most existing neural network model research focuses on improving the network recognition precision; even lightweight network models ignore the computational complexity during hardware acceleration, and floating point data representation causes excessive consumption of hardware resources. Although DNN accelerator schemes are numerous, there are still many problems to be solved or optimized, such as more efficient data multiplexing, the pipeline design of hardware computation, and on-chip memory resource planning.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a high-energy-efficiency sparse convolutional neural network accelerator and an implementation method thereof, so that the DNN connection weights are sparse and friendly to hardware calculation.
In order to realize the purpose, the invention is realized by the following technical scheme:
a sparse convolutional neural network accelerator, comprising: the device comprises an off-chip DRAM, a neuron input buffer area, a neuron decoding unit, a neuron coding unit, a neuron output buffer area, an on-neuron global buffer area, a weight input buffer area, a weight decoding unit, an on-weight global buffer area, a PE (provider edge) computing unit array, an activation unit and a pooling unit; wherein the content of the first and second substances,
the off-chip DRAM is used for storing the data of the compressed neurons and the connection weights;
the neuron input buffer area is used for caching compressed neuron data read from the off-chip DRAM and transmitting the compressed neuron data to the neuron decoding unit;
the neuron decoding unit is used for decoding the compressed neuron data and transmitting the decoded neurons to a global buffer area on a neuron sheet;
the neuron coding unit is used for compressing and coding data in the global buffer area on the neuron sheet and then transmitting the data to the neuron output buffer area;
the neuron output buffer area is used for storing the received compressed neuron data into an off-chip DRAM;
the global buffer area on the neuron sheet is used for caching the decoded neurons and intermediate results obtained by the calculation of the PE array, and carrying out data communication with the PE calculation unit array;
the activation unit is used for activating the result calculated by the PE calculation unit array and then transmitting the result to the pooling unit or the global buffer area on the neuron sheet; the pooling unit is used for downsampling the result after the activation unit is activated;
the weight value input buffer area is used for caching compressed connection weight data read from the off-chip DRAM and transmitting the compressed connection weight data to the weight value decoding unit;
the weight decoding unit is used for decoding the compressed connection weight data and transmitting the decoded connection weight to a global buffer area on a weight chip;
the weight on-chip global buffer area is used for caching the decoded connection weight and carrying out data communication with the PE computing unit array;
and the PE computing unit array is used for computing by adopting a convolution computing mode or a full-connection computing mode.
A method for realizing a sparse convolutional neural network accelerator comprises the steps of preprocessing data, storing the data in a DRAM (dynamic random access memory), reading connection weights of a sparse network stored in the DRAM outside a chip into a weight input buffer area, decoding the connection weights by a weight decoding unit, and storing the connection weights in a weight on-chip global buffer area; reading the neurons stored in the off-chip DRAM into a neuron input buffer area, decoding the read neurons through a neuron decoding unit, and storing the decoded neurons into a global buffer area on a neuron chip; determining a calculation mode of a PE calculation unit array according to configuration parameters of a current layer of a neural network, and sending the decoded arranged neurons and connection weights to corresponding PE calculation units; after entering PE, the Neuron and the connection Weight data are respectively cached in Neuron _ Buffer and Weight _ Buffer, and when Neuron _ Buffer and Weight _ Buffer are fully filled for the first time, the product of the Neuron and the connection Weight is calculated; wherein, Neuron _ Buffer is a Neuron Buffer area in PE, Weight _ Buffer is a Weight Buffer area in PE;
if the result obtained by the calculation of the PE calculation unit array is an intermediate result, temporarily storing the intermediate result into a global buffer area on the neuron sheet, and waiting for the next time of accumulation with other intermediate results calculated by the PE calculation unit array; if the final result of the convolutional layer is obtained, the convolutional layer is sent to an activation unit, then the activation result is sent to a pooling unit, and the activation result is stored back to a global buffer area on the neuron sheet after down-sampling; if the final result of the full connection layer is obtained, the final result is sent to an activation unit and then directly stored back to a global buffer area on the neuron sheet; data stored in the global buffer area on the neuron chip is compressed and coded by a neuron coding unit according to a sparse matrix compression coding storage format and then is sent to a neuron output buffer area, and then is stored back to an off-chip DRAM and is read in when the next layer of calculation is waited; the above steps are circulated until the fully connected layer of the last layer outputs a one-dimensional vector, and the length of the vector is the number of classification categories.
The invention has the further improvement that the specific process of preprocessing the data is as follows:
1) pruning the connection weight to make the neural network sparse;
2) 2ⁿ connection weight clustering: firstly, according to the distribution of the network parameters after pruning, clustering centers of the form 2ⁿ are selected according to the quantization bit width within the range of the parameter set; the Euclidean distance between each remaining non-zero parameter and the clustering centers is calculated, and each non-zero parameter is clustered to the clustering center with the shortest Euclidean distance to it, wherein n is a negative integer, a positive integer or zero;
3) and for the sparse network clustered in the step 2), compressing the connection weight of the sparse network and the characteristic diagram neurons by adopting a sparse matrix compression coding storage format and then storing the compressed sparse network and the characteristic diagram neurons in an off-chip DRAM.
The further improvement of the invention is that the sparse matrix compression coding storage format is specifically as follows: the coding of the convolutional layer characteristic diagram and the full-link layer neurons is divided into three parts, wherein each 5-bit field represents the number of 0s between the current non-zero neuron and the previous non-zero neuron, each 16-bit field represents the numerical value of the non-zero element, and the last 1-bit field is used for marking whether the row of the non-zero element in the convolutional layer characteristic diagram has changed or whether the layer of the non-zero element in the full-link layer has changed; the whole storage unit codes three non-zero elements and their relative position information. If the middle of one storage unit contains the last non-zero element of a certain row of the convolutional layer characteristic diagram or the last non-zero element of a certain layer of input neurons of the full-link layer, the following 5-bit and 16-bit fields are all padded with 0, and the last 1-bit field is set to 1. The coding of the connection weights is also divided into three parts, wherein each 4-bit field represents the number of 0s between the current non-zero connection weight and the previous non-zero connection weight, each 5-bit field represents a non-zero connection weight index, and the last 1-bit field is used for marking whether the non-zero connection weight in the convolutional layer corresponds to the same input characteristic diagram or whether the non-zero connection weight in the full connection layer corresponds to the same output neuron; the whole storage unit codes seven non-zero connection weights and their relative position information. Similarly, if the middle of one storage unit contains the last non-zero connection weight corresponding to the same input feature map in the convolutional layer or the last non-zero connection weight corresponding to the same output neuron in a certain layer of the fully-connected layer, the following 4-bit and 5-bit fields are all padded with 0, and the last 1-bit field is set to 1.
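For illustration, the neuron part of this format can be modelled with the following Python sketch, which packs one feature-map row into (5-bit zero-run, 16-bit value, 1-bit row-end flag) triples, three triples per storage unit. The function name, the list-based representation of a storage unit, and the placement of the row-end flag on the last coded element are assumptions made for the example; zero runs longer than 31 are not handled here.

def encode_neuron_row(row):
    """Pack one feature-map row into (zero_run, value, row_end) triples,
    three triples per storage unit, as a sketch of the format above."""
    coded = []
    zero_run = 0
    for v in row:
        if v == 0:
            zero_run += 1          # count zeros since the previous non-zero neuron
        else:
            coded.append([zero_run, v, 0])
            zero_run = 0
    if coded:
        coded[-1][2] = 1           # mark the last non-zero element of the row
    units = []
    for i in range(0, len(coded), 3):
        group = coded[i:i + 3]
        while len(group) < 3:      # pad the unused 5-bit and 16-bit fields with 0
            group.append([0, 0, 0])
        units.append(group)
    return units

if __name__ == "__main__":
    row = [0, 0, 7, 0, 0, 0, 12, 5, 0, 0, 0, 0, 9]
    for unit in encode_neuron_row(row):
        print(unit)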
The invention has the further improvement that when the calculation mode of the PE calculation unit array is convolution calculation, if the size of a convolution window is k multiplied by k and the sliding step length of the convolution window is s, Neuron _ Buffer and Weight _ Buffer which are filled up respectively store Neuron 1-k and connection Weight 1-k from first to last according to the first-in first-out principle;
if the sliding step length s of the convolution window is equal to 1, directly discarding the neuron 1 after calculating the product of the neuron 1 and the connection Weight 1 in the first clock period, and transferring the connection Weight 1 to the rearmost side of Weight _ Buffer; reading in a new neuron k +1 in a second clock period, simultaneously calculating the product of the neuron 2 and the connection weight 2, and then transferring the neuron 2 and the connection weight 2 to the rearmost of respective buffers; after the product of the neuron 3 and the connection weight 3 is calculated in the third clock cycle, the neuron 3 and the connection weight 3 are all transferred to the rearmost of the respective buffers; in the same way, until the calculation of the PE unit in the kth clock period finishes the multiplication and accumulation of one convolution window line by line, adjusting the sequence of numerical values in the Buffer according to the step size s while outputting the result, and waiting for calculating the next convolution window;
if the step size s is larger than 1, directly discarding the neuron 1 after calculating the product of the neuron 1 and the connection Weight 1 in the first clock cycle, and transferring the connection Weight 1 to the rearmost side of Weight _ Buffer; reading in a new neuron k +1 in a second clock period, directly discarding the neuron 2 after calculating the product of the neuron 2 and the connection Weight 2, and transferring the connection Weight 2 to the rearmost side of Weight _ Buffer; in the same way, until the s clock period, reading in a new neuron k + s-1, directly discarding the neuron s after calculating the product of the neuron s and the connection Weight s, and transferring the connection Weight s to the rearmost side of Weight _ Buffer; reading in a new neuron k + s in the (s + 1) th clock cycle, simultaneously calculating the product of the neuron s +1 and the connection weight s +1, and then transferring the neuron s +1 and the connection weight s +1 to the rearmost part of each Buffer; after the product of the neuron s +2 and the connection weight s +2 is calculated in the (s + 2) th clock cycle, the neuron s +2 and the connection weight s +2 are all transferred to the rearmost of the respective buffers; and repeating the steps until the calculation of the PE unit in the kth clock period finishes the multiplication and accumulation of one convolution window line, adjusting the sequence of numerical values in the Buffer according to the step size s while outputting the result, and waiting for calculating the next convolution window.
The invention has the further improvement that when the calculation mode of the PE calculation unit array is full connection calculation, if the input neurons calculated by each PE unit are m, Neuron _ Buffer and Weight _ Buffer are respectively stored with 1-m neurons and 1-m connection weights from first to last according to the first-in-first-out principle after being filled with the neurons and the Weight _ Buffer; directly discarding the connection weight 1 after calculating the product of the Neuron 1 and the connection weight 1 in a first clock period, and transferring the Neuron 1 to the rearmost of a Neuron _ Buffer; reading in a new connection weight m +1 in a second clock cycle, calculating the product of the Neuron 2 and the connection weight 2, directly discarding the connection weight 2, and transferring the Neuron 2 to the rearmost of the Neuron _ Buffer; and repeating the steps until the calculation of the PE unit in the mth clock period finishes the intermediate result of one output neuron, adjusting the sequence of the numerical values in the Buffer while outputting the result, and waiting for calculating the next group of output neurons.
Compared with the prior art, the invention has the following beneficial effects:
the invention adopts a sparse matrix compression coding storage format, so that the data of different layers of the neural network have a uniform coding format, and the respective computing characteristics of the convolution layer and the full connection layer in the neural network are considered, thereby saving the storage resource to the greatest extent and simplifying the complexity of a coding and decoding circuit during hardware design. The invention also carries out corresponding optimization design aiming at the data stream cached by the neuron and the connection weight in the PE unit, provides a data stream mode in the PE with high data reuse rate, and ensures that the calculation is more efficient while reducing the storage space. The invention has the advantages of flexible operation, simple coding and decoding and better compression effect, each input neuron adopts 16-bit fixed point representation, and the connection weight adopts 5-bit representation.
Furthermore, after the data preprocessing process in the invention, the remaining non-zero connection weights of the neural network become 2ⁿ or combinations of 2ⁿ terms; therefore, the multipliers in the PE calculation units in the present invention are optimized into more resource-efficient shifters.
Further, the invention includes pruning, 2ⁿ clustering and compression coding. The compressed LeNet, AlexNet and VGG respectively reduce the storage consumption by about 222 times, 22 times and 68 times compared with the original network models. In the accelerator, all multipliers in the PE unit are replaced by shifters, and all basic modules can be configured according to network calculation and hardware resources, so that the accelerator has the advantages of high speed, low power consumption, small resource occupation and high data utilization rate.
Drawings
FIG. 1 is an overall flow diagram of neural network model preprocessing.
FIG. 2 is the flow chart of 2ⁿ connection weight clustering.
Fig. 3 is a compression encoding storage format.
FIG. 4 is an overall architecture diagram of the neural network hardware accelerator of the present invention.
FIG. 5 is a schematic diagram of a PE array configured in a convolution calculation mode.
FIG. 6 is a schematic diagram of data flow distribution when the PE array is configured in a convolution calculation mode.
FIG. 7 is a diagram illustrating a PE array configured in a fully-connected computing mode.
FIG. 8 is a schematic diagram of data flow allocation when the PE array is configured in a fully-connected computing mode.
FIG. 9 is a diagram illustrating the internal structure of a PE unit and the basic column unit of a PE array.
FIG. 10a is a data flow of neurons and connection weights in PE when convolution calculation.
FIG. 10b is a data flow of neurons and connection weights in PE for full connection computation.
FIG. 11 is a configurable layout of the present invention.
Detailed Description
The technical solution of the present application will be clearly and completely described below with reference to the accompanying drawings.
The connection weight in the present invention is also referred to as a weight.
Referring to fig. 4, the neural network accelerator of the present invention includes: the device comprises an off-chip DRAM, a neuron input buffer area, a neuron decoding unit, a neuron coding unit, a neuron output buffer area, a weight input buffer area, a weight decoding unit, a neuron on-chip global buffer area, a weight on-chip global buffer area, a PE computing unit array, an activation unit and a pooling unit.
And the off-chip DRAM is used for storing the data of the compressed neurons and the connection weights.
And the neuron input buffer is used for caching the compressed neuron data read from the off-chip DRAM and transmitting the data to the neuron decoding unit.
And the neuron decoding unit is used for decoding the compressed neuron data and transmitting the decoded neurons to a global buffer area on a neuron sheet.
And the weight value input buffer area is used for buffering compressed connection weight data read from the off-chip DRAM and transmitting the compressed connection weight data to the weight value decoding unit.
And the weight decoding unit is used for decoding the compressed connection weight data and transmitting the decoded connection weight to the global buffer area on the weight chip.
And the neuron coding unit is used for compressing and coding the calculated data in the global buffer zone on the neuron chip and then transmitting the data to the neuron output buffer zone.
And the neuron output buffer is used for storing the received compressed neuron data into an off-chip DRAM.
And the global buffer area on the neuron sheet is used for caching the decoded neurons and intermediate results obtained by the calculation of the PE array, and carrying out data communication with the PE calculation unit array.
And the global buffer area on the weight piece is used for buffering the decoded connection weight and carrying out data communication with the PE computing unit array.
And the activation unit is used for activating the result calculated by the PE calculation unit array and transmitting the result to the pooling unit or the global buffer area on the neuron sheet. Common activation functions include Sigmoid, Tanh, ReLU, and the like, wherein the Sigmoid function is the most commonly used one in a typical convolutional neural network, and the ReLU function is the most frequently used activation function at present, because the activation function can enable a training result to be rapidly converged and output a lot of zero values. For the Sigmoid activated function, the hardware module design of the Sigmoid function is realized by selecting a piecewise linear approximation mode. The basic idea of piecewise linear approximation is to approximate the activation function itself with a series of broken line segments. During specific implementation, the parameters k and b of each section of broken line can be stored in a lookup table, and the values between breakpoints are obtained by substituting the corresponding parameters k and b according to the formula (1).
y=kx+b (1)
It should be noted that the Sigmoid function is centrosymmetric about the (0,0.5) point, so that only the parameters of the fitted line segment of the positive half axis of the x-axis need to be stored, and the function value of the negative half axis of the x-axis can be obtained through calculation of 1-f (x). Since the value of f (x) is already very close to 1 when x >8, the present invention only fits the part of x from 0 to 8, and the activation function outputs are all 1 when x is greater than 8. Firstly, comparing an input x value with a segment broken line x interval to obtain a corresponding interval index, then obtaining calculation parameters k and b of a corresponding line segment from a lookup table according to the index, and finally obtaining a final activation output result through calculation of a multiplier and an adder.
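A behavioural sketch of this piecewise linear Sigmoid is given below, assuming 8 equal segments on [0, 8]; the segment count and the way the k/b lookup table is filled are assumptions for the example, while the fitting range, the output of 1 beyond x = 8 and the use of 1 - f(x) for the negative half-axis follow the description above.

import math

SEGMENTS = 8                      # assumed: one fitted line per unit interval on [0, 8]
KB_TABLE = []                     # lookup table of (k, b) per segment, fitted at the interval endpoints
for i in range(SEGMENTS):
    x0, x1 = float(i), float(i + 1)
    y0 = 1.0 / (1.0 + math.exp(-x0))
    y1 = 1.0 / (1.0 + math.exp(-x1))
    k = (y1 - y0) / (x1 - x0)
    b = y0 - k * x0
    KB_TABLE.append((k, b))

def sigmoid_pwl(x):
    """Piecewise linear Sigmoid: y = k*x + b taken from a lookup table."""
    if x < 0:                      # centrosymmetric about (0, 0.5): f(x) = 1 - f(-x)
        return 1.0 - sigmoid_pwl(-x)
    if x >= SEGMENTS:              # f(x) is already very close to 1 for x > 8
        return 1.0
    k, b = KB_TABLE[int(x)]        # interval index -> fitted line parameters
    return k * x + b

if __name__ == "__main__":
    for x in (-3.0, 0.5, 2.0, 9.0):
        print(x, round(sigmoid_pwl(x), 4), round(1.0 / (1.0 + math.exp(-x)), 4))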
And the pooling unit is used for down-sampling the result after the activation unit is activated. In a convolutional neural network, after an activation function activates a convolution calculation result, pooling layer down-sampling is performed. Common pooling ways are average pooling, maximum pooling, spatial pyramid pooling, etc. The present invention designs hardware for the most common max-pooling and average pooling. And the maximum pooling only needs to send all convolution results needing pooling into a comparator for comparison, and the maximum value is the maximum pooling output result. When the average pooling is calculated, all convolution results needing pooling are firstly sent to an adder for accumulation, and then the accumulated result is divided by the number of all input convolution values to obtain an average pooling output result. In the average pooling, the most common pooling window size is generally 2, and in designing a hardware circuit, the division operation of the last step of the average pooling can be replaced by a shift operation, that is, when the pooling window size is 2, the accumulation result only needs to be shifted to the right by two bits.
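The two pooling modes can be modelled as below; the right-shift in place of the division assumes a 2×2 window (four values, so dividing by 4 becomes a shift right by two bits, as noted above) and integer inputs. The function names are chosen for the example.

def max_pool(window):
    """Max pooling: feed all values through a comparator chain and keep the largest."""
    result = window[0]
    for v in window[1:]:
        if v > result:
            result = v
    return result

def avg_pool_2x2(window):
    """Average pooling for a 2x2 window: accumulate, then shift right by 2
    instead of dividing by 4 (integer arithmetic, as in the hardware)."""
    assert len(window) == 4
    acc = 0
    for v in window:
        acc += v
    return acc >> 2

if __name__ == "__main__":
    w = [3, 8, 5, 6]
    print(max_pool(w), avg_pool_2x2(w))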
And the PE computing unit array can be configured in a convolution computing mode or a full-connection computing mode. The PE computing unit array comprises a plurality of PE units, and each column of PE units is called a basic column unit. The main input ports of each PE unit are input neurons (16bits) and the connection weights (5bits) corresponding to the input neurons; after entering the PE unit, the two data are respectively cached in Neuron _ Buffer (the neuron buffer area in the PE) and Weight _ Buffer (the weight buffer area in the PE), and the product of the neuron and the connection weight begins to be calculated after Neuron _ Buffer and Weight _ Buffer are filled for the first time. After the data preprocessing process in the invention, the remaining non-zero connection weights of the neural network become 2ⁿ or combinations of 2ⁿ terms; therefore, the multipliers in the PE calculation units in the present invention are optimized into more resource-efficient shifters.
Referring to fig. 9, the neuron and the connection weight are read out from Neuron _ Buffer and Weight _ Buffer, respectively, and it is first determined whether the neuron and the connection weight are 0; if one or both of the two data are 0, the calculation enable signal en is set to 0, otherwise it is set to 1. If the enable signal en is 0, the shift operation is not performed, and the shift calculation result is directly set to 0 and sent to the adder for accumulation. Count _ out is used for counting the number of shift-accumulate operations performed by the PE unit; during convolution calculation one PE unit calculates the shift-accumulation of one row of each convolution window, and during full-connection calculation one PE unit calculates the shift-accumulation of one output neuron connected with one group of input neurons. The output result obtained by a basic column unit of the PE calculation unit array is stored in Regfile, the addition operand is selected by a data selector (Select) and sent to the adder corresponding to that column of PE units for accumulation, and the corresponding offset value bias is added to obtain the final column output result Column _ out.
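A minimal behavioural model of one PE unit as described is sketched below. The representation of a clustered weight as a (shift amount, sign) pair and the method names are assumptions for this sketch; the zero-skip enable signal and the forced-zero accumulation follow the description above.

class PEUnit:
    """Behavioural sketch of one PE: shift-accumulate with zero skipping."""

    def __init__(self):
        self.acc = 0
        self.count_out = 0        # number of shift-accumulate steps performed

    def step(self, neuron, weight_exp, weight_sign):
        """One clock: the weight is sign * 2**weight_exp; sign 0 means a zero weight."""
        en = 1 if (neuron != 0 and weight_sign != 0) else 0
        if en:
            if weight_exp >= 0:
                partial = neuron << weight_exp      # multiply by 2^n via a left shift
            else:
                partial = neuron >> (-weight_exp)   # negative n: right shift
            partial *= weight_sign
        else:
            partial = 0                             # en == 0: shift result forced to 0
        self.acc += partial                         # accumulate in the adder
        self.count_out += 1
        return self.acc

if __name__ == "__main__":
    pe = PEUnit()
    # neuron stream and 2^n weights: +2^1, 0 (skipped), +2^-1, -2^2
    for neuron, (exp, sign) in zip([3, 7, 4, 2], [(1, 1), (0, 0), (-1, 1), (2, -1)]):
        pe.step(neuron, exp, sign)
    print(pe.acc, pe.count_out)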
Because the data multiplexing and the calculation of the convolution calculation mode and the full-connection calculation mode are different, the specific data flow modes in the PE unit aiming at the convolution calculation process and the full-connection calculation process are slightly different.
In the convolution calculation process, a Convolution Neural Network (CNN) is a forward propagation network structure formed by stacking convolution layers with similar structures, interlayer calculation has independence and each layer of calculation has high similarity, so that the characteristic of convolution layer calculation can be abstractly integrated to design a general convolution calculation mode of a PE calculation unit array.
In the full-connection calculation process, the invention designs a parameterization configurable full-connection calculation mode of the PE calculation unit array after simulating the calculation characteristics of a full-connection layer and a convolution layer in a neural network and considering a series of practical conditions such as full-connection layer structures, data calculation pipelines, system clock cycles, hardware resources and the like in the convolution neural networks with different structures.
Referring to fig. 11, the sizes of convolution kernels, the number of input feature maps, the number of output feature maps, the number of input neurons and the number of output neurons of different fully-connected layers of various convolution neural network models are different, and in order to make the arithmetic unit have stronger universality, the configurable design of the neural network hardware accelerator is optimized by respectively selecting a convolution calculation mode, a fully-connected calculation mode, an activation function, a pooling mode and the like. The configurable design of key modules such as the PE computing unit array, the activation unit and the pooling unit is realized by parameterizing some key feature sizes of the convolutional neural network.
Table 1 lists the configuration parameters and parameter meanings of the basic units in some key modules of the code implementation. When estimate_layer is 0, the current PE array needs to calculate the first convolutional layer; 1 represents that the current PE array needs to calculate a middle convolutional layer, 2 represents that the pooling layer is currently calculated, and 3 represents that the current PE array needs to calculate a fully-connected layer. When connet_mu_or_ad_PE is 0, the PE unit performs multiplication; 1 means the PE unit performs addition, and 2 means multiply-accumulate. When active_type is 0, the Sigmoid activation function is selected, and when active_type is 1 the ReLU activation function is selected. When pooling_type is 0, the maximum pooling mode is selected, and 1 represents the average pooling mode. These parameters can be configured in the top-level file of the program; a user can modify the configuration file according to the input and output sizes of the different computing layers in the convolutional neural network, the selection of the activation function and the selection of the pooling mode, build the desired network model, and quickly generate a neural network mapping of the corresponding scale and requirements on hardware.
Table 1 code implementation configuration parameters and parameter meanings of base units in some key modules
Parameter            Meaning of its values
estimate_layer       0: first convolutional layer; 1: middle convolutional layer; 2: pooling layer; 3: fully-connected layer
connet_mu_or_ad_PE   0: the PE unit performs multiplication; 1: addition; 2: multiply-accumulate
active_type          0: Sigmoid activation function; 1: ReLU activation function
pooling_type         0: maximum pooling mode; 1: average pooling mode
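For illustration, a layer configuration using the parameters of Table 1 might look as follows; the surrounding dataclass and the example values are assumptions, and only the four parameter names and their encodings come from the table.

from dataclasses import dataclass

@dataclass
class LayerConfig:
    estimate_layer: int       # 0: first conv layer, 1: middle conv layer, 2: pooling, 3: fully connected
    connet_mu_or_ad_PE: int   # 0: multiply, 1: add, 2: multiply-accumulate
    active_type: int          # 0: Sigmoid, 1: ReLU
    pooling_type: int         # 0: max pooling, 1: average pooling

# example: a middle convolutional layer using ReLU activation and max pooling
conv_layer = LayerConfig(estimate_layer=1, connet_mu_or_ad_PE=2, active_type=1, pooling_type=0)
print(conv_layer)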
An implementation method based on the sparse convolutional neural network accelerator comprises the following steps:
the data preprocessing comprises the following steps:
step 1), pruning the connection weight to make the neural network sparse;
step 2), 2ⁿ connection weight clustering: firstly, according to the distribution of the network parameters after pruning, clustering centers of the form 2ⁿ are selected according to the quantization bit width within the range of the parameter set; the Euclidean distance between each remaining non-zero parameter and the clustering centers is calculated, and each non-zero parameter is clustered to the clustering center with the shortest Euclidean distance to it, wherein n can be a negative integer, a positive integer or zero;
and 3) compressing the connection weight of the sparse network and the characteristic diagram neurons in the sparse network clustered in the step 2) by adopting a sparse matrix compression coding storage format and then storing the compressed connection weight and characteristic diagram neurons in an off-chip DRAM. The specific storage format is as follows: the coding of the convolutional layer characteristic diagram and the full-link layer neuron is divided into three parts, wherein each 5-bit represents the number of 0 between the current non-zero neuron and the last non-zero neuron, each 16-bit represents the numerical value of the non-zero element, the last 1-bit is used for marking whether the non-zero element in the convolutional layer characteristic diagram is subjected to line swapping or not or whether the non-zero element in the full-link layer is subjected to layer swapping or not, and the whole storage unit can code three non-zero elements and the relative position information of the three non-zero elements. If the middle of one storage unit contains the last non-zero element of a certain row of the convolutional layer characteristic diagram or the last non-zero element of a certain layer of input neuron of the full-link layer, the following 5-bit and 16-bit are all complemented by 0, and the last 1-bit is marked in the position 1. The coding of the connection weight is also divided into three parts, wherein each 4-bit represents the number of 0 between the current non-zero connection weight and the last non-zero connection weight, each 5-bit represents a non-zero connection weight index, the last 1-bit is used for marking whether the non-zero connection weight in the convolutional layer corresponds to the same input characteristic diagram or whether the non-zero connection weight in the full connection layer corresponds to the same output neuron, and the whole storage unit can code seven non-zero connection weights and the relative position information thereof. Similarly, if the middle of one storage unit contains the last non-zero connection weight corresponding to the same input feature map in the convolutional layer or the last non-zero connection weight corresponding to the same output neuron in a certain layer of the fully-connected layer, the following 4-bit and 5-bit are all complemented by 0, and the last 1-bit is marked with a position 1.
Step 4), reading the connection weight of the sparse network stored in the off-chip DRAM into a weight input buffer area, decoding the connection weight by a weight decoding unit, and storing the connection weight in a weight on-chip global buffer area; reading the neurons stored in the off-chip DRAM into a neuron input buffer area, decoding the read neurons through a neuron decoding unit, and storing the decoded neurons into a global buffer area on a neuron chip; determining a calculation mode of a PE calculation unit array according to configuration parameters (see table 1) of a current layer of a neural network, and sending decoded arranged neurons and connection weights to corresponding PE calculation units; the two data are respectively cached in Neuron _ Buffer and Weight _ Buffer after entering PE, and when Neuron _ Buffer and Weight _ Buffer are filled for the first time, the product of Neuron and connection Weight is calculated.
When the calculation mode of the PE calculation unit array is configured as convolution calculation, if the size of a convolution window is k multiplied by k and the sliding step length of the convolution window is s, Neuron _ buffers and Weight _ buffers which are filled up respectively store neurons 1 to k and connection weights 1 to k from first to last according to the first-in first-out principle.
If the sliding step length s of the convolution window is equal to 1, directly discarding the neuron 1 after calculating the product of the neuron 1 and the connection Weight 1 in the first clock period, and transferring the connection Weight 1 to the rearmost side of Weight _ Buffer; reading in a new neuron k +1 in a second clock period, simultaneously calculating the product of the neuron 2 and the connection weight 2, and then transferring the neuron 2 and the connection weight 2 to the rearmost of respective buffers; after the product of the neuron 3 and the connection weight 3 is calculated in the third clock cycle, the neuron 3 and the connection weight 3 are all transferred to the rearmost of the respective buffers; and repeating the steps until the calculation of the PE unit in the kth clock period finishes the multiplication and accumulation of one convolution window line, adjusting the sequence of numerical values in the Buffer according to the step size s while outputting the result, and waiting for calculating the next convolution window.
If the step size s is larger than 1, directly discarding the neuron 1 after calculating the product of the neuron 1 and the connection Weight 1 in the first clock cycle, and transferring the connection Weight 1 to the rearmost side of Weight _ Buffer; reading in a new neuron k +1 in a second clock period, directly discarding the neuron 2 after calculating the product of the neuron 2 and the connection Weight 2, and transferring the connection Weight 2 to the rearmost side of Weight _ Buffer; in the same way, until the s clock period, reading in a new neuron k + s-1, directly discarding the neuron s after calculating the product of the neuron s and the connection Weight s, and transferring the connection Weight s to the rearmost side of Weight _ Buffer; reading in a new neuron k + s in the (s + 1) th clock cycle, simultaneously calculating the product of the neuron s +1 and the connection weight s +1, and then transferring the neuron s +1 and the connection weight s +1 to the rearmost part of each Buffer; after the product of the neuron s +2 and the connection weight s +2 is calculated in the (s + 2) th clock cycle, the neuron s +2 and the connection weight s +2 are all transferred to the rearmost of the respective buffers; and repeating the steps until the calculation of the PE unit in the kth clock period finishes the multiplication and accumulation of one convolution window line, adjusting the sequence of numerical values in the Buffer according to the step size s while outputting the result, and waiting for calculating the next convolution window.
When the calculation mode of the PE calculation unit array is configured to be full connection calculation, if each PE unit calculates m input neurons, Neuron _ buffers and Weight _ buffers are filled and then store neurons 1 to m and connection weights 1 to m from first to last according to a first-in first-out principle. Directly discarding the connection weight 1 after calculating the product of the Neuron 1 and the connection weight 1 in a first clock period, and transferring the Neuron 1 to the rearmost of a Neuron _ Buffer; reading in a new connection weight m +1 in a second clock cycle, calculating the product of the Neuron 2 and the connection weight 2, directly discarding the connection weight 2, and transferring the Neuron 2 to the rearmost of the Neuron _ Buffer; and repeating the steps until the calculation of the PE unit in the mth clock period finishes the intermediate result of one output neuron, adjusting the sequence of the numerical values in the Buffer while outputting the result, and waiting for calculating the next group of output neurons.
If the result obtained by the calculation of the PE calculation unit array is an intermediate result, temporarily storing the intermediate result into a global buffer area on the neuron sheet, and waiting for the next time of accumulation with other intermediate results calculated by the PE calculation unit array; if the final result of the convolutional layer is obtained, the convolutional layer is sent to an activation unit, then the activation result is sent to a pooling unit, and the activation result is stored back to a global buffer area on the neuron sheet after down-sampling; if the final result is the final result of the full connection layer, the final result is sent to an activation unit and then directly stored back to the global buffer area on the neuron chip. And the data stored in the global buffer area on the neuron chip is compressed and coded by the neuron coding unit according to a sparse matrix compression coding storage format and then is sent to a neuron output buffer area, and then is stored back to the off-chip DRAM and is read in when the next layer of calculation is waited. The above steps are circulated until the fully connected layer of the last layer outputs a one-dimensional vector, and the length of the vector is the number of classification categories.
The invention has the advantages of flexible operation, simple coding and decoding and better compression effect, each input neuron adopts 16-bit fixed point representation, and the connection weight adopts 5-bit representation.
FIG. 1 is the overall flow chart of neural network model preprocessing. The first step is called pruning of the connection weights: connections with smaller connection weights are removed from a trained model, and the network is then retrained to restore the recognition accuracy to the initial state. The second step is called 2ⁿ connection weight clustering: a clustering operation with fixed clustering centers is performed on the model trained in the first step, and all non-zero connection weights are replaced by the clustering centers. The third step is called connection weight compression coding.
FIG. 2 is the specific flow chart of 2ⁿ connection weight clustering in the present invention. After the pruned sparse network is obtained, it is trained until the recognition accuracy no longer rises; then, according to the parameter distribution at that moment, N clustering centers of the form 2ⁿ are selected, and every weight is clustered to the clustering center with the minimum Euclidean distance to it. Whether the network accuracy is lost is then judged; if it is lost, training is continued until the accuracy no longer rises, and so on until the network accuracy is not lost after clustering.
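A small sketch of the 2ⁿ clustering step follows: cluster centers of the form ±2ⁿ are generated for a given exponent range, and each non-zero weight is mapped to the nearest center (plain absolute distance, which for scalars equals the Euclidean distance). The chosen exponent range and function names are assumptions for the example.

def power_of_two_centers(n_min=-4, n_max=2):
    """Cluster centers of the form +/-2^n, n from n_min to n_max (assumed range)."""
    centers = []
    for n in range(n_min, n_max + 1):
        centers.append(2.0 ** n)
        centers.append(-(2.0 ** n))
    return centers

def cluster_weights(weights, centers):
    """Snap every non-zero weight to the nearest cluster center; pruned zeros stay zero."""
    clustered = []
    for w in weights:
        if w == 0.0:
            clustered.append(0.0)
        else:
            clustered.append(min(centers, key=lambda c: abs(c - w)))
    return clustered

if __name__ == "__main__":
    pruned = [0.0, 0.37, -0.12, 0.0, 1.8, -0.55]
    print(cluster_weights(pruned, power_of_two_centers()))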
Fig. 3 shows the sparse matrix compression coding storage format of the present invention.
FIG. 4 shows the overall architecture of a neural network hardware accelerator, which includes an off-chip DRAM, a neuron input buffer, a neuron decoding unit, a neuron encoding unit, a neuron output buffer, a weight input buffer, a weight decoding unit, an on-neuron global buffer, an on-weight on-chip global buffer, a PE computing unit array, an activation unit, and a pooling unit.
FIG. 5 is a schematic diagram of the PE array configured in the convolution calculation mode, in which the dotted frame is an array of m rows and n columns of PE units; each column of PE units is referred to as a basic column unit, and an addition operand selector (SEL) is provided for selecting whether the calculation result of the PE unit in the corresponding column is fed to an accumulator for accumulation. All the PE units in fig. 5 correspond to the same input feature map: the PE units in the same row receive the neurons of the same row of the input feature map and the convolution kernel connection weights of the same row corresponding to different output feature maps, while the PE units in the same column receive the neurons of the different rows of the input feature map that need convolution calculation and the connection weights of the different rows of the same convolution kernel corresponding to one output feature map. Finally, t consecutive neurons at the same position of the n output feature maps are calculated at the same time, where t = m/k and k is the size of a convolution kernel.
Fig. 6 illustrates the specific data flow allocation of the PE array in the convolution calculation mode of the present invention, where the left side shows the convolution kernel corresponding to the first output feature map and the input feature map that needs to be subjected to convolution calculation; it is assumed that the size k of the convolution kernel is 5 and the size m of the PE array is 5. The 5 connection weights of each row of the convolution kernel corresponding to the first output feature map are denoted as Wr1 to Wr5, and these 5 groups of connection weights are respectively sent to the first column of the PE array. That is, the 5 columns of PE units in the example need to receive the different convolution kernels of the 5 output feature maps, respectively. The input feature map is of size N×N, its N rows of neuron data are denoted as Row1 to RowN, and the neuron data that needs to be input into the PE units in each convolution calculation is denoted as Nk1 to Nk5. These 5 groups of input neuron data are respectively sent to the PE units in the same row of the PE array. That is, the first row of PE units shares the Nk1 neuron data, the second row of PE units shares the Nk2 neuron data, and so on. Finally, the 5×5 PE array in the example simultaneously calculates 1 output neuron value at the same position of the 5 output feature maps.
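The mapping of Fig. 6 can be checked with a short functional simulation: each of the 5 PE rows holds one row of the input window, each of the 5 PE columns holds the kernel of one output feature map, and accumulating a column over its rows gives one output neuron per output map. The random data and the nested-loop modelling of the array are assumptions of this sketch; it illustrates the data distribution, not the cycle-level pipeline.

import random

K = 5                                   # convolution kernel size (k = 5 in the example)
random.seed(0)
window = [[random.randint(0, 3) for _ in range(K)] for _ in range(K)]      # one k x k input window
kernels = [[[random.randint(-1, 1) for _ in range(K)] for _ in range(K)]   # one kernel per output map
           for _ in range(K)]           # 5 output feature maps -> 5 PE columns

# PE(row, col) multiplies window row `row` with row `row` of kernel `col`
pe_row_results = [[sum(w * c for w, c in zip(window[row], kernels[col][row]))
                   for col in range(K)]
                  for row in range(K)]

# each column accumulates its m PE results into one output neuron value
column_out = [sum(pe_row_results[row][col] for row in range(K)) for col in range(K)]

# reference: direct convolution of the window with each kernel
reference = [sum(window[i][j] * kernels[col][i][j] for i in range(K) for j in range(K))
             for col in range(K)]
print(column_out == reference, column_out)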
FIG. 7 is a schematic diagram of the general PE array in the fully-connected computing mode, in which the dotted frame is an array of m rows and n columns of PE units; each column of PE units is referred to as a basic column unit, and an addition operand selector (SEL) is provided for selecting whether the computation results of the PE units in the corresponding column are fed to an accumulator for accumulation. The invention finally selects a parallel mode in which all input neurons and output neurons are grouped and then calculated group by group, so as to realize maximum data multiplexing. Therefore, all the PE units in fig. 7 correspond to different parts of all the input neurons: the PE units in the same row receive the same group of input neurons and the connection weights of the different output neurons connected to it, while the PE units in the same column receive different groups of input neurons and the connection weights of the different output neurons connected to them, and finally n consecutive output neurons are obtained by simultaneous calculation.
Fig. 8 illustrates the specific data flow allocation of the PE array in the full-connection calculation mode of the present invention, where the left side is a fully-connected layer; assuming 20 input neurons and 10 output neurons, the connection weights of this fully-connected layer are 200 in total, and the PE array size is m = 4, n = 5. All input neurons are divided into 4 groups of 5 input neurons each, denoted as N1 to N4, and each row of the PE array shares the same group of input neurons. That is, in the example the 4 rows of PE units receive the corresponding 4 groups of input neurons respectively: the first row of PE units shares the N1 neuron data, the second row of PE units shares the N2 neuron data, and so on, until the m-th row of PE units shares the Nm neuron data. Since the number of columns in the PE array is 5, all output neurons are divided into 2 groups, denoted as F1 and F2, and 5 output neurons are calculated per group. The connection weights between the N1 to N4 groups of input neurons and the first output neuron of group F1 are denoted as W111 to W114, the connection weights to the second output neuron of group F1 are denoted as W121 to W124, and so on, up to the connection weights to the fifth output neuron of group F2, denoted as W251 to W254; there are 40 groups of connection weights in total, 5 weights per group. The 40 groups of connection weights are respectively sent to the computing units in the PE array: the first column of PE units respectively receives W111 to W114, the second column of PE units respectively receives W121 to W124, and so on, until the fifth column of PE units respectively receives the connection weights W151 to W154. Finally, the 4×5 PE array in the example simultaneously calculates the 5 output neuron values of group F1.
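Likewise, the grouping of Fig. 8 can be sketched as follows: the 20 input neurons are split into m = 4 groups of 5, the 10 output neurons into 2 groups of n = 5, and PE(row, col) computes the dot product of input group `row` with the weight slice connecting it to output neuron `col` of the current output group; accumulating each column over its m rows yields 5 output neurons at a time. The random data and loop structure are assumptions of this functional sketch.

import random

M, N, GROUP = 4, 5, 5                    # PE rows, PE columns, input neurons per group
random.seed(1)
inputs = [random.randint(-2, 2) for _ in range(M * GROUP)]          # 20 input neurons
weights = [[random.randint(-1, 1) for _ in range(M * GROUP)]        # 10 output neurons x 20 weights
           for _ in range(2 * N)]

input_groups = [inputs[g * GROUP:(g + 1) * GROUP] for g in range(M)]   # N1..N4

outputs = []
for out_group in range(2):               # group F1, then group F2
    for col in range(N):                 # one output neuron per PE column
        neuron_idx = out_group * N + col
        col_sum = 0
        for row in range(M):             # each PE row handles one input group
            w_slice = weights[neuron_idx][row * GROUP:(row + 1) * GROUP]
            col_sum += sum(x * w for x, w in zip(input_groups[row], w_slice))
        outputs.append(col_sum)

reference = [sum(x * w for x, w in zip(inputs, weights[j])) for j in range(2 * N)]
print(outputs == reference, outputs)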
Fig. 9 shows the internal structure of a single PE compute unit and the data flow between the elementary column units of m PE compute units in the PE array. The entire PE array is ultimately made up of n such elementary column units. As can be seen from the figure, the main input ports of each PE unit are input neurons (16bits) and connection weights (5bits) corresponding to the input neurons, and the two data are respectively cached in Neuron _ Buffer and Weight _ Buffer after entering the PE unit. In addition, there are data input valid signals, PE unit calculation enable signals, etc., which are not shown in the figure.
Fig. 10a and 10b show examples of the data flows of the PE units in the convolution calculation mode and the full-connection calculation mode, respectively. It is assumed that the configured convolution window size in the convolution calculation mode is 5 × 5 with a sliding step of 1, that each PE unit in the full-connection calculation mode calculates 5 input neurons, and that the calculation starts after Neuron _ Buffer and Weight _ Buffer are filled.
The data flow of the convolution calculation mode is shown in fig. 10a, the product of neuron 1 and connection Weight 1 is calculated in the first clock cycle, then neuron 1 is directly discarded, and connection Weight 1 is transferred to the back of Weight _ Buffer; reading in a new neuron 6 in a second clock cycle, calculating the product of the neuron 2 and the connection weight 2, and then transferring the neuron 2 and the connection weight 2 to the rearmost part of each Buffer; after the product of the neuron 3 and the connection weight 3 is calculated in the third clock cycle, the neuron 3 and the connection weight 3 are all transferred to the rearmost of the respective buffers; and repeating the steps until the calculation of the PE unit in the fifth clock period finishes the multiplication and accumulation of one convolution window and one row, adjusting the sequence of numerical values in the Buffer while outputting the result, and waiting for calculating the next convolution window.
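The buffer rotation of Fig. 10a can be reproduced with two small FIFOs modelled as deques. The sketch below follows the stride-1 case above (window size 5) and prints which neuron/weight pair is multiplied in each clock cycle; the end-of-window reordering "according to the step size s" is modelled by a single rotation of Neuron_Buffer, and the timing of the read-in is an approximation of the cycle-by-cycle description.

from collections import deque

K, S = 5, 1                              # convolution window size k and sliding step s (Fig. 10a)
neurons = list(range(1, 12))             # neurons 1, 2, ... of one input feature-map row
weights = [f"w{i}" for i in range(1, K + 1)]

neuron_buf = deque(neurons[:K])          # Neuron_Buffer first filled with neurons 1..k
weight_buf = deque(weights)              # Weight_Buffer first filled with weights 1..k
next_new = K                             # index of the next neuron to be read in

for window in range(3):                  # three consecutive windows of the same row
    for cycle in range(1, K + 1):
        if 2 <= cycle <= S + 1 and next_new < len(neurons):
            neuron_buf.append(neurons[next_new])   # read in a new neuron this cycle
            next_new += 1
        n = neuron_buf.popleft()
        w = weight_buf.popleft()
        print(f"window {window}, cycle {cycle}: neuron {n} * {w}")
        weight_buf.append(w)             # the weight is always moved to the back for reuse
        if cycle > S:
            neuron_buf.append(n)         # neurons after the first s are reused next window
    neuron_buf.rotate(-S)                # adjust the buffer order according to the step s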
Fig. 10b shows a data flow example of the full-connection calculation mode, in which after calculating the product of Neuron 1 and connection weight 1 in the first clock cycle, connection weight 1 is directly discarded, and Neuron 1 is transferred to the rearmost of Neuron _ Buffer; reading in new connection weight 6 in a second clock cycle, calculating the product of the Neuron 2 and the connection weight 2, directly discarding the connection weight 2, and transferring the Neuron 2 to the rearmost of the Neuron _ Buffer; and repeating the steps until the PE unit completes the calculation of the intermediate result of one output neuron in the fifth clock period, adjusting the sequence of the numerical values in the Buffer while outputting the result, and waiting for calculating the next group of output neurons.
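The fully-connected counterpart of Fig. 10b swaps the roles of the two buffers: neurons are reused while weights are consumed. A sketch under the same assumptions (5 input neurons per PE, weights for two output neurons) is given below; variable names are chosen for the example.

from collections import deque

M = 5                                     # input neurons handled by each PE unit (Fig. 10b)
neurons = list(range(1, M + 1))           # neurons 1..5, reused for every output neuron
weights = [f"w{i}" for i in range(1, 2 * M + 1)]   # connection weights for two output neurons

neuron_buf = deque(neurons)               # Neuron_Buffer filled with neurons 1..m
weight_buf = deque(weights[:M])           # Weight_Buffer filled with weights 1..m
next_new = M

for out_neuron in range(2):               # intermediate results of two output neurons
    for cycle in range(1, M + 1):
        if cycle >= 2 and next_new < len(weights):
            weight_buf.append(weights[next_new])   # read in a new connection weight
            next_new += 1
        n = neuron_buf.popleft()
        w = weight_buf.popleft()
        print(f"output {out_neuron}, cycle {cycle}: neuron {n} * {w}")
        neuron_buf.append(n)              # the neuron is moved to the back and reused
        # the connection weight is discarded after use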
FIG. 11 is a configurable design of the present invention. In the convolution calculation mode, the size of the PE array can be configured according to the size of a convolution kernel and the number of output characteristic graphs, and the number of PE arrays which can be calculated in parallel can be configured according to the available hardware resources and the number of input characteristic graphs. Each PE array corresponds to a convolution calculation of an input feature map, and the corresponding adder array is used to accumulate the output values of the PE units and the corresponding bias values bias. When the number of input channels is not 1, the output neurons calculated by each PE array are intermediate results of the corresponding output feature map, and therefore, an adder array is required in addition to the PE arrays to accumulate the intermediate results calculated by all the PE arrays, and the output of the adder array is the final output result of the corresponding output feature map neurons.
The invention designs a high-energy-efficiency, parameterized and configurable hardware accelerator according to an effective neural network model compression algorithm and aiming at the characteristics of the compressed sparse convolutional neural network data. The invention provides a hardware-friendly neural network model compression algorithm applied during off-line training of the model, which comprises pruning, 2ⁿ clustering and compression coding. The compressed LeNet, AlexNet and VGG respectively reduce the storage consumption by about 222 times, 22 times and 68 times compared with the original network models. Then, in the hardware accelerator designed by the invention, all multipliers in the PE units are replaced by shifters, and all basic modules can be configured according to the network computation and hardware resources. Therefore, the method has the advantages of high speed, low power consumption, small resource occupation and high data utilization rate.

Claims (6)

1. A sparse convolutional neural network accelerator, comprising: an off-chip DRAM, a neuron input buffer area, a neuron decoding unit, a neuron coding unit, a neuron output buffer area, a neuron on-chip global buffer area, a weight input buffer area, a weight decoding unit, a weight on-chip global buffer area, a PE (processing element) computing unit array, an activation unit and a pooling unit; wherein,
the off-chip DRAM is used for storing the data of the compressed neurons and the connection weights;
the neuron input buffer area is used for caching compressed neuron data read from the off-chip DRAM and transmitting the compressed neuron data to the neuron decoding unit;
the neuron decoding unit is used for decoding the compressed neuron data and transmitting the decoded neurons to a global buffer area on a neuron sheet;
the neuron coding unit is used for compressing and coding data in the global buffer area on the neuron sheet and then transmitting the data to the neuron output buffer area;
the neuron output buffer area is used for storing the received compressed neuron data into an off-chip DRAM;
the global buffer area on the neuron sheet is used for caching the decoded neurons and intermediate results obtained by the calculation of the PE array, and carrying out data communication with the PE calculation unit array;
the activation unit is used for activating the result calculated by the PE calculation unit array and then transmitting the result to the pooling unit or the global buffer area on the neuron sheet; the pooling unit is used for downsampling the result after the activation unit is activated;
the weight value input buffer area is used for caching compressed connection weight data read from the off-chip DRAM and transmitting the compressed connection weight data to the weight value decoding unit;
the weight decoding unit is used for decoding the compressed connection weight data and transmitting the decoded connection weights to the weight on-chip global buffer area;
the weight on-chip global buffer area is used for caching the decoded connection weights and for exchanging data with the PE computing unit array;
and the PE computing unit array is used for computing in either a convolution calculation mode or a fully-connected calculation mode.
2. An implementation method based on the sparse convolutional neural network accelerator as claimed in claim 1, characterized in that data are preprocessed and stored in DRAM; the connection weights of the sparse network stored in the off-chip DRAM are read into the weight input buffer area, decoded by the weight decoding unit and stored in the weight on-chip global buffer area; the neurons stored in the off-chip DRAM are read into the neuron input buffer area, decoded by the neuron decoding unit and stored in the neuron on-chip global buffer area; the calculation mode of the PE computing unit array is determined according to the configuration parameters of the current layer of the neural network, and the decoded and arranged neurons and connection weights are sent to the corresponding PE computing units; after entering a PE, the neuron and connection weight data are buffered in Neuron_Buffer and Weight_Buffer respectively, and once Neuron_Buffer and Weight_Buffer are filled for the first time, the products of the neurons and the connection weights are calculated; wherein Neuron_Buffer is the neuron buffer inside the PE and Weight_Buffer is the weight buffer inside the PE (a software sketch of this buffering is given after this claim);
if the result calculated by the PE computing unit array is an intermediate result, it is temporarily stored in the neuron on-chip global buffer area and waits to be accumulated with the other intermediate results calculated by the PE computing unit array; if it is the final result of a convolutional layer, it is sent to the activation unit, the activated result is then sent to the pooling unit, and after down-sampling it is stored back into the neuron on-chip global buffer area; if it is the final result of a fully-connected layer, it is sent to the activation unit and then stored directly back into the neuron on-chip global buffer area; the data stored in the neuron on-chip global buffer area are compressed and coded by the neuron coding unit according to the sparse matrix compression coding storage format, sent to the neuron output buffer area, and then written back to the off-chip DRAM to be read in when the next layer is computed; the above steps are repeated until the last fully-connected layer outputs a one-dimensional vector whose length is the number of classification categories.
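For illustration only and not as part of the claim, the following Python sketch models the per-PE buffering described in claim 2: neurons and connection weights queue in Neuron_Buffer and Weight_Buffer, and the first product is produced only once both FIFOs have been filled. The class name, depth parameter and example values are hypothetical.

    from collections import deque

    class PE:
        # software stand-in for one PE: products start only after both
        # FIFOs have been filled for the first time
        def __init__(self, depth):
            self.depth = depth
            self.neuron_buffer = deque()   # Neuron_Buffer inside the PE
            self.weight_buffer = deque()   # Weight_Buffer inside the PE

        def push(self, neuron, weight):
            self.neuron_buffer.append(neuron)
            self.weight_buffer.append(weight)

        def multiply_head(self):
            # one clock cycle: multiply the oldest neuron by the oldest weight
            if len(self.neuron_buffer) < self.depth:
                return None                # buffers not yet filled for the first time
            return self.neuron_buffer[0] * self.weight_buffer[0]

    pe = PE(depth=3)
    for n, w in [(1, 2), (3, 4), (5, 6)]:
        pe.push(n, w)
    print(pe.multiply_head())              # 2: the first product, once both FIFOs are full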
3. The method for implementing the sparse convolutional neural network accelerator as claimed in claim 2, wherein the specific process for preprocessing the data is as follows:
1) pruning the connection weight to make the neural network sparse;
2) 2^n connection weight clustering: first, according to the distribution of the network parameters after pruning, cluster centers of the form 2^n are selected within the range of the parameters according to the quantization bit width, where n is a negative integer, a positive integer or zero; the Euclidean distance between each remaining non-zero parameter and the cluster centers is then calculated, and each non-zero parameter is clustered to the center with the shortest Euclidean distance (a software sketch is given after this list);
3) for the sparse network clustered in step 2), the connection weights of the sparse network and the feature map neurons are compressed using the sparse matrix compression coding storage format and then stored in the off-chip DRAM.
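For illustration only, the sketch referenced in step 2) follows: it prunes small weights and then clusters each remaining non-zero weight to the nearest +/- 2^n centre. The pruning threshold and the n_min/n_max range standing in for the quantization bit width are illustrative assumptions.

    import numpy as np

    def prune(weights, threshold):
        # step 1): zero out connection weights whose magnitude is below the threshold
        return np.where(np.abs(weights) < threshold, 0.0, weights)

    def cluster_pow2(weights, n_min=-4, n_max=2):
        # step 2): cluster each non-zero weight to the nearest centre of the form 2**n
        # (in 1-D the Euclidean distance is simply the absolute difference)
        centres = np.array([2.0 ** n for n in range(n_min, n_max + 1)])
        out = np.zeros_like(weights)
        nz = weights != 0
        mags = np.abs(weights[nz])
        idx = np.argmin(np.abs(mags[:, None] - centres[None, :]), axis=1)
        out[nz] = np.sign(weights[nz]) * centres[idx]
        return out

    w = np.array([0.03, -0.9, 0.26, 0.0, 1.7])
    print(cluster_pow2(prune(w, 0.05)))    # [ 0.   -1.    0.25  0.    2.  ]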
4. The method for implementing the sparse convolutional neural network accelerator as claimed in claim 2, wherein the sparse matrix compression coding storage format is specifically: the coding of convolutional layer feature maps and fully-connected layer neurons is divided into three parts, wherein each 5-bit field represents the number of zeros between the current non-zero neuron and the previous non-zero neuron, each 16-bit field represents the value of the non-zero element, and the last 1-bit flag marks whether the current row of the convolutional layer feature map, or the current layer of fully-connected input neurons, ends within this storage unit; one whole storage unit codes three non-zero elements and their relative position information; if a storage unit contains the last non-zero element of a row of the convolutional layer feature map, or the last non-zero element of a layer of input neurons of the fully-connected layer, the remaining 5-bit and 16-bit fields are padded with zeros and the last 1-bit flag is set to 1; the coding of the connection weights is likewise divided into three parts, wherein each 4-bit field represents the number of zeros between the current non-zero connection weight and the previous non-zero connection weight, each 5-bit field represents a non-zero connection weight index, and the last 1-bit flag marks whether the non-zero connection weights in the convolutional layer correspond to the same input feature map, or whether the non-zero connection weights in the fully-connected layer correspond to the same output neuron; one whole storage unit codes seven non-zero connection weights and their relative position information; similarly, if a storage unit contains the last non-zero connection weight corresponding to the same input feature map in the convolutional layer, or the last non-zero connection weight corresponding to the same output neuron in a fully-connected layer, the remaining 4-bit and 5-bit fields are padded with zeros and the last 1-bit flag is set to 1.
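For illustration only, a simplified software model of the neuron coding described in claim 4 follows: each non-zero value is stored as a (5-bit zero run, 16-bit value) pair, three pairs plus a 1-bit end-of-row flag per storage unit, with zero padding when a row ends. Zero runs longer than 31 and the exact bit packing of a storage unit are not modelled here.

    def encode_neuron_row(row):
        # collect (zero-run, value) pairs for the non-zero neurons of one row
        pairs = []
        zeros = 0
        for v in row:
            if v == 0:
                zeros += 1                        # zeros since the last non-zero neuron
            else:
                pairs.append((zeros & 0x1F, v & 0xFFFF))
                zeros = 0
        # pack three pairs per storage unit; flag the unit that holds the row's last non-zero
        units = []
        for i in range(0, len(pairs), 3):
            chunk = pairs[i:i + 3]
            last = (i + 3) >= len(pairs)
            chunk += [(0, 0)] * (3 - len(chunk))  # pad an incomplete unit with zeros
            units.append((chunk, 1 if last else 0))
        return units

    print(encode_neuron_row([0, 0, 7, 0, 5, 0, 0, 0, 9, 4]))
    # [([(2, 7), (1, 5), (3, 9)], 0), ([(0, 4), (0, 0), (0, 0)], 1)]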
5. The method for implementing the sparse convolutional neural network accelerator as claimed in claim 2, wherein, when the calculation mode of the PE computing unit array is convolution calculation, if the convolution window size is k × k and the convolution window sliding step is s, then once filled, Neuron_Buffer and Weight_Buffer store neurons 1 to k and connection weights 1 to k respectively, from first to last, according to the first-in first-out principle;
if the convolution window sliding step s equals 1, in the first clock cycle the product of neuron 1 and connection weight 1 is calculated, neuron 1 is then discarded directly, and connection weight 1 is moved to the back of Weight_Buffer; in the second clock cycle a new neuron k+1 is read in, the product of neuron 2 and connection weight 2 is calculated, and neuron 2 and connection weight 2 are then moved to the back of their respective buffers; in the third clock cycle, after the product of neuron 3 and connection weight 3 is calculated, neuron 3 and connection weight 3 are moved to the back of their respective buffers; and so on, until in the k-th clock cycle the PE unit finishes the multiply-accumulation of one row of one convolution window; while the result is output, the order of the values in the buffers is adjusted according to the step s, ready for the calculation of the next convolution window;
if the step s is greater than 1, in the first clock cycle the product of neuron 1 and connection weight 1 is calculated, neuron 1 is then discarded directly, and connection weight 1 is moved to the back of Weight_Buffer; in the second clock cycle a new neuron k+1 is read in, the product of neuron 2 and connection weight 2 is calculated, neuron 2 is discarded directly, and connection weight 2 is moved to the back of Weight_Buffer; and so on, until in the s-th clock cycle a new neuron k+s-1 is read in, the product of neuron s and connection weight s is calculated, neuron s is discarded directly, and connection weight s is moved to the back of Weight_Buffer; in the (s+1)-th clock cycle a new neuron k+s is read in, the product of neuron s+1 and connection weight s+1 is calculated, and neuron s+1 and connection weight s+1 are then moved to the back of their respective buffers; in the (s+2)-th clock cycle, after the product of neuron s+2 and connection weight s+2 is calculated, neuron s+2 and connection weight s+2 are moved to the back of their respective buffers; and so on, until in the k-th clock cycle the PE unit finishes the multiply-accumulation of one row of one convolution window; while the result is output, the order of the values in the buffers is adjusted according to the step s, ready for the calculation of the next convolution window.
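For illustration only, a simplified software model of the convolution-mode schedule of claim 5 follows: the connection weights are reused (conceptually rotated back into Weight_Buffer), while the s oldest neurons of each window row are discarded and s new neurons are read in, so successive windows share their overlapping neurons. The cycle-by-cycle buffer reordering is abstracted into a single window sum; function names and example values are illustrative.

    from collections import deque

    def pe_conv_row(neurons, weights, s):
        k = len(weights)
        neuron_buf = deque(neurons[:k])      # Neuron_Buffer filled with neurons 1..k
        weight_buf = deque(weights)          # Weight_Buffer filled with weights 1..k
        next_in, outputs = k, []
        while len(neuron_buf) == k:
            # k clock cycles: multiply-accumulate one row of one convolution window
            outputs.append(sum(n * w for n, w in zip(neuron_buf, weight_buf)))
            for _ in range(s):               # discard the s oldest neurons ...
                neuron_buf.popleft()
            while len(neuron_buf) < k and next_in < len(neurons):
                neuron_buf.append(neurons[next_in])   # ... and read s new neurons in
                next_in += 1
        return outputs

    print(pe_conv_row([1, 2, 3, 4, 5, 6], [1, 0, -1], s=1))   # [-2, -2, -2, -2]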
6. The method for implementing the sparse convolutional neural network accelerator as claimed in claim 2, wherein, when the calculation mode of the PE computing unit array is fully-connected calculation, if the number of input neurons calculated by each PE unit is m, then once filled, Neuron_Buffer and Weight_Buffer store neurons 1 to m and connection weights 1 to m respectively, from first to last, according to the first-in first-out principle; in the first clock cycle the product of neuron 1 and connection weight 1 is calculated, connection weight 1 is then discarded directly, and neuron 1 is moved to the back of Neuron_Buffer; in the second clock cycle a new connection weight m+1 is read in, the product of neuron 2 and connection weight 2 is calculated, connection weight 2 is discarded directly, and neuron 2 is moved to the back of Neuron_Buffer; and so on, until in the m-th clock cycle the PE unit finishes the intermediate result of one output neuron; while the result is output, the order of the values in the buffers is adjusted, ready for the calculation of the next group of output neurons.
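For illustration only, a simplified software model of the fully-connected schedule of claim 6 follows: the m input neurons stay in Neuron_Buffer and are rotated for reuse, while each connection weight is used once, discarded and replaced by the next weight from the stream. The weight layout in the example (weights grouped output neuron by output neuron) is an illustrative assumption.

    from collections import deque

    def pe_fc(neurons, weight_stream, m):
        neuron_buf = deque(neurons[:m])              # Neuron_Buffer, reused across outputs
        weight_buf = deque(weight_stream[:m])        # Weight_Buffer, consumed and refilled
        next_w, outputs = m, []
        while len(weight_buf) == m:
            acc = 0
            for _ in range(m):                       # m clock cycles per output neuron
                n = neuron_buf.popleft()
                w = weight_buf.popleft()             # weight used once, then discarded
                acc += n * w
                neuron_buf.append(n)                 # neuron rotated to the back for reuse
                if next_w < len(weight_stream):
                    weight_buf.append(weight_stream[next_w])
                    next_w += 1
            outputs.append(acc)                      # intermediate result of one output neuron
        return outputs

    # two output neurons, m = 3 inputs: weights laid out output-by-output
    print(pe_fc([1, 2, 3], [1, 1, 1, 2, 0, -1], m=3))   # [6, -1]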
CN201811582530.6A 2018-12-24 2018-12-24 Sparse convolution neural network accelerator and implementation method Active CN109635944B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811582530.6A CN109635944B (en) 2018-12-24 2018-12-24 Sparse convolution neural network accelerator and implementation method

Publications (2)

Publication Number Publication Date
CN109635944A CN109635944A (en) 2019-04-16
CN109635944B true CN109635944B (en) 2020-10-27

Family

ID=66076874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811582530.6A Active CN109635944B (en) 2018-12-24 2018-12-24 Sparse convolution neural network accelerator and implementation method

Country Status (1)

Country Link
CN (1) CN109635944B (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070178B (en) * 2019-04-25 2021-05-14 北京交通大学 Convolutional neural network computing device and method
CN110390385B (en) * 2019-06-28 2021-09-28 东南大学 BNRP-based configurable parallel general convolutional neural network accelerator
US10977002B2 (en) * 2019-07-15 2021-04-13 Facebook Technologies, Llc System and method for supporting alternate number format for efficient multiplication
CN110543936B (en) * 2019-08-30 2022-03-25 北京空间飞行器总体设计部 Multi-parallel acceleration method for CNN full-connection layer operation
CN111492369B (en) * 2019-09-19 2023-12-12 香港应用科技研究院有限公司 Residual quantization of shift weights in artificial neural networks
CN112561050B (en) * 2019-09-25 2023-09-05 杭州海康威视数字技术股份有限公司 Neural network model training method and device
CN110728358B (en) * 2019-09-30 2022-06-10 上海商汤智能科技有限公司 Data processing method and device based on neural network
CN110738310B (en) * 2019-10-08 2022-02-01 清华大学 Sparse neural network accelerator and implementation method thereof
CN110942145A (en) * 2019-10-23 2020-03-31 南京大学 Convolutional neural network pooling layer based on reconfigurable computing, hardware implementation method and system
CN111008698B (en) * 2019-11-23 2023-05-02 复旦大学 Sparse matrix multiplication accelerator for hybrid compression cyclic neural networks
CN110991631A (en) * 2019-11-28 2020-04-10 福州大学 Neural network acceleration system based on FPGA
CN113159267A (en) * 2020-01-07 2021-07-23 Tcl集团股份有限公司 Image data processing method and device and terminal equipment
CN111275167A (en) * 2020-01-16 2020-06-12 北京中科研究院 High-energy-efficiency pulse array framework for binary convolutional neural network
CN111368988B (en) * 2020-02-28 2022-12-20 北京航空航天大学 Deep learning training hardware accelerator utilizing sparsity
CN111488983B (en) * 2020-03-24 2023-04-28 哈尔滨工业大学 Lightweight CNN model calculation accelerator based on FPGA
CN111340198B (en) * 2020-03-26 2023-05-05 上海大学 Neural network accelerator for data high multiplexing based on FPGA
CN111427895B (en) * 2020-04-01 2022-10-25 西安交通大学 Neural network reasoning acceleration method based on two-segment cache
CN111667063B (en) * 2020-06-30 2021-09-10 腾讯科技(深圳)有限公司 Data processing method and device based on FPGA
CN112966729B (en) * 2021-02-26 2023-01-31 成都商汤科技有限公司 Data processing method and device, computer equipment and storage medium
CN113269316B (en) * 2021-03-26 2022-10-11 复旦大学 Sparse data selection logic module supporting sparse neural network computing accelerator
CN113138957A (en) * 2021-03-29 2021-07-20 北京智芯微电子科技有限公司 Chip for neural network inference and method for accelerating neural network inference
CN113592066B (en) * 2021-07-08 2024-01-05 深圳市易成自动驾驶技术有限公司 Hardware acceleration method, device, equipment and storage medium
CN115828044B (en) * 2023-02-17 2023-05-19 绍兴埃瓦科技有限公司 Dual sparsity matrix multiplication circuit, method and device based on neural network
CN116187408B (en) * 2023-04-23 2023-07-21 成都甄识科技有限公司 Sparse acceleration unit, calculation method and sparse neural network hardware acceleration system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107992329A (en) * 2017-07-20 2018-05-04 上海寒武纪信息科技有限公司 A kind of computational methods and Related product

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Exploring Resource-Aware Deep Neural Network Accelerator and Architecture Design; Baoting Li et al.; 2018 IEEE 23rd International Conference on Digital Signal Processing (DSP); 2018-11-21; pp. 1-5 *
MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects; Hyoukjun Kwon et al.; Session 5B Neural Networks; 2018-03-28; pp. 1-15 *

Similar Documents

Publication Publication Date Title
CN109635944B (en) Sparse convolution neural network accelerator and implementation method
US11803738B2 (en) Neural network architecture using convolution engine filter weight buffers
Yuan et al. High performance CNN accelerators based on hardware and algorithm co-optimization
CN107229967B (en) Hardware accelerator and method for realizing sparse GRU neural network based on FPGA
US11462003B2 (en) Flexible accelerator for sparse tensors in convolutional neural networks
TWI827432B (en) Computing apparatus, machine learning computing apparatus, combined processing apparatus, neural network chip, electronic device, board, and computing method
US20180046895A1 (en) Device and method for implementing a sparse neural network
CN109409511B (en) Convolution operation data flow scheduling method for dynamic reconfigurable array
CN107423816B (en) Multi-calculation-precision neural network processing method and system
WO2017214728A1 (en) Accelerator for deep neural networks
CN107256424B (en) Three-value weight convolution network processing system and method
CN112329910B (en) Deep convolution neural network compression method for structure pruning combined quantization
US11797830B2 (en) Flexible accelerator for sparse tensors in convolutional neural networks
CN110543939A (en) hardware acceleration implementation framework for convolutional neural network backward training based on FPGA
CN113741858B (en) Memory multiply-add computing method, memory multiply-add computing device, chip and computing equipment
Chang et al. Reducing MAC operation in convolutional neural network with sign prediction
CN111768458A (en) Sparse image processing method based on convolutional neural network
CN113762493A (en) Neural network model compression method and device, acceleration unit and computing system
CN115186802A (en) Block sparse method and device based on convolutional neural network and processing unit
CN112396072B (en) Image classification acceleration method and device based on ASIC (application specific integrated circuit) and VGG16
Shabani et al. Hirac: A hierarchical accelerator with sorting-based packing for spgemms in dnn applications
CN117454946A (en) Tensor core architecture system supporting unstructured sparse matrix computation
CN110766136B (en) Compression method of sparse matrix and vector
Wong et al. Low bitwidth CNN accelerator on FPGA using Winograd and block floating point arithmetic
CN115688892A (en) FPGA implementation method of sparse weight Fused-Layer convolution accelerator structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant