CN110780923A

CN110780923A - Hardware accelerator applied to binary convolution neural network and data processing method thereof

Info

Publication number: CN110780923A
Application number: CN201911055498.0A
Authority: CN
Inventors: 杜高明; 涂振兴; 陈邦溢; 杨振文; 张多利; 宋宇鲲; 李桢旻
Original assignee: Hefei Polytechnic University
Current assignee: Hefei University of Technology; Hefei Polytechnic University
Priority date: 2019-10-31
Filing date: 2019-10-31
Publication date: 2020-02-11
Anticipated expiration: 2039-10-31
Also published as: CN110780923B

Abstract

The invention discloses a hardware accelerator applied to a binary convolution neural network and a data processing method thereof, wherein the hardware accelerator comprises: the device comprises a neural network parameter storage module, a matrix generator module, a convolution calculation array module, a pooling layer module and a global control module; a neural network parameter storage module is pre-stored with a binary gray level picture and a neural network training parameter of each layer; the matrix generator is responsible for preparing input data of convolution operation; the convolution calculation array is responsible for convolution calculation of the convolution layer; the pooling layer module is responsible for pooling the output of the convolutional layer; the global control module is responsible for controlling the normal work of the whole system. The invention aims to further improve the operation speed of the convolutional neural network, reduce the resources and the computing resources consumed by the network deployment on a hardware platform and simultaneously reduce the power consumption of the network operation.

Description

Hardware accelerator applied to binary convolution neural network and data processing method thereof

Technical Field

The invention belongs to the field of artificial intelligence hardware design, and particularly relates to an accelerator applied to a binary convolution neural network and a data processing method thereof.

Background

The convolutional neural network is derived from an artificial neural network. As a multilayer perception network, the method has strong adaptability to various image transformation forms such as rotation, scaling down or enlargement of images, translation of images and the like, and can quickly extract image features. The method adopts a weight sharing network structure, the structure has strong similarity with a biological neural network structure, the structure reduces the number of weights, thereby reducing the complexity of a network model, the advantages are more obvious when multidimensional images are input to the network, the images can be directly used as the input of the network, and the complicated characteristic extraction and data reconstruction processes in the traditional recognition algorithm are avoided.

Although the numbers are simple in strokes, the types are only ten. However, the digital strokes are simple and have relatively small differences, and the shapes of the handwritten numbers are different, so that the recognition difficulty is increased, and the precision is low. The recognition accuracy can be greatly improved by using the convolutional neural network, and the recognition rate can reach 99.33 percent at present. However, in some recognition processes, a fast recognition speed and a high accuracy are required.

Although the convolutional neural network has the advantages, a large number of convolution calculations are inevitably required, and the higher the image resolution is, the smaller the convolution step size is, and the more calculation resources are required. Meanwhile, the input pictures and the intermediate result characteristic diagrams of the traditional convolutional network in the training stage and the testing stage are represented by full-precision data, a large amount of DSP (digital signal processor) resources are needed for realizing multiplication and addition of the data, and the time spent in the training stage is unacceptable.

At the present stage, people deploy a convolutional neural network on an FPGA to realize acceleration of the network, and many researchers quantize the weight parameters and the data of the intermediate characteristic diagram into 1-bit data to greatly reduce the storage space occupied by the training parameters. However, in many technologies adopted by the methods, the feature graph input to the first layer still adopts full-precision data, a special computing unit needs to be designed for the first layer computing, and the methods are poor in universality, high in resource consumption and large in power consumption loss; the traditional convolution adopts the strategies of multiply-accumulate and popcount algorithm, so that the internal resources of the FPGA are not utilized to the maximum extent, and the calculation period is increased; meanwhile, the strategy of convolution calculation of edge complement +1 or-1 is adopted, edge data still needs to be calculated, the calculation amount is increased, and therefore hardware resource consumption is increased and the calculation period is increased.

Disclosure of Invention

The invention provides a hardware accelerator applied to a binary convolutional neural network and a data processing method thereof based on the prior art, and aims to improve the training speed of the convolutional neural network, reduce resources and computing resources consumed by network deployment on a hardware platform, and reduce the power consumption of network operation.

The technical scheme adopted by the invention to achieve the aim is as follows:

the invention relates to a hardware accelerator applied to a binary convolution neural network, which is characterized in that: the binary convolution neural network comprises: k layers of binarization convolution layers, K layers of activation function layers, K batch normalization layers, K layers of pooling layers and a layer of full-connection classification output layer; combining the number of the training parameters in the batch standardization layer into one;

the binary convolution neural network trains an input binary gray picture with the size of NxM to obtain a neural network training parameter of each layer;

the hardware acceleration circuit comprises: the device comprises a neural network parameter storage module, a matrix generator module, a convolution calculation array module, a pooling layer module and a global control module;

the neural network parameter storage module is pre-stored with the input binary gray level picture and the neural network training parameters of each layer; the neural network training parameters include: weight parameters and batch standardization parameters; the stored input binary gray level picture represents a pixel point "-1" in the binary gray level picture by data "0", and represents a pixel point "1" in the binary gray level picture by data "1";

the matrix generator module consists of a data input circuit, a data register group of a characteristic diagram and a data output circuit;

defining the current layer number in the binary convolution neural network as k, and initializing k to be 1;

under the k-th layer calculation of the binary convolution neural network, the global control module sends a first matrix generation starting signal as a first sliding calculation of a plurality of rows in the input characteristic diagram to the matrix generator module; if k is 1, the input feature map is an input binary gray level picture, and if k is more than 1, the input feature map is an intermediate feature map of each layer;

after the data input circuit receives the matrix generation starting signal, the data input circuit respectively processes according to the type of the current sliding calculation:

if the type of the current sliding calculation is the upper edge calculation, reading data of a first row of input feature maps under the current sliding calculation from the neural network parameter storage module and storing the data into a first register in the data register group when a rising edge of a first period comes; when the rising edge of a second period comes, storing the data in the first register into a second register, reading out the data of a second row input characteristic diagram under the current sliding calculation from the neural network parameter storage module, storing the data into a first register in the data register group, and outputting the data in the first register and the second register by the data output circuit;

if the type of the current sliding calculation is non-edge calculation, reading data of a first row of input feature maps under the current sliding calculation from the neural network parameter storage module and storing the data into a first register in the data register group when a rising edge of a first period comes; when the rising edge of a second period comes, storing the data in the first register into a second register, reading out the data of a second row of input characteristic diagram under the current sliding calculation from the neural network parameter storage module and storing the data into a first register in the data register group; when the rising edge of a third period comes, storing the data in the second register into a third register, storing the data in the first register into the second register, reading out the data of a third row input characteristic diagram under the current sliding calculation from the neural network parameter storage module, storing the data into a first register in the data register group, and outputting the data in the first register, the second register and the third register by the data output circuit;

if the type of the current sliding calculation is the lower edge calculation, when a rising edge of a first period comes, data of a last-but-third-row input feature map under the current sliding calculation stored in the third register is shifted out, data of a last-but-second-row input feature map under the current sliding calculation stored in the second register is stored in the third register, data of a last-but-first-row input feature map under the current sliding calculation stored in the first register is stored in the second register, and data in the second register and data in the third register are output by the data output circuit;

the convolution calculation array module consists of a parameter reading module, a 6:2 compression tree module and an output interface module;

under the k-th layer calculation of the binary convolution neural network, the parameter reading module judges and acquires data under the current sliding calculation output by the data output circuit as input data, reads neural network training parameters under the current sliding calculation from the neural network parameter storage module and then sends the neural network training parameters to the 6:2 compression tree module;

the 6:2 compression tree module consists of an exclusive nor gate array, a popcount module and an adder;

the XNOR gate array receives input data and neural network training parameters under the current sliding calculation and respectively processes the input data and the neural network training parameters according to the type of the current sliding calculation to obtain the XNOR result:

if the type of the current sliding calculation is the upper edge calculation and the left edge calculation, taking the bits [4:5] of the weight parameter and the bits [0:1] of the data given by the second register as the same or, and taking the bits [7:8] of the weight parameter and the bits [0:1] of the data given by the first register as the same or;

if the type of the current sliding calculation is upper edge calculation and is not left-right edge calculation, taking the same OR operation of the bits [3:5] of the weight parameter and the bits [ a: a +2] of the data of the feature diagram given by the second register, and taking the same OR operation of the bits [6:8] of the weight parameter and the bits [ a: a +2] of the data given by the first register, wherein a is more than or equal to 0 and less than or equal to M;

if the type of the current sliding calculation is the upper edge calculation and the right edge calculation, taking the same OR operation on the [3:4] bit of the weight parameter and the [ M-1: M ] bit of the data given by the second register, and taking the same OR operation on the [6:7] bit of the weight parameter and the [ M-1: M ] bit of the data given by the first register;

if the type of the current sliding calculation is non-upper and lower edge calculation and left edge calculation, taking the bits [1:2] of the weight parameter and the bits [0:2] of the data given by a third register as an exclusive OR operation, taking the bits [4:5] of the weight parameter and the bits [0:2] of the data given by a second register as an exclusive OR operation, and taking the bits [7:8] of the weight parameter and the bits [0:2] of the data given by a first register as an exclusive OR operation;

if the type of the current sliding calculation is non-upper and lower edge calculation and non-left and right edge calculation, taking an exclusive OR operation on [0:2] bits of the weight parameter and [ c: c +2] bits of data given by a third register, taking an exclusive OR operation on [3:5] bits of the weight parameter and [ c: c +2] bits of data given by a second register, and taking an exclusive OR operation on [6:8] bits of the weight parameter and [ c: c +2] bits of data given by a first register, wherein c is more than or equal to 0 and less than or equal to M;

if the type of the current sliding calculation is non-upper and lower edge calculation and right edge calculation, taking the same OR operation on the [0:1] bit of the weight parameter and the [ M-1: M ] bit of the data given by a third register, taking the same OR operation on the [3:4] bit of the weight parameter and the [ M-1: M ] bit of the data given by a second register, and taking the same OR operation on the [6:7] bit of the weight parameter and the [ M-1: M ] bit of the data given by a first register;

if the type of the current sliding calculation is the lower edge calculation and the left edge calculation, taking the bits [1:2] of the weight parameter and the bits [0:1] of the data given by the third register as the same or operation, and taking the bits [4:5] of the weight parameter and the bits [0:1] of the data given by the second register as the same or operation;

if the type of the current sliding calculation is lower edge calculation and non-left and right edge calculation, performing exclusive OR operation on the [0:2] bit of the weight parameter and the [ b: b +2] bit of the data given by the third register, and performing exclusive OR operation on the [3:5] bit of the weight parameter and the [ b: b +2] bit of the data given by the second register, wherein b is more than or equal to 0 and less than or equal to M;

if the type of the current sliding calculation is lower edge calculation and right edge calculation, taking the same OR operation on the [0:1] bit of the weight parameter and the [ M-1: M ] bit of the data given by the third register, and taking the same OR operation on the [3:4] bit of the weight parameter and the [ M-1: M ] bit of the data given by the second register;

the popcount module counts '1' in the same or result to obtain a counting result;

the adder accumulates the counting result and the batch standardization parameters, and the sign bit of the obtained accumulation result is used as an output characteristic diagram of the k-th layer binary convolution layer;

the k-th pooling layer performs pooling on the output characteristic map of the k-th binarization convolution layer, and an obtained pooling result is used as an input characteristic map of a k + 1-th binarization convolution layer;

and if K is equal to K, taking the output result of the K-1 layer pooling layer as the input feature map of the fully-connected classified output layer, calculating the result at the K layer of the binarization convolutional neural network, and taking the accumulated result obtained by the adder and the sign bit as the recognition result of the hardware accelerator on the numbers in the binarization gray level picture.

The invention relates to a data processing method of a hardware accelerator applied to a binary convolution neural network, which is characterized by comprising the following steps:

step 1, constructing a binary convolution neural network comprises the following steps: k layers of binarization convolution layers, K layers of activation function layers, K batch normalization layers, K layers of pooling layers and a layer of full-connection classification output layer;

step 2, combining four training parameters E [ A ], Var [ A ], β and gamma in the batch standardization layer into one parameter C by using a batch standardization transformation formula shown in formula (1):

in the formula (1), gamma is a standardized extended parameter, β is a standardized translation parameter, A is the number of a small batch during training, Var [ A ] is the average value of each small batch input in a batch standardized layer during one network training, E [ A ] is the variance value of each small batch input in a scalar standardized layer during one network training, epsilon is a preset offset parameter before network training, and the bit width of a single input value of Vec _ Len;

step 3, acquiring a gray level digital picture with the size of NxM and carrying out binarization processing to obtain an input binarization gray level picture; the input binary gray picture represents a pixel point "-1" in the binary gray picture by using data "0", and represents a pixel point "1" in the binary gray picture by using data "1";

step 4, training the input binary gray level picture by using the binary convolution neural network to obtain a neural network training parameter of each layer; the neural network training parameters include: weight parameters and batch standardization parameters;

step 5, constructing the hardware acceleration circuit comprises: the device comprises a neural network parameter storage module, a matrix generator module, a convolution calculation array module, a pooling layer module and a global control module; storing the input binary gray level picture and the neural network training parameters of each layer into the neural network parameter storage module;

step 6, defining the current layer number in the binary convolution neural network as k, and initializing k as 1;

step 7, under the k-th layer calculation of the binary convolution neural network, the global control module sends a first matrix generation starting signal as a first sliding calculation of a plurality of rows in the input characteristic diagram to the matrix generator module; if k is 1, the input feature map is an input binary gray level picture, and if k is more than 1, the input feature map is an intermediate feature map of each layer;

and 8, after receiving the matrix generation starting signal, the matrix generator module respectively processes according to the type of the current sliding calculation:

if the type of the current sliding calculation is the upper edge calculation, reading data of a first row of input feature maps under the current sliding calculation from the neural network parameter storage module and storing the data into a first register in a self data register group when a rising edge of a first period comes; when the rising edge of a second period comes, storing the data in the first register into a second register, reading out the data of a second row of input characteristic diagram under the current sliding calculation from the neural network parameter storage module and storing the data into the first register;

if the type of the current sliding calculation is non-edge calculation, reading data of a first row of input feature maps under the current sliding calculation from the neural network parameter storage module and storing the data into the first register when a rising edge of a first period comes; when the rising edge of a second period comes, storing the data in the first register into a second register, reading out the data of a second row of input characteristic diagram under the current sliding calculation from the neural network parameter storage module and storing the data into the first register; when the rising edge of a third period comes, storing the data in the second register into a third register, storing the data in the first register into the second register, reading out the data of a third row input feature map under the current sliding calculation from the neural network parameter storage module, and storing the data into the first register;

if the type of the current sliding calculation is the lower edge calculation, when the rising edge of the first period comes, the data of the input feature diagram in the last-but-third row stored in the third register under the current sliding calculation is shifted out, the data of the input feature diagram in the last-but-second row under the current sliding calculation in the second register is stored in the third register, and the data of the input feature diagram in the last-but-first row under the current sliding calculation in the first register is stored in the second register;

step 9, under the k-th layer calculation of the binary convolution neural network, the convolution calculation array module judges and acquires data under the current sliding calculation output by the matrix generator module as input data, and reads the neural network training parameters under the current sliding calculation from the neural network parameter storage module, so as to respectively process according to the type of the current sliding calculation, and obtain the same or the same result:

step 10, counting '1' in the same or result to obtain a counting result;

step 11, accumulating the counting result and the batch standardized parameters by adopting the formula (2) to obtain an accumulated result Y ₂And after the sign bit is taken, an intermediate result is obtained;

in the formula (2), X _0,1Is a normalized single input value, popcount (X) _0,1) Representing an input value X _0,1The number of middle "1";

an obtained accumulation result step 12, taking the intermediate result as an output characteristic diagram of the k-th layer binarization convolution layer, and performing pooling processing on the output characteristic diagram of the k-th layer binarization convolution layer by using a k-th layer pooling layer to obtain a k-th layer pooling result;

step 13, taking the K-th pooling result as an input feature map of a K + 1-th binary convolution layer, assigning K +1 to K, judging whether K > K is satisfied, and if so, indicating that the output result of the K-th pooling layer is the input feature map of the fully-connected classified output layer;

and step 14, processing the input characteristic diagram of the fully-connected classified output layer in steps 7 to 11 to obtain an intermediate result as a recognition result of the hardware accelerator on the numbers in the binary gray level picture.

Compared with the prior art, the beneficial technical effects of the invention are as follows:

1. the invention provides a method for carrying out binarization processing on a training and testing data set picture, unifies a hardware overall framework, enables floating point numbers not to exist in the network training and testing process, obtains results through simple logic operation of convolution operation, reduces storage space occupied by input data and resources consumed by convolution operation in the process of extracting features from an input picture, enables the network to be lighter and accelerates the operation speed of the network to a certain extent.

2. The invention further changes the existing batch standardization conversion formula, adds the characteristics of all binaryzation processing of the characteristic parameters and the weight parameters, and finally converts the result which can be obtained only through a complex batch standardization calculation formula into a standardization result which can be obtained through a simple hardware comparison circuit. In the conversion process, 4 training parameters which are standardized in batches are combined into one parameter, so that when the hardware stores the batch standardized parameters, the hardware does not need to store 4 training parameters and only needs to prestore one parameter, precious hardware storage resources are saved, and the hardware design is simplified.

3. The invention further provides a 6:2 compression tree technology with higher utilization rate of hardware resources on the basis of the existing 6:3 compression tree. By the proposed 6:2 compression technology and combining with the calculation of the edge skipping calculation array, the resources of the LUT can be fully utilized when the same or array carries out convolution operation. When calculating the non-edge array, 3 calculation units with 6 inputs and 2 outputs are needed in the process of extracting the features of a feature map by a convolution kernel, namely 3 LUTs are needed, for the characteristic extraction process of non-edge angles, because of the scheme design of skipping edge array data, only 6bit data parameter operation is calculated at one time, therefore, 2 calculation units with 6 inputs and 2 outputs are needed, namely 2 LUTs are needed, and for the feature extraction of the upper left corner, the upper right corner, the lower left corner and the lower right corner, the convolution calculation only has 4bit of input, thus, 2 6-in 2-out calculation units, also 2 LUTs, are also required, resulting in the convolution calculation for these four positions not being fully utilized with LUTs, however, the calculation resources for the fee are relatively small compared with the resources used in the whole process, and the resource waste is less than that of the existing 6:3 compression technology.

4. The invention solves the problems of increased data volume and occupation of LUT (look-up table) resources caused by completion of the conventional convolutional neural network deployed on the FPGA in a zero padding mode during edge calculation by adopting an independently designed edge unit calculation array. The embodiment adopts a unique edge unit design, so that each layer of feature diagram data stored in the DDR does not need to be stored in a zero padding mode, and occupied storage resources are reduced. The edge cell designs have 4 types, and are divided into upper, lower, left and right edge cell designs, and the specific edge calculation process is shown in fig. 6 and 7.

In summary, unlike the conventional BNN network input which is a full-precision gray-scale image, the present invention performs binarization on the input gray-scale image; based on the existing 6:3 compression tree technology, a 6:2 compression tree technology which saves resources more and calculates with high efficiency is designed, and edge skipping unit fast element design is combined, so that the calculated data volume is reduced, and the calculation speed is increased; the calculation process is optimized in the batch standardization calculation process, so that the result can be obtained through simple comparison, and operations such as multiplication, addition and the like are reduced.

Drawings

FIG. 1 is a flow chart of a convolutional neural network for handwritten word recognition as employed in the present invention;

FIG. 2 is a diagram of a hardware architecture of a binary convolutional neural network employed in the present invention;

FIG. 3 is a diagram of a parameter memory employed by the present invention;

FIG. 4 is a data flow diagram of a matrix generator employed by the present invention;

FIG. 5 is a 6:2 compression tree structure diagram employed by the present invention;

FIG. 6 is a data flow diagram designed using a top edge skipping unit in accordance with the present invention;

FIG. 7 is a data flow diagram designed using a lower edge skipping unit in accordance with the present invention;

FIG. 8 is a flow chart of the convolution of the present invention.

Detailed Description

In this embodiment, the binarizing convolutional neural network includes: k layers of binarization convolution layers, K layers of activation function layers, K batch normalization layers, K layers of pooling layers and a layer of full-connection classification output layer; combining the number of the training parameters in the batch standardization layer into one;

the convolutional neural network used in this embodiment is a handwritten number recognition network, and the structure diagram is shown in fig. 1, and the structure of the convolutional neural network includes an input layer, two convolutional layers, two pooling layers, and a full-link layer. The first layer of calculation is convolutional layer calculation, the input layer is 784 neural nodes, the size of a convolutional kernel is 3 multiplied by 3, the convolution step length is 1, 16 characteristic graphs of 28 multiplied by 28 are output, and 12544 neural nodes are total; the second layer is calculated for a pooling layer, the input layer is 16 characteristic maps of 28 × 28 output by the first convolutional layer, and the pooling kernel adopts a characteristic map of 2 × 2 in size and 2 in step size and outputs 16 characteristic maps of 14 × 14, so that 3136 neurons are total; the third layer is convolution layer calculation, the input layer is 16 14 × 14 feature maps output by the pooling layer, the size of a convolution kernel adopts 3 × 3, the convolution step size is 1, 32 14 × 14 feature maps are output, and 6272 neurons are output; the fourth layer is the calculation of the pooling layer, the input layer is 32 characteristic maps of 14 × 14 output by the second convolution layer, the pooling kernel adopts 2 × 2 in size and 2 in step length, 32 characteristic maps of 7 × 7 are output, and 1568 neurons are obtained in total; the fifth layer is a full connection layer calculation, the input layer is 32 7 multiplied by 7 characteristic graphs output by the second pooling layer, and 10 neurons are output; activation operations and batch normalization operations are performed between the convolutional layer and the pooling layer.

And the binarization convolutional neural network trains the input binarization gray level picture with the size of NxM. This example of implementation is primarily done using the MNIST (Mixed National Institute of Standards and technology database) handwriting library, which is a library of handwritten numbers trained by Karan research, Kyoho laboratories and New York university. The whole training library comprises a training library and a testing library, wherein the training library comprises 60000 handwritten digital images, and the testing library comprises 10000 images. The MNIST handwritten digital image is 28 x 28 in size, i.e. 784 in number of input layer nodes. Obtaining a neural network training parameter of each layer after training is finished;

a hardware accelerator for application to a binarized convolutional neural network comprising: a neural network parameter storage module, a matrix generator module, a convolution calculation array module, a pooling layer module and a global control module, wherein a specific hardware structure block diagram is shown in fig. 2;

the neural network parameter storage module is pre-stored with input binary gray level pictures and neural network training parameters of each layer; the neural network training parameters include: weight parameters and batch standardization parameters; the stored input binary gray level picture represents a pixel point '-1' in the binary gray level picture by data '0', and represents a pixel point '1' in the binary gray level picture by data '1'; the data of the stored characteristic map, the weight parameters and the batch standardized data mode are shown in FIG. 3.

The data storage sequence of the feature map is (nchw), that is, the data corresponding to each address in the DDR is a row of data of the feature map. Storing a first line of data of a first characteristic diagram at an initial address, storing a second line of data of the first characteristic diagram at a second address, sequentially storing all data of the first characteristic diagram, continuously storing the first line of data of a second characteristic diagram, and continuously storing all data of the characteristic diagram of the layer according to the rules;

the convolution kernel data is stored in the order of (nc (w × h)), since the convolution kernel size is fixed to 3 × 3, while the weight parameter is quantized to 1bit, therefore, the data size of each channel of one convolution kernel is only 9 bits, so that the data of the first channel of the first convolution kernel is spliced into 9-bit data according to three lines of data (the first line of 3-bit data is placed at the upper three bits, the second line of 3-bit data is stored at the rear three bits, and the third line of 3-bit data is stored at the lower three bits) to occupy the first address mode for storage, the second address stores the 9-bit data of the second channel of the first convolution kernel, after all the data of the first convolution kernel are stored in sequence, and then storing the data of the first channel of the second convolution kernel, storing the data of the second channel of the second convolution kernel at the next address, and sequentially storing all the convolution kernel data of the layer.

The storage sequence of the batch standardized parameters is as follows: and storing the standardized parameters corresponding to the first convolution kernel at a certain initial address, then storing the standardized parameters corresponding to the second convolution kernel, and sequentially storing all the standardized parameters of the layer.

under the k-th layer calculation of the binary convolution neural network, the global control module sends a first matrix generation starting signal as a first sliding calculation of a plurality of rows in the input characteristic diagram to the matrix generator module; if k is 1, the input feature map is an input binary gray level picture, and if k is greater than 1, the input feature map is an intermediate feature map of each layer;

after receiving the matrix generation start signal, the data input circuit respectively processes according to the type of the current sliding calculation, and the process is as shown in fig. 4:

if the type of the current sliding calculation is the upper edge calculation, reading data of a first row of input characteristic diagram under the current sliding calculation from the neural network parameter storage module and storing the data into a first register in the data register group when a rising edge of a first period comes; when the rising edge of the second period comes, the data in the first register is stored in the second register, the data of the second row input characteristic diagram under the current sliding calculation is read from the neural network parameter storage module and stored in the first register in the data register group, and the data in the first register and the data in the second register are output by the data output circuit;

if the type of the current sliding calculation is non-upper and lower edge calculation, reading data of a first row of input feature maps under the current sliding calculation from a neural network parameter storage module and storing the data into a first register in a data register group when a rising edge of a first period comes; when the rising edge of the second period comes, the data in the first register is stored into the second register, and then the data of the second row of input characteristic diagram under the current sliding calculation is read out from the neural network parameter storage module and stored into the first register in the data register group; when the rising edge of the third period comes, storing the data in the second register into a third register, storing the data in the first register into the second register, reading out the data of the third row input characteristic diagram under the current sliding calculation from the neural network parameter storage module, storing the data into the first register in the data register group, and outputting the data in the first register, the second register and the third register by the data output circuit;

if the type of the current sliding calculation is the lower edge calculation, when the rising edge of the first period comes, the data of the input feature diagram in the last-but-third row stored in the third register under the current sliding calculation is shifted out, the data of the input feature diagram in the last-but-second row under the current sliding calculation in the second register is stored in the third register, the data of the input feature diagram in the last-but-first row under the current sliding calculation in the first register is stored in the second register, and the data in the second register and the data in the third register are output by the data output circuit;

the specific structure of the 6:2 compression tree is shown in fig. 5, the 3-bit weight parameters and the data of the feature map corresponding to the 3bit are divided into a group and input into a calculation unit, and the same or operation and the popcount calculation are completed to obtain the convolution result of the operation.

The xnor gate array receives input data and neural network training parameters under the current sliding calculation, and respectively processes according to the type of the current sliding calculation to obtain the xnor result, and the upper edge calculation and the lower edge calculation are respectively shown in fig. 6 and 7:

the adder accumulates the counting result and the batch standardization parameters, the sign bit of the obtained accumulation result is taken as an output characteristic diagram of the k-th layer binary convolution layer, and the whole calculation flow is shown in fig. 8;

the k-th pooling layer performs pooling on the output characteristic map of the k-th binarization convolution layer, and the obtained pooling result is used as an input characteristic map of the k + 1-th binarization convolution layer;

and if K is equal to K, taking the output result of the K-1 layer pooling layer as an input feature map of the fully-connected classified output layer, calculating the result at the K layer of the binarization convolutional neural network, and taking the accumulated result obtained by the adder and the sign bit as the recognition result of the hardware accelerator on the numbers in the binarization gray level picture.

In this embodiment, a data processing method of a hardware accelerator applied to a binary convolutional neural network is performed as follows:

step 2, combining 4 training parameters E [ A ], Var [ A ], β and gamma in the batch standardization layer into a parameter C by using a batch standardization transformation formula shown in formula (1):

in the formula (1), gamma is a standardized extended parameter, β is a standardized translation parameter, A is the number of a small batch during training, Var [ A ] is the average value of each small batch input in a batch standardization layer during one network training, E [ A ] is the variance value of each small batch input in a scalar standardization layer during one network training, epsilon is an offset parameter preset before network training, and Vec _ Len is the bit width of a single input value, wherein E [ A ], Var [ A ], gamma and β can be integrated into a variable C through the formula (1);

step 3, acquiring a gray level digital picture with the size of NxM and carrying out binarization processing to obtain an input binarization gray level picture; the input binary gray picture uses data '0' to represent pixel point '-1' in the binary gray picture, and uses data '1' to represent pixel point '1' in the binary gray picture;

step 4, training the input binary gray level picture by using a binary convolution neural network to obtain a neural network training parameter of each layer; the neural network training parameters include: weight parameters and batch standardization parameters;

step 5, constructing a hardware acceleration circuit comprises: the device comprises a neural network parameter storage module, a matrix generator module, a convolution calculation array module, a pooling layer module and a global control module; storing the input binary gray level picture and the neural network training parameters of each layer into a neural network parameter storage module;

step 7, under the k-th layer calculation of the binary convolution neural network, the global control module sends a first matrix generation starting signal as a first sliding calculation of a plurality of rows in the input characteristic diagram to the matrix generator module; if k is 1, the input feature map is an input binary gray level picture, and if k is greater than 1, the input feature map is an intermediate feature map of each layer;

if the type of the current sliding calculation is the upper edge calculation, reading data of a first row of input characteristic diagram under the current sliding calculation from the neural network parameter storage module and storing the data into a first register in a self data register group when a rising edge of a first period comes; when the rising edge of the second period comes, the data in the first register is stored into the second register, and then the data of the second row of input characteristic diagram under the current sliding calculation is read out from the neural network parameter storage module and stored into the first register;

if the type of the current sliding calculation is non-edge calculation, reading data of a first row of input feature maps under the current sliding calculation from a neural network parameter storage module and storing the data into a first register when a rising edge of a first period comes; when the rising edge of the second period comes, the data in the first register is stored into the second register, and then the data of the second row of input characteristic diagram under the current sliding calculation is read out from the neural network parameter storage module and stored into the first register; when the rising edge of the third period comes, storing the data in the second register into the third register, storing the data in the first register into the second register, reading out the data of the third row input characteristic diagram under the current sliding calculation from the neural network parameter storage module and storing the data into the first register;

step 9, under the k-th layer calculation of the binary convolution neural network, the convolution calculation array module judges and acquires data under the current sliding calculation output by the matrix generator module as input data, and reads the neural network training parameters under the current sliding calculation from the neural network parameter storage module, so that the processing is respectively carried out according to the type of the current sliding calculation to obtain the same or the same result:

step 10, counting '1' in the same or result to obtain a counting result;

step 11, simplifying the formula (2) into the formula (3) by using the formula (1), and accumulating the counting result and the batch standardized parameters by using the formula (3) to obtain an accumulated result Y ₂And after the sign bit is taken, an intermediate result is obtained:

in the formulae (2) and (3), X _0,1Is a normalized single input value, popcount (X) _0,1) Representing an input value X _0,1The number of "1" s in (A).

Step 12, taking the intermediate result as an output characteristic diagram of the k-th layer of the binary convolution layer, and performing pooling processing on the output characteristic diagram of the k-th layer of the binary convolution layer by utilizing the k-th layer of the pooling layer to obtain a k-th layer of the pooling result;

step 13, taking the K layer pooling result as an input feature map of a K +1 layer binary convolution layer, assigning K +1 to K, judging whether K > K is satisfied, and if so, indicating that the output result of the K layer pooling layer is the input feature map of the fully-connected classification output layer;

and step 14, processing the input characteristic diagram of the fully-connected classification output layer in steps 7-11 to obtain an intermediate result as a recognition result of the hardware accelerator on the numbers in the binary gray level image.

Claims

1. A hardware accelerator applied to a binary convolution neural network is characterized in that: the binary convolution neural network comprises: k layers of binarization convolution layers, K layers of activation function layers, K batch normalization layers, K layers of pooling layers and a layer of full-connection classification output layer; combining the number of the training parameters in the batch standardization layer into one;

2. A data processing method of a hardware accelerator applied to a binary convolution neural network is characterized by comprising the following steps:

step 10, counting '1' in the same or result to obtain a counting result;

step 11, adopting the formula (2) to carry out standardization on counting results and the batchThe numbers are accumulated to obtain an accumulated result Y ₂And after the sign bit is taken, an intermediate result is obtained;