CN110780923A - Hardware accelerator applied to a binarized convolutional neural network and data processing method thereof

Info

Publication number: CN110780923A (application CN201911055498.0A); granted publication CN110780923B
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: register, data, calculation, bits, neural network
Inventors: 杜高明, 涂振兴, 陈邦溢, 杨振文, 张多利, 宋宇鲲, 李桢旻
Assignee: Hefei University of Technology
Legal status: Active (granted)

Classifications

    • G06N 3/045 Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06N 3/08 Learning methods
    • G06F 9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F 9/30098 Register arrangements


Abstract

The invention discloses a hardware accelerator applied to a binarized convolutional neural network and a data processing method thereof. The hardware accelerator comprises a neural network parameter storage module, a matrix generator module, a convolution calculation array module, a pooling layer module and a global control module. The neural network parameter storage module pre-stores a binarized grayscale picture and the trained parameters of each network layer; the matrix generator prepares the input data for the convolution operations; the convolution calculation array performs the convolution of the convolutional layers; the pooling layer module pools the output of each convolutional layer; and the global control module coordinates the operation of the whole system. The invention aims to increase the operating speed of the convolutional neural network, reduce the storage and computing resources consumed when the network is deployed on a hardware platform, and lower the power consumption of network operation.

Description

Hardware accelerator applied to a binarized convolutional neural network and data processing method thereof
Technical Field
The invention belongs to the field of artificial intelligence hardware design, and particularly relates to an accelerator applied to a binarized convolutional neural network and a data processing method thereof.
Background
The convolutional neural network is derived from the artificial neural network. As a multilayer perceptron-style network, it adapts well to image transformations such as rotation, scaling down or up, and translation, and it can extract image features quickly. It adopts a weight-sharing network structure that closely resembles the structure of biological neural networks; weight sharing reduces the number of weights and thereby the complexity of the network model. This advantage is most apparent when multidimensional images are fed to the network: the image can be used directly as the network input, avoiding the complicated feature extraction and data reconstruction of traditional recognition algorithms.
Handwritten digits comprise only ten classes and are simple in stroke structure. Precisely because the strokes are simple, however, the differences between classes are relatively small, and handwriting styles vary widely, which makes recognition difficult and keeps the accuracy of traditional methods low. Convolutional neural networks greatly improve recognition accuracy, with rates of 99.33% reported to date. Nevertheless, some recognition scenarios demand both fast recognition and high accuracy.
Despite these advantages, a convolutional neural network inevitably requires a large number of convolution calculations, and the higher the image resolution and the smaller the convolution stride, the more computing resources are needed. Moreover, in a traditional convolutional network the input pictures and intermediate feature maps of both the training and testing stages are represented in full precision, so their multiply-accumulate operations consume large amounts of DSP (digital signal processor) resources, and the time spent in the training stage becomes unacceptable.
At present, convolutional neural networks are commonly deployed on FPGAs for acceleration, and many researchers quantize the weight parameters and the intermediate feature-map data to 1 bit, greatly reducing the storage space occupied by the trained parameters. However, in many of these designs the feature map fed to the first layer still uses full-precision data, so a dedicated computing unit must be designed for the first-layer calculation; such designs have poor generality, high resource consumption and high power loss. The traditional convolution strategy of multiply-accumulate plus popcount does not exploit the internal resources of the FPGA to the fullest and lengthens the calculation period. Likewise, padding the edges with +1 or -1 for the convolution means the edge data must still be computed, increasing the amount of calculation and therefore both the hardware resource consumption and the calculation period.
Disclosure of Invention
Against this background, the invention provides a hardware accelerator applied to a binarized convolutional neural network and a data processing method thereof, aiming to increase the operating speed of the convolutional neural network, reduce the storage and computing resources consumed when the network is deployed on a hardware platform, and lower the power consumption of network operation.
The technical scheme adopted by the invention to achieve this aim is as follows:
the invention relates to a hardware accelerator applied to a binary convolution neural network, which is characterized in that: the binary convolution neural network comprises: k layers of binarization convolution layers, K layers of activation function layers, K batch normalization layers, K layers of pooling layers and a layer of full-connection classification output layer; combining the number of the training parameters in the batch standardization layer into one;
The binarized convolutional neural network is trained on an input binarized grayscale picture of size N×M to obtain the trained parameters of each layer.
The hardware acceleration circuit comprises a neural network parameter storage module, a matrix generator module, a convolution calculation array module, a pooling layer module and a global control module.
The neural network parameter storage module pre-stores the input binarized grayscale picture and the trained parameters of each layer. The trained parameters include the weight parameters and the batch normalization parameters. In the stored picture, the pixel value "-1" is represented by the data bit "0" and the pixel value "1" by the data bit "1".
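This 0/1 encoding is what lets the convolution array replace every ±1 multiplication with a single XNOR gate, as exploited below. A minimal check in Python (illustrative only):

```python
# With -1 encoded as bit 0 and +1 encoded as bit 1, XNOR reproduces
# the +/-1 product for every input combination.
def xnor(a: int, b: int) -> int:
    return 1 ^ (a ^ b)

encode = {-1: 0, +1: 1}
decode = {0: -1, 1: +1}

for x in (-1, +1):
    for w in (-1, +1):
        assert decode[xnor(encode[x], encode[w])] == x * w
print("XNOR matches the +/-1 product in all four cases")
```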
The matrix generator module consists of a data input circuit, a feature-map data register group and a data output circuit.
Define the current layer number in the binarized convolutional neural network as k, initialized to k = 1.
During the computation of layer k of the binarized convolutional neural network, the global control module sends the matrix generator module a matrix-generation start signal for the current sliding calculation over several rows of the input feature map. If k = 1, the input feature map is the input binarized grayscale picture; if k > 1, it is the intermediate feature map produced by the previous layer.
After the data input circuit receives the matrix-generation start signal, it proceeds according to the type of the current sliding calculation:
If the current sliding calculation is an upper-edge calculation: on the rising edge of the first cycle, the data of the first row of the input feature map under the current sliding calculation are read from the neural network parameter storage module into the first register of the data register group. On the rising edge of the second cycle, the contents of the first register are moved into the second register, the data of the second row are read from the parameter storage module into the first register, and the data output circuit outputs the contents of the first and second registers.
If the current sliding calculation is a non-edge calculation: on the rising edge of the first cycle, the first row of the input feature map is read into the first register. On the rising edge of the second cycle, the first register is moved into the second register and the second row is read into the first register. On the rising edge of the third cycle, the second register is moved into the third register, the first register into the second register, the third row is read into the first register, and the data output circuit outputs the contents of all three registers.
If the current sliding calculation is a lower-edge calculation: on the rising edge of the first cycle, the third-to-last row held in the third register is shifted out, the second-to-last row in the second register is moved into the third register, the last row in the first register is moved into the second register, and the data output circuit outputs the contents of the second and third registers. A behavioural sketch of this line buffer follows below.
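The three cases above describe a three-row line buffer: one new row enters per clock cycle, and rows that do not exist at the top or bottom edge are simply never presented to the output circuit. A minimal behavioural sketch in Python (register and method names are illustrative; the real circuit is a clocked register group):

```python
from typing import List, Optional

Row = List[int]

class LineBuffer:
    """Behavioural model of the matrix generator's three-register group."""

    def __init__(self) -> None:
        self.reg1: Optional[Row] = None   # newest row
        self.reg2: Optional[Row] = None
        self.reg3: Optional[Row] = None   # oldest row

    def clock(self, new_row: Optional[Row]) -> List[Row]:
        # Rising edge: reg2 -> reg3, reg1 -> reg2, then load reg1.
        self.reg3, self.reg2, self.reg1 = self.reg2, self.reg1, new_row
        # Only the rows that actually exist are handed to the output circuit,
        # so no padding rows are ever stored or computed.
        return [r for r in (self.reg1, self.reg2, self.reg3) if r is not None]

fmap = [[1, 0, 1, 1], [0, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1]]
buf = LineBuffer()
for row in fmap + [None]:               # trailing None models the lower edge
    available = buf.clock(row)
    print(f"{len(available)} row(s) available this cycle")
# Prints 1, 2, 3, 3, 2: the first cycle only loads, upper-edge windows use
# two rows, interior windows use three, and lower-edge windows use two.
```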
The convolution calculation array module consists of a parameter reading module, a 6:2 compression tree module and an output interface module.
During the computation of layer k of the binarized convolutional neural network, the parameter reading module takes the data output by the data output circuit under the current sliding calculation as input data, reads the trained parameters for the current sliding calculation from the neural network parameter storage module, and sends both to the 6:2 compression tree module.
The 6:2 compression tree module consists of an XNOR gate array, a popcount module and an adder.
The XNOR gate array receives the input data and the trained parameters under the current sliding calculation and processes them according to the type of the current sliding calculation to obtain the XNOR result:
If the current sliding calculation is an upper-edge and left-edge calculation, XNOR bits [4:5] of the weight parameter with bits [0:1] of the data from the second register, and XNOR bits [7:8] of the weight parameter with bits [0:1] of the data from the first register.
If the current sliding calculation is an upper-edge calculation but not a left- or right-edge calculation, XNOR bits [3:5] of the weight parameter with bits [a:a+2] of the feature-map data from the second register, and XNOR bits [6:8] of the weight parameter with bits [a:a+2] of the data from the first register, where 0 ≤ a ≤ M.
If the current sliding calculation is an upper-edge and right-edge calculation, XNOR bits [3:4] of the weight parameter with bits [M-1:M] of the data from the second register, and XNOR bits [6:7] of the weight parameter with bits [M-1:M] of the data from the first register.
If the current sliding calculation is neither an upper- nor a lower-edge calculation but is a left-edge calculation, XNOR bits [1:2] of the weight parameter with bits [0:2] of the data from the third register, XNOR bits [4:5] of the weight parameter with bits [0:2] of the data from the second register, and XNOR bits [7:8] of the weight parameter with bits [0:2] of the data from the first register.
If the current sliding calculation is neither an upper-, lower-, left- nor right-edge calculation, XNOR bits [0:2] of the weight parameter with bits [c:c+2] of the data from the third register, XNOR bits [3:5] of the weight parameter with bits [c:c+2] of the data from the second register, and XNOR bits [6:8] of the weight parameter with bits [c:c+2] of the data from the first register, where 0 ≤ c ≤ M.
If the current sliding calculation is neither an upper- nor a lower-edge calculation but is a right-edge calculation, XNOR bits [0:1] of the weight parameter with bits [M-1:M] of the data from the third register, XNOR bits [3:4] of the weight parameter with bits [M-1:M] of the data from the second register, and XNOR bits [6:7] of the weight parameter with bits [M-1:M] of the data from the first register.
If the current sliding calculation is a lower-edge and left-edge calculation, XNOR bits [1:2] of the weight parameter with bits [0:1] of the data from the third register, and XNOR bits [4:5] of the weight parameter with bits [0:1] of the data from the second register.
If the current sliding calculation is a lower-edge calculation but not a left- or right-edge calculation, XNOR bits [0:2] of the weight parameter with bits [b:b+2] of the data from the third register, and XNOR bits [3:5] of the weight parameter with bits [b:b+2] of the data from the second register, where 0 ≤ b ≤ M.
If the current sliding calculation is a lower-edge and right-edge calculation, XNOR bits [0:1] of the weight parameter with bits [M-1:M] of the data from the third register, and XNOR bits [3:4] of the weight parameter with bits [M-1:M] of the data from the second register.
The popcount module counts the "1" bits in the XNOR result to obtain a count.
The adder accumulates the count with the batch normalization parameter, and the sign bit of the resulting sum serves as the output feature map of the k-th binarized convolutional layer; a software sketch of this XNOR/popcount/adder datapath follows below.
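Taken together, the XNOR array, the popcount module and the adder compute one binarized convolution window with no multiplier at all. The following Python sketch models that datapath for a single interior 3×3 window; the function name, the bit-list representation and the sign convention (returning bit 1 for a non-negative sum) are illustrative choices, not the patent's RTL:

```python
def binary_conv_window(rows, weights, c):
    """One interior 3x3 window: XNOR array, popcount stage, adder, sign bit.

    rows    -- three lists of 3 bits (feature-map window; 0 encodes -1, 1 encodes +1)
    weights -- list of 9 kernel bits, row-major as stored in the parameter memory
    c       -- merged batch normalization parameter C from formula (1)
    """
    weight_rows = (weights[0:3], weights[3:6], weights[6:9])
    # XNOR array: a result bit is 1 exactly when data and weight agree,
    # i.e. when the +/-1 product would be +1.
    xnor_bits = [1 ^ (d ^ w)
                 for row, wrow in zip(rows, weight_rows)
                 for d, w in zip(row, wrow)]
    count = sum(xnor_bits)        # popcount: number of '1' bits
    y2 = count + c                # adder: accumulate count and parameter C
    return 1 if y2 >= 0 else 0    # keep only the sign information

# All-ones window against an all-ones kernel: popcount is 9, so with
# C = -4.5 the binarized activation is +1 (represented as bit 1).
print(binary_conv_window([[1, 1, 1]] * 3, [1] * 9, -4.5))
```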
The k-th pooling layer pools the output feature map of the k-th binarized convolutional layer, and the pooling result serves as the input feature map of the (k+1)-th binarized convolutional layer.
When k = K, the output of the K-th pooling layer is taken as the input feature map of the fully connected classification output layer; the result is computed at the K-th layer of the binarized convolutional neural network, and the accumulation result produced by the adder, after its sign bit is taken, serves as the hardware accelerator's recognition result for the digit in the binarized grayscale picture.
The invention further relates to a data processing method for the hardware accelerator applied to the binarized convolutional neural network, which is carried out by the following steps:
Step 1: construct a binarized convolutional neural network comprising K binarized convolutional layers, K activation-function layers, K batch normalization layers, K pooling layers and one fully connected classification output layer.
Step 2: merge the four training parameters E[A], Var[A], β and γ of the batch normalization layer into one parameter C using the batch normalization transformation of formula (1):
$$C = \frac{1}{2}\left(\frac{\beta\sqrt{\mathrm{Var}[A] + \epsilon}}{\gamma} - E[A] - \mathrm{Vec\_Len}\right) \tag{1}$$
In formula (1), γ is the normalization scale parameter, β is the normalization shift parameter, A is a mini-batch during training, E[A] is the mean of each mini-batch input to the batch normalization layer during one training pass, Var[A] is the variance of each mini-batch input to the batch normalization layer during one training pass, ε is an offset parameter preset before training, and Vec_Len is the bit width of a single input value.
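Formula (1) as printed above is a reconstruction from these definitions (the original equation image is not reproduced). The folding it performs can be derived as follows, assuming γ > 0 (for γ < 0 the comparison direction flips):

```latex
\operatorname{sign}\!\left(
    \gamma\,\frac{x - E[A]}{\sqrt{\mathrm{Var}[A] + \epsilon}} + \beta
\right)
=
\operatorname{sign}\!\left(
    x - E[A] + \frac{\beta\sqrt{\mathrm{Var}[A] + \epsilon}}{\gamma}
\right),
\qquad \gamma > 0.
```

Since the convolution output in the ±1 domain satisfies x = 2·popcount(X₀,₁) − Vec_Len, the entire batch-normalization-and-sign stage collapses to sign(popcount(X₀,₁) + C) with C as in formula (1), which is exactly the adder-plus-sign-bit comparison used in step 11.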
Step 3: acquire a grayscale digit picture of size N×M and binarize it to obtain the input binarized grayscale picture, in which the pixel value "-1" is represented by the data bit "0" and the pixel value "1" by the data bit "1".
Step 4: train the binarized convolutional neural network on the input binarized grayscale picture to obtain the trained parameters of each layer, namely the weight parameters and the batch normalization parameters.
Step 5: construct the hardware acceleration circuit comprising the neural network parameter storage module, the matrix generator module, the convolution calculation array module, the pooling layer module and the global control module, and store the input binarized grayscale picture and the trained parameters of each layer into the neural network parameter storage module.
Step 6: define the current layer number in the binarized convolutional neural network as k and initialize k = 1.
Step 7: during the computation of layer k, the global control module sends the matrix generator module a matrix-generation start signal for the current sliding calculation over several rows of the input feature map; if k = 1 the input feature map is the input binarized grayscale picture, and if k > 1 it is the intermediate feature map produced by the previous layer.
Step 8: after receiving the matrix-generation start signal, the matrix generator module proceeds according to the type of the current sliding calculation:
If the current sliding calculation is an upper-edge calculation: on the rising edge of the first cycle, the data of the first row of the input feature map under the current sliding calculation are read from the neural network parameter storage module into the first register of the matrix generator's data register group; on the rising edge of the second cycle, the first register is moved into the second register and the second row is read into the first register.
If the current sliding calculation is a non-edge calculation: on the rising edge of the first cycle, the first row is read into the first register; on the rising edge of the second cycle, the first register is moved into the second register and the second row is read into the first register; on the rising edge of the third cycle, the second register is moved into the third register, the first register into the second register, and the third row is read into the first register.
If the current sliding calculation is a lower-edge calculation: on the rising edge of the first cycle, the third-to-last row held in the third register is shifted out, the second-to-last row in the second register is moved into the third register, and the last row in the first register is moved into the second register.
Step 9: during the computation of layer k, the convolution calculation array module takes the data output by the matrix generator module under the current sliding calculation as input data, reads the trained parameters for the current sliding calculation from the neural network parameter storage module, and processes them according to the type of the current sliding calculation to obtain the XNOR result:
If the current sliding calculation is an upper-edge and left-edge calculation, XNOR bits [4:5] of the weight parameter with bits [0:1] of the data from the second register, and XNOR bits [7:8] of the weight parameter with bits [0:1] of the data from the first register.
If the current sliding calculation is an upper-edge calculation but not a left- or right-edge calculation, XNOR bits [3:5] of the weight parameter with bits [a:a+2] of the feature-map data from the second register, and XNOR bits [6:8] of the weight parameter with bits [a:a+2] of the data from the first register, where 0 ≤ a ≤ M.
If the current sliding calculation is an upper-edge and right-edge calculation, XNOR bits [3:4] of the weight parameter with bits [M-1:M] of the data from the second register, and XNOR bits [6:7] of the weight parameter with bits [M-1:M] of the data from the first register.
If the current sliding calculation is neither an upper- nor a lower-edge calculation but is a left-edge calculation, XNOR bits [1:2] of the weight parameter with bits [0:2] of the data from the third register, XNOR bits [4:5] of the weight parameter with bits [0:2] of the data from the second register, and XNOR bits [7:8] of the weight parameter with bits [0:2] of the data from the first register.
If the current sliding calculation is neither an upper-, lower-, left- nor right-edge calculation, XNOR bits [0:2] of the weight parameter with bits [c:c+2] of the data from the third register, XNOR bits [3:5] of the weight parameter with bits [c:c+2] of the data from the second register, and XNOR bits [6:8] of the weight parameter with bits [c:c+2] of the data from the first register, where 0 ≤ c ≤ M.
If the current sliding calculation is neither an upper- nor a lower-edge calculation but is a right-edge calculation, XNOR bits [0:1] of the weight parameter with bits [M-1:M] of the data from the third register, XNOR bits [3:4] of the weight parameter with bits [M-1:M] of the data from the second register, and XNOR bits [6:7] of the weight parameter with bits [M-1:M] of the data from the first register.
If the current sliding calculation is a lower-edge and left-edge calculation, XNOR bits [1:2] of the weight parameter with bits [0:1] of the data from the third register, and XNOR bits [4:5] of the weight parameter with bits [0:1] of the data from the second register.
If the current sliding calculation is a lower-edge calculation but not a left- or right-edge calculation, XNOR bits [0:2] of the weight parameter with bits [b:b+2] of the data from the third register, and XNOR bits [3:5] of the weight parameter with bits [b:b+2] of the data from the second register, where 0 ≤ b ≤ M.
If the current sliding calculation is a lower-edge and right-edge calculation, XNOR bits [0:1] of the weight parameter with bits [M-1:M] of the data from the third register, and XNOR bits [3:4] of the weight parameter with bits [M-1:M] of the data from the second register.
Step 10: count the "1" bits in the XNOR result to obtain the count.
Step 11: accumulate the count with the batch normalization parameter according to formula (2) to obtain the accumulated result Y₂; taking its sign bit yields an intermediate result:
$$Y_2 = \mathrm{popcount}(X_{0,1}) + C \tag{2}$$
In formula (2), X₀,₁ is a single binarized input value, and popcount(X₀,₁) denotes the number of "1" bits in X₀,₁.
Step 12: take the intermediate result as the output feature map of the k-th binarized convolutional layer, and pool the output feature map of the k-th binarized convolutional layer with the k-th pooling layer to obtain the k-th pooling result.
Step 13: take the k-th pooling result as the input feature map of the (k+1)-th binarized convolutional layer, assign k+1 to k, and judge whether k > K holds; if so, the output of the K-th pooling layer is the input feature map of the fully connected classification output layer.
Step 14: process the input feature map of the fully connected classification output layer according to steps 7 to 11; the resulting intermediate result is the hardware accelerator's recognition result for the digit in the binarized grayscale picture.
Compared with the prior art, the invention has the following beneficial technical effects:
1. The invention binarizes the pictures of the training and testing data sets and unifies the overall hardware architecture, so that no floating-point numbers appear anywhere in training or testing, and the convolution results are obtained by simple logic operations. This reduces the storage space occupied by the input data and the resources consumed by the convolution operations when extracting features from an input picture, making the network lighter and accelerating its operation to a certain extent.
2. The invention further transforms the existing batch normalization formula by exploiting the full binarization of both the feature data and the weight parameters, so that a result that previously required a complex batch normalization calculation can be obtained by a simple hardware comparison circuit. During the transformation, the four batch normalization training parameters are merged into one, so the hardware only needs to pre-store a single parameter instead of four, saving precious on-chip storage resources and simplifying the hardware design.
3. On the basis of the existing 6:3 compression tree, the invention proposes a 6:2 compression tree that uses hardware resources more efficiently. Combined with the edge-skipping calculation array, the proposed 6:2 compression lets the XNOR array make full use of LUT resources during convolution. For a non-edge window, extracting features with one convolution kernel requires three 6-input/2-output calculation units, i.e. three LUTs. For a non-corner edge window, the edge-skipping design means only 6 bits of data and parameters are computed at a time, so two 6-input/2-output units (two LUTs) suffice. For the four corner windows (upper-left, upper-right, lower-left, lower-right), the convolution has only 4 bits of input yet still requires two 6-input/2-output units (two LUTs), so the LUTs are not fully utilized at these four positions; however, the wasted computing resources are small relative to those used by the whole design, and the waste is less than with the existing 6:3 compression. A truth-table sketch of the 6:2 unit follows below.
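As an illustration of why two output bits suffice, the following Python sketch models one 6-input/2-output unit: three weight bits and three data bits go in, and the count of XNOR-matching positions (at most 3, hence exactly 2 bits) comes out. This is a truth-table illustration of the idea, not vendor LUT primitive code:

```python
def compressor_6in_2out(w3: int, d3: int) -> tuple[int, int]:
    """6-in/2-out unit: 3 weight bits and 3 data bits in, 2-bit count out.

    The number of XNOR matches over 3 bit pairs is 0..3, which always
    fits in two output bits; this is the resource argument for 6:2
    compression over the 6:3 scheme.
    """
    matches = bin(~(w3 ^ d3) & 0b111).count("1")   # XNOR, then popcount
    return matches >> 1, matches & 1                # (high bit, low bit)

# An interior 3x3 window uses three such units (one per kernel row);
# the edge and corner windows described above get by with two.
for w3, d3 in [(0b101, 0b101), (0b101, 0b010), (0b111, 0b011)]:
    hi, lo = compressor_6in_2out(w3, d3)
    print(f"w={w3:03b} d={d3:03b} -> count={(hi << 1) | lo}")
```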
4. By adopting an independently designed edge-unit calculation array, the invention avoids the increase in data volume and the occupation of LUT resources that conventional FPGA-deployed convolutional neural networks incur by zero padding during edge calculation. With this edge-unit design, the feature-map data of each layer stored in DDR need not be stored in zero-padded form, reducing the occupied storage resources. There are four types of edge-unit design, for the upper, lower, left and right edges; the specific edge calculation processes are shown in FIGS. 6 and 7.
In summary, unlike a conventional BNN whose input is a full-precision grayscale image, the invention binarizes the input grayscale image; on the basis of the existing 6:3 compression tree, it designs a more resource-efficient and computationally faster 6:2 compression tree and combines it with the edge-skipping unit design, reducing the amount of data to be computed and increasing the calculation speed; and it optimizes the batch normalization calculation so that the result is obtained by a simple comparison, eliminating multiplications, additions and similar operations.
Drawings
FIG. 1 is a flow chart of the convolutional neural network for handwritten digit recognition employed by the present invention;
FIG. 2 is a hardware architecture diagram of the binarized convolutional neural network accelerator employed by the present invention;
FIG. 3 is a diagram of the parameter memory employed by the present invention;
FIG. 4 is a data flow diagram of the matrix generator employed by the present invention;
FIG. 5 is a structure diagram of the 6:2 compression tree employed by the present invention;
FIG. 6 is a data flow diagram of the upper-edge skipping unit design of the present invention;
FIG. 7 is a data flow diagram of the lower-edge skipping unit design of the present invention;
FIG. 8 is a flow chart of the convolution of the present invention.
Detailed Description
In this embodiment, the binarized convolutional neural network includes K binarized convolutional layers, K activation-function layers, K batch normalization layers, K pooling layers and one fully connected classification output layer; the training parameters of each batch normalization layer are merged into a single parameter.
the convolutional neural network used in this embodiment is a handwritten number recognition network, and the structure diagram is shown in fig. 1, and the structure of the convolutional neural network includes an input layer, two convolutional layers, two pooling layers, and a full-link layer. The first layer of calculation is convolutional layer calculation, the input layer is 784 neural nodes, the size of a convolutional kernel is 3 multiplied by 3, the convolution step length is 1, 16 characteristic graphs of 28 multiplied by 28 are output, and 12544 neural nodes are total; the second layer is calculated for a pooling layer, the input layer is 16 characteristic maps of 28 × 28 output by the first convolutional layer, and the pooling kernel adopts a characteristic map of 2 × 2 in size and 2 in step size and outputs 16 characteristic maps of 14 × 14, so that 3136 neurons are total; the third layer is convolution layer calculation, the input layer is 16 14 × 14 feature maps output by the pooling layer, the size of a convolution kernel adopts 3 × 3, the convolution step size is 1, 32 14 × 14 feature maps are output, and 6272 neurons are output; the fourth layer is the calculation of the pooling layer, the input layer is 32 characteristic maps of 14 × 14 output by the second convolution layer, the pooling kernel adopts 2 × 2 in size and 2 in step length, 32 characteristic maps of 7 × 7 are output, and 1568 neurons are obtained in total; the fifth layer is a full connection layer calculation, the input layer is 32 7 multiplied by 7 characteristic graphs output by the second pooling layer, and 10 neurons are output; activation operations and batch normalization operations are performed between the convolutional layer and the pooling layer.
The binarized convolutional neural network is trained on input binarized grayscale pictures of size N×M. This embodiment mainly uses the MNIST (Modified National Institute of Standards and Technology) handwritten digit database, compiled by researchers at the Courant Institute of New York University and at Google Labs. It comprises a training set of 60000 handwritten digit images and a test set of 10000 images. Each MNIST image is 28×28 pixels, i.e. the number of input-layer nodes is 784. After training, the trained parameters of each layer are obtained.
The hardware accelerator applied to the binarized convolutional neural network comprises a neural network parameter storage module, a matrix generator module, a convolution calculation array module, a pooling layer module and a global control module; the specific hardware structure block diagram is shown in FIG. 2.
The neural network parameter storage module pre-stores the input binarized grayscale pictures and the trained parameters of each layer, namely the weight parameters and the batch normalization parameters. In the stored pictures, the pixel value "-1" is represented by the data bit "0" and the pixel value "1" by the data bit "1". The storage layout of the feature-map data, the weight parameters and the batch normalization data is shown in FIG. 3.
The feature-map data are stored in (N, C, H, W) order, so each address in the DDR holds one row of a feature map: the first row of the first feature map is stored at the initial address, its second row at the second address, and so on until all data of the first feature map are stored; the first row of the second feature map follows, and all feature-map data of the layer are stored by the same rule.
the convolution kernel data is stored in the order of (nc (w × h)), since the convolution kernel size is fixed to 3 × 3, while the weight parameter is quantized to 1bit, therefore, the data size of each channel of one convolution kernel is only 9 bits, so that the data of the first channel of the first convolution kernel is spliced into 9-bit data according to three lines of data (the first line of 3-bit data is placed at the upper three bits, the second line of 3-bit data is stored at the rear three bits, and the third line of 3-bit data is stored at the lower three bits) to occupy the first address mode for storage, the second address stores the 9-bit data of the second channel of the first convolution kernel, after all the data of the first convolution kernel are stored in sequence, and then storing the data of the first channel of the second convolution kernel, storing the data of the second channel of the second convolution kernel at the next address, and sequentially storing all the convolution kernel data of the layer.
The batch normalization parameters are stored in the following order: the normalization parameter corresponding to the first convolution kernel at an initial address, then the parameter corresponding to the second kernel, and so on until all normalization parameters of the layer are stored.
The matrix generator module consists of a data input circuit, a feature-map data register group and a data output circuit.
Define the current layer number in the binarized convolutional neural network as k, initialized to k = 1.
During the computation of layer k, the global control module sends the matrix generator module a matrix-generation start signal for the current sliding calculation over several rows of the input feature map; if k = 1 the input feature map is the input binarized grayscale picture, and if k > 1 it is the intermediate feature map produced by the previous layer.
After receiving the matrix-generation start signal, the data input circuit proceeds according to the type of the current sliding calculation, as shown in FIG. 4:
If the current sliding calculation is an upper-edge calculation: on the rising edge of the first cycle, the data of the first row of the input feature map under the current sliding calculation are read from the neural network parameter storage module into the first register of the data register group. On the rising edge of the second cycle, the first register is moved into the second register, the second row is read from the parameter storage module into the first register, and the data output circuit outputs the contents of the first and second registers.
If the current sliding calculation is a non-edge calculation: on the rising edge of the first cycle, the first row is read into the first register. On the rising edge of the second cycle, the first register is moved into the second register and the second row is read into the first register. On the rising edge of the third cycle, the second register is moved into the third register, the first register into the second register, the third row is read into the first register, and the data output circuit outputs the contents of all three registers.
If the current sliding calculation is a lower-edge calculation: on the rising edge of the first cycle, the third-to-last row held in the third register is shifted out, the second-to-last row in the second register is moved into the third register, the last row in the first register is moved into the second register, and the data output circuit outputs the contents of the second and third registers.
The convolution calculation array module consists of a parameter reading module, a 6:2 compression tree module and an output interface module.
During the computation of layer k, the parameter reading module takes the data output by the data output circuit under the current sliding calculation as input data, reads the trained parameters for the current sliding calculation from the neural network parameter storage module, and sends both to the 6:2 compression tree module.
The 6:2 compression tree module consists of an XNOR gate array, a popcount module and an adder.
The specific structure of the 6:2 compression tree is shown in FIG. 5: 3 weight bits and the corresponding 3 bits of feature-map data form one group that is fed into a calculation unit, which completes the XNOR operation and the popcount calculation to obtain the convolution result of that operation.
The XNOR gate array receives the input data and the trained parameters under the current sliding calculation and processes them according to the type of the current sliding calculation to obtain the XNOR result; the upper-edge and lower-edge calculations are illustrated in FIGS. 6 and 7 respectively:
If the current sliding calculation is an upper-edge and left-edge calculation, XNOR bits [4:5] of the weight parameter with bits [0:1] of the data from the second register, and XNOR bits [7:8] of the weight parameter with bits [0:1] of the data from the first register.
If the current sliding calculation is an upper-edge calculation but not a left- or right-edge calculation, XNOR bits [3:5] of the weight parameter with bits [a:a+2] of the feature-map data from the second register, and XNOR bits [6:8] of the weight parameter with bits [a:a+2] of the data from the first register, where 0 ≤ a ≤ M.
If the current sliding calculation is an upper-edge and right-edge calculation, XNOR bits [3:4] of the weight parameter with bits [M-1:M] of the data from the second register, and XNOR bits [6:7] of the weight parameter with bits [M-1:M] of the data from the first register.
If the current sliding calculation is neither an upper- nor a lower-edge calculation but is a left-edge calculation, XNOR bits [1:2] of the weight parameter with bits [0:2] of the data from the third register, XNOR bits [4:5] of the weight parameter with bits [0:2] of the data from the second register, and XNOR bits [7:8] of the weight parameter with bits [0:2] of the data from the first register.
If the current sliding calculation is neither an upper-, lower-, left- nor right-edge calculation, XNOR bits [0:2] of the weight parameter with bits [c:c+2] of the data from the third register, XNOR bits [3:5] of the weight parameter with bits [c:c+2] of the data from the second register, and XNOR bits [6:8] of the weight parameter with bits [c:c+2] of the data from the first register, where 0 ≤ c ≤ M.
If the current sliding calculation is neither an upper- nor a lower-edge calculation but is a right-edge calculation, XNOR bits [0:1] of the weight parameter with bits [M-1:M] of the data from the third register, XNOR bits [3:4] of the weight parameter with bits [M-1:M] of the data from the second register, and XNOR bits [6:7] of the weight parameter with bits [M-1:M] of the data from the first register.
If the current sliding calculation is a lower-edge and left-edge calculation, XNOR bits [1:2] of the weight parameter with bits [0:1] of the data from the third register, and XNOR bits [4:5] of the weight parameter with bits [0:1] of the data from the second register.
If the current sliding calculation is a lower-edge calculation but not a left- or right-edge calculation, XNOR bits [0:2] of the weight parameter with bits [b:b+2] of the data from the third register, and XNOR bits [3:5] of the weight parameter with bits [b:b+2] of the data from the second register, where 0 ≤ b ≤ M.
If the current sliding calculation is a lower-edge and right-edge calculation, XNOR bits [0:1] of the weight parameter with bits [M-1:M] of the data from the third register, and XNOR bits [3:4] of the weight parameter with bits [M-1:M] of the data from the second register.
The popcount module counts the "1" bits in the XNOR result to obtain a count.
The adder accumulates the count with the batch normalization parameter, and the sign bit of the resulting sum is taken as the output feature map of the k-th binarized convolutional layer; the overall calculation flow is shown in FIG. 8.
The k-th pooling layer pools the output feature map of the k-th binarized convolutional layer, and the pooling result serves as the input feature map of the (k+1)-th binarized convolutional layer; a hedged sketch of this stage follows below.
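The pooling circuit itself is not elaborated in the text. Assuming the usual max pooling of a binarized network (an assumption, not a statement of the original), the 0/1 encoding makes a 2×2 stride-2 pooling window degenerate into a 4-input OR, since any +1 (bit 1) in the window dominates the maximum:

```python
def pool_2x2_or(fmap):
    """2x2 stride-2 pooling of a binarized feature map in the 0/1 encoding.

    Assumes max pooling (an assumption, not stated in the text): the max
    over {-1, +1} values is +1 iff any input is +1, i.e. a 4-input OR.
    """
    return [[fmap[r][c] | fmap[r][c + 1] | fmap[r + 1][c] | fmap[r + 1][c + 1]
             for c in range(0, len(fmap[0]), 2)]
            for r in range(0, len(fmap), 2)]

print(pool_2x2_or([[0, 1, 0, 0],
                   [0, 0, 0, 0],
                   [1, 1, 0, 1],
                   [0, 0, 0, 0]]))   # [[1, 0], [1, 1]]
```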
When k = K, the output of the K-th pooling layer is taken as the input feature map of the fully connected classification output layer; the result is computed at the K-th layer of the binarized convolutional neural network, and the accumulation result produced by the adder, after its sign bit is taken, serves as the hardware accelerator's recognition result for the digit in the binarized grayscale picture.
In this embodiment, the data processing method of the hardware accelerator applied to the binarized convolutional neural network is carried out by the following steps:
Step 1: construct a binarized convolutional neural network comprising K binarized convolutional layers, K activation-function layers, K batch normalization layers, K pooling layers and one fully connected classification output layer.
Step 2: merge the four training parameters E[A], Var[A], β and γ of the batch normalization layer into one parameter C using the batch normalization transformation of formula (1):
$$C = \frac{1}{2}\left(\frac{\beta\sqrt{\mathrm{Var}[A] + \epsilon}}{\gamma} - E[A] - \mathrm{Vec\_Len}\right) \tag{1}$$
In formula (1), γ is the normalization scale parameter, β is the normalization shift parameter, A is a mini-batch during training, E[A] is the mean of each mini-batch input to the batch normalization layer during one training pass, Var[A] is the variance of each mini-batch input to the batch normalization layer during one training pass, ε is an offset parameter preset before training, and Vec_Len is the bit width of a single input value; E[A], Var[A], γ and β are thereby merged into the single variable C through formula (1).
Step 3: acquire a grayscale digit picture of size N×M and binarize it to obtain the input binarized grayscale picture, in which the pixel value "-1" is represented by the data bit "0" and the pixel value "1" by the data bit "1".
Step 4: train the binarized convolutional neural network on the input binarized grayscale picture to obtain the trained parameters of each layer, namely the weight parameters and the batch normalization parameters.
Step 5: construct the hardware acceleration circuit comprising the neural network parameter storage module, the matrix generator module, the convolution calculation array module, the pooling layer module and the global control module, and store the input binarized grayscale picture and the trained parameters of each layer into the neural network parameter storage module.
Step 6: define the current layer number in the binarized convolutional neural network as k and initialize k = 1.
Step 7: during the computation of layer k, the global control module sends the matrix generator module a matrix-generation start signal for the current sliding calculation over several rows of the input feature map; if k = 1 the input feature map is the input binarized grayscale picture, and if k > 1 it is the intermediate feature map produced by the previous layer.
and 8, after receiving the matrix generation starting signal, the matrix generator module respectively processes according to the type of the current sliding calculation:
if the type of the current sliding calculation is upper-edge calculation, on the rising edge of the first period the data of the first row of the input feature map under the current sliding calculation is read out of the neural network parameter storage module and stored into the first register of the matrix generator's own data register group; on the rising edge of the second period the data in the first register is moved into the second register, and the data of the second row of the input feature map under the current sliding calculation is read out of the neural network parameter storage module and stored into the first register;
if the type of the current sliding calculation is non-edge calculation, on the rising edge of the first period the data of the first row of the input feature map under the current sliding calculation is read out of the neural network parameter storage module and stored into the first register; on the rising edge of the second period the data in the first register is moved into the second register, and the data of the second row is read out of the neural network parameter storage module and stored into the first register; on the rising edge of the third period the data in the second register is moved into the third register, the data in the first register is moved into the second register, and the data of the third row is read out of the neural network parameter storage module and stored into the first register;
if the type of the current sliding calculation is lower-edge calculation, on the rising edge of the first period the data of the third-to-last row of the input feature map held in the third register is shifted out, the data of the second-to-last row held in the second register is moved into the third register, and the data of the last row held in the first register is moved into the second register;
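The three cases above amount to a three-entry row buffer that shifts once per clock edge. The following Python sketch models that behavior; the class name, method names and the edge_type strings are illustrative assumptions, since the patent specifies only the register-to-register moves.

```python
class RowBuffer:
    """Behavioral model of the matrix generator's three-row register group."""

    def __init__(self):
        # reg[0] = first register, reg[1] = second, reg[2] = third
        self.reg = [None, None, None]

    def push_row(self, row):
        # One rising edge: third <= second, second <= first, first <= new row.
        self.reg[2] = self.reg[1]
        self.reg[1] = self.reg[0]
        self.reg[0] = row

    def shift_only(self):
        # Lower-edge case: the oldest row is shifted out; no new row is read.
        self.reg[2] = self.reg[1]
        self.reg[1] = self.reg[0]
        self.reg[0] = None

    def window(self, edge_type):
        # Upper and lower edges expose two valid rows; non-edge exposes three.
        if edge_type == "upper":
            return [self.reg[1], self.reg[0]]
        if edge_type == "lower":
            return [self.reg[2], self.reg[1]]
        return [self.reg[2], self.reg[1], self.reg[0]]
```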
step 9, for the k-th layer calculation of the binary convolution neural network, the convolution calculation array module takes the data output by the matrix generator module under the current sliding calculation as input data, reads the neural network training parameters for the current sliding calculation from the neural network parameter storage module, and processes according to the type of the current sliding calculation to obtain the XNOR results (a bit-slice sketch follows the nine cases below):
if the type of the current sliding calculation is upper-edge and left-edge calculation, XNOR bits [4:5] of the weight parameter with bits [0:1] of the data given by the second register, and XNOR bits [7:8] of the weight parameter with bits [0:1] of the data given by the first register;
if the type of the current sliding calculation is upper-edge and non-left/right-edge calculation, XNOR bits [3:5] of the weight parameter with bits [a:a+2] of the feature-map data given by the second register, and XNOR bits [6:8] of the weight parameter with bits [a:a+2] of the data given by the first register, where 0 ≤ a ≤ M;
if the type of the current sliding calculation is upper-edge and right-edge calculation, XNOR bits [3:4] of the weight parameter with bits [M-1:M] of the data given by the second register, and XNOR bits [6:7] of the weight parameter with bits [M-1:M] of the data given by the first register;
if the type of the current sliding calculation is non-upper/lower-edge and left-edge calculation, XNOR bits [1:2] of the weight parameter with bits [0:1] of the data given by the third register, XNOR bits [4:5] of the weight parameter with bits [0:1] of the data given by the second register, and XNOR bits [7:8] of the weight parameter with bits [0:1] of the data given by the first register;
if the type of the current sliding calculation is non-upper/lower-edge and non-left/right-edge calculation, XNOR bits [0:2] of the weight parameter with bits [c:c+2] of the data given by the third register, XNOR bits [3:5] of the weight parameter with bits [c:c+2] of the data given by the second register, and XNOR bits [6:8] of the weight parameter with bits [c:c+2] of the data given by the first register, where 0 ≤ c ≤ M;
if the type of the current sliding calculation is non-upper/lower-edge and right-edge calculation, XNOR bits [0:1] of the weight parameter with bits [M-1:M] of the data given by the third register, XNOR bits [3:4] of the weight parameter with bits [M-1:M] of the data given by the second register, and XNOR bits [6:7] of the weight parameter with bits [M-1:M] of the data given by the first register;
if the type of the current sliding calculation is lower-edge and left-edge calculation, XNOR bits [1:2] of the weight parameter with bits [0:1] of the data given by the third register, and XNOR bits [4:5] of the weight parameter with bits [0:1] of the data given by the second register;
if the type of the current sliding calculation is lower-edge and non-left/right-edge calculation, XNOR bits [0:2] of the weight parameter with bits [b:b+2] of the data given by the third register, and XNOR bits [3:5] of the weight parameter with bits [b:b+2] of the data given by the second register, where 0 ≤ b ≤ M;
if the type of the current sliding calculation is lower-edge and right-edge calculation, XNOR bits [0:1] of the weight parameter with bits [M-1:M] of the data given by the third register, and XNOR bits [3:4] of the weight parameter with bits [M-1:M] of the data given by the second register;
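A minimal bit-slice helper in Python, showing how any of the nine cases above reduces to one XNOR between a slice of the 9-bit weight word and a slice of a register row. Representing rows and weights as Python ints, and the (lo, width) slicing convention, are illustrative assumptions only.

```python
def xnor_slice(weights, data, w_lo, d_lo, width):
    # Extract `width` bits starting at bit w_lo of the weight word and at
    # bit d_lo of the register row, then XNOR them (bitwise NOT of XOR,
    # masked back down to `width` bits).
    mask = (1 << width) - 1
    w = (weights >> w_lo) & mask
    d = (data >> d_lo) & mask
    return (~(w ^ d)) & mask

# Non-edge case: weight bits [0:2], [3:5], [6:8] against bits [c:c+2] of the
# rows held in the third, second and first registers respectively.
def xnor_window(weights, row3, row2, row1, c):
    return (xnor_slice(weights, row3, 0, c, 3),
            xnor_slice(weights, row2, 3, c, 3),
            xnor_slice(weights, row1, 6, c, 3))
```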
step 10, counting the '1's in the XNOR results to obtain a counting result;
step 11, simplifying formula (2) into formula (3) by means of formula (1), and using formula (3) to accumulate the counting result with the batch normalization parameter, obtaining an accumulated result Y_2 whose sign bit is taken to give the intermediate result:
Y_1 = γ · ( 2·popcount(X_{0,1}) − Vec_Len − E[A] ) / √(Var[A] + ε) + β        (2)
Y_2 = popcount(X_{0,1}) − C        (3)
(for γ > 0, the sign bit of Y_2 equals the sign bit of Y_1, with C given by formula (1))
in formulas (2) and (3), X_{0,1} is a single input value in binarized (0/1) form, and popcount(X_{0,1}) denotes the number of '1's in X_{0,1}.
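Steps 10 and 11 together amount to one popcount plus one comparison. A hedged Python sketch, assuming the reconstructed formula (3) above; the function name is illustrative.

```python
def binarized_output(xnor_result, c_threshold):
    # Step 10: count the '1's in the XNOR result (an int used as a bit vector).
    ones = bin(xnor_result).count("1")
    # Step 11 / formula (3): Y2 = popcount - C; its sign bit is the
    # binarized intermediate result (1 for non-negative, 0 for negative).
    y2 = ones - c_threshold
    return 1 if y2 >= 0 else 0
```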
step 12, taking the intermediate result as the output feature map of the k-th binarized convolution layer, and pooling that output feature map with the k-th pooling layer to obtain the k-th layer pooling result (a sketch of binary pooling follows);
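On binarized feature maps, max pooling degenerates to a logical OR over the window, which is one reason BNN pooling is cheap in hardware. A small sketch, assuming a 2×2 window with stride 2, which the patent does not fix in this passage:

```python
def pool2x2_binary(fmap):
    # Max pooling over 0/1 values is a bitwise OR of the 2x2 window.
    rows, cols = len(fmap), len(fmap[0])
    return [[fmap[r][c] | fmap[r][c + 1] | fmap[r + 1][c] | fmap[r + 1][c + 1]
             for c in range(0, cols - 1, 2)]
            for r in range(0, rows - 1, 2)]
```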
step 13, taking the k-th layer pooling result as the input feature map of the (k+1)-th binarized convolution layer, assigning k+1 to k, and judging whether k > K holds; if so, the output result of the K-th pooling layer is the input feature map of the fully-connected classification output layer;
and step 14, processing the input feature map of the fully-connected classification output layer through steps 7 to 11; the intermediate result so obtained is the hardware accelerator's recognition result for the digit in the binarized grayscale picture.

Claims (2)

1. A hardware accelerator applied to a binary convolution neural network, characterized in that: the binary convolution neural network comprises K binarized convolution layers, K activation function layers, K batch normalization layers, K pooling layers and one fully-connected classification output layer, the training parameters of each batch normalization layer being combined into a single parameter;
the binary convolution neural network is trained on an input binarized grayscale picture of size N×M to obtain the neural network training parameters of each layer;
the hardware acceleration circuit comprises: the device comprises a neural network parameter storage module, a matrix generator module, a convolution calculation array module, a pooling layer module and a global control module;
the neural network parameter storage module pre-stores the input binarized grayscale picture and each layer's neural network training parameters; the neural network training parameters include weight parameters and batch normalization parameters; in the stored input binarized grayscale picture, data '0' represents a pixel value of '-1' and data '1' represents a pixel value of '1';
the matrix generator module consists of a data input circuit, a feature-map data register group and a data output circuit;
defining the current layer number in the binary convolution neural network as k, and initializing k to be 1;
for the k-th layer calculation of the binary convolution neural network, the global control module sends the first matrix-generation start signal to the matrix generator module to launch the first sliding calculation over several rows of the input feature map; if k = 1, the input feature map is the input binarized grayscale picture, and if k > 1, it is the intermediate feature map produced by the previous layer;
after the data input circuit receives the matrix generation start signal, it processes according to the type of the current sliding calculation:
if the type of the current sliding calculation is upper-edge calculation, on the rising edge of the first period the data of the first row of the input feature map under the current sliding calculation is read out of the neural network parameter storage module and stored into the first register of the data register group; on the rising edge of the second period the data in the first register is moved into the second register, the data of the second row is read out of the neural network parameter storage module and stored into the first register, and the data output circuit outputs the data in the first and second registers;
if the type of the current sliding calculation is non-edge calculation, on the rising edge of the first period the data of the first row of the input feature map under the current sliding calculation is read out of the neural network parameter storage module and stored into the first register of the data register group; on the rising edge of the second period the data in the first register is moved into the second register, and the data of the second row is read out of the neural network parameter storage module and stored into the first register; on the rising edge of the third period the data in the second register is moved into the third register, the data in the first register is moved into the second register, the data of the third row is read out of the neural network parameter storage module and stored into the first register, and the data output circuit outputs the data in the first, second and third registers;
if the type of the current sliding calculation is lower-edge calculation, on the rising edge of the first period the data of the third-to-last row of the input feature map held in the third register is shifted out, the data of the second-to-last row held in the second register is moved into the third register, the data of the last row held in the first register is moved into the second register, and the data output circuit outputs the data in the second and third registers;
the convolution calculation array module consists of a parameter reading module, a 6:2 compression tree module and an output interface module;
for the k-th layer calculation of the binary convolution neural network, the parameter reading module takes the data output by the data output circuit under the current sliding calculation as input data, reads the neural network training parameters for the current sliding calculation from the neural network parameter storage module, and sends them to the 6:2 compression tree module;
the 6:2 compression tree module consists of an XNOR gate array, a popcount module and an adder;
the XNOR gate array receives the input data and the neural network training parameters under the current sliding calculation and processes according to the type of the current sliding calculation to obtain the XNOR results:
if the type of the current sliding calculation is upper-edge and left-edge calculation, XNOR bits [4:5] of the weight parameter with bits [0:1] of the data given by the second register, and XNOR bits [7:8] of the weight parameter with bits [0:1] of the data given by the first register;
if the type of the current sliding calculation is upper-edge and non-left/right-edge calculation, XNOR bits [3:5] of the weight parameter with bits [a:a+2] of the feature-map data given by the second register, and XNOR bits [6:8] of the weight parameter with bits [a:a+2] of the data given by the first register, where 0 ≤ a ≤ M;
if the type of the current sliding calculation is upper-edge and right-edge calculation, XNOR bits [3:4] of the weight parameter with bits [M-1:M] of the data given by the second register, and XNOR bits [6:7] of the weight parameter with bits [M-1:M] of the data given by the first register;
if the type of the current sliding calculation is non-upper/lower-edge and left-edge calculation, XNOR bits [1:2] of the weight parameter with bits [0:1] of the data given by the third register, XNOR bits [4:5] of the weight parameter with bits [0:1] of the data given by the second register, and XNOR bits [7:8] of the weight parameter with bits [0:1] of the data given by the first register;
if the type of the current sliding calculation is non-upper/lower-edge and non-left/right-edge calculation, XNOR bits [0:2] of the weight parameter with bits [c:c+2] of the data given by the third register, XNOR bits [3:5] of the weight parameter with bits [c:c+2] of the data given by the second register, and XNOR bits [6:8] of the weight parameter with bits [c:c+2] of the data given by the first register, where 0 ≤ c ≤ M;
if the type of the current sliding calculation is non-upper/lower-edge and right-edge calculation, XNOR bits [0:1] of the weight parameter with bits [M-1:M] of the data given by the third register, XNOR bits [3:4] of the weight parameter with bits [M-1:M] of the data given by the second register, and XNOR bits [6:7] of the weight parameter with bits [M-1:M] of the data given by the first register;
if the type of the current sliding calculation is lower-edge and left-edge calculation, XNOR bits [1:2] of the weight parameter with bits [0:1] of the data given by the third register, and XNOR bits [4:5] of the weight parameter with bits [0:1] of the data given by the second register;
if the type of the current sliding calculation is lower-edge and non-left/right-edge calculation, XNOR bits [0:2] of the weight parameter with bits [b:b+2] of the data given by the third register, and XNOR bits [3:5] of the weight parameter with bits [b:b+2] of the data given by the second register, where 0 ≤ b ≤ M;
if the type of the current sliding calculation is lower-edge and right-edge calculation, XNOR bits [0:1] of the weight parameter with bits [M-1:M] of the data given by the third register, and XNOR bits [3:4] of the weight parameter with bits [M-1:M] of the data given by the second register;
the popcount module counts the '1's in the XNOR results to obtain a counting result;
the adder accumulates the counting result with the batch normalization parameter, and the sign bit of the resulting accumulated value is taken as the output feature map of the k-th binarized convolution layer (a functional sketch of the compression-tree popcount follows);
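The claim names the 6:2 compression tree only structurally and does not spell out its wiring; functionally it reduces the XNOR output bits to a count in logarithmic depth rather than with a sequential counter. A hedged Python stand-in that models only this input/output behavior (names illustrative, not claim language):

```python
def popcount_adder_tree(bits):
    # Reduce a list of 0/1 XNOR output bits with a balanced adder tree:
    # each level sums adjacent pairs, halving the operand count, which is
    # the behavior a carry-save compressor tree implements in hardware.
    vals = list(bits)
    if not vals:
        return 0
    while len(vals) > 1:
        if len(vals) % 2:
            vals.append(0)  # pad odd operand counts
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]
```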
the k-th pooling layer pools the output feature map of the k-th binarized convolution layer, and the obtained pooling result serves as the input feature map of the (k+1)-th binarized convolution layer;
and when k equals K, the output result of the (K-1)-th pooling layer is taken as the input feature map of the fully-connected classification output layer; at the K-th layer calculation of the binary convolution neural network, the sign bit of the accumulated result obtained by the adder is taken as the hardware accelerator's recognition result for the digit in the binarized grayscale picture.
2. A data processing method of a hardware accelerator applied to a binary convolution neural network is characterized by comprising the following steps:
step 1, constructing a binary convolution neural network comprising: K binarized convolution layers, K activation function layers, K batch normalization layers, K pooling layers and one fully-connected classification output layer;
step 2, combining the four training parameters E [ A ], Var [ A ], β and γ of the batch normalization layer into one parameter C using the batch normalization transformation shown in formula (1):
C = ( Vec_Len + E[A] − (β/γ) · √(Var[A] + ε) ) / 2        (1)
in formula (1), γ is the batch normalization scale parameter, β is the batch normalization shift parameter, A denotes a mini-batch during training, E [ A ] is the mean of each mini-batch input to the batch normalization layer during one network training pass, Var [ A ] is the variance of each mini-batch input to the batch normalization layer during one network training pass, ε is an offset parameter preset before network training, and Vec _ Len is the bit width of a single input value;
step 3, acquiring a grayscale picture of a digit with size N×M and binarizing it to obtain the input binarized grayscale picture, in which data '0' represents a pixel value of '-1' and data '1' represents a pixel value of '1';
step 4, training the binary convolution neural network on the input binarized grayscale picture to obtain the neural network training parameters of each layer; the neural network training parameters include: weight parameters and batch normalization parameters;
step 5, constructing the hardware acceleration circuit comprising: the neural network parameter storage module, the matrix generator module, the convolution calculation array module, the pooling layer module and the global control module; and storing the input binarized grayscale picture and each layer's neural network training parameters into the neural network parameter storage module;
step 6, defining the current layer number in the binary convolution neural network as k, and initializing k as 1;
step 7, for the k-th layer calculation of the binary convolution neural network, the global control module sends the first matrix-generation start signal to the matrix generator module to launch the first sliding calculation over several rows of the input feature map; if k = 1, the input feature map is the input binarized grayscale picture, and if k > 1, it is the intermediate feature map produced by the previous layer;
step 8, after receiving the matrix generation start signal, the matrix generator module processes according to the type of the current sliding calculation:
if the type of the current sliding calculation is upper-edge calculation, on the rising edge of the first period the data of the first row of the input feature map under the current sliding calculation is read out of the neural network parameter storage module and stored into the first register of the matrix generator's own data register group; on the rising edge of the second period the data in the first register is moved into the second register, and the data of the second row of the input feature map under the current sliding calculation is read out of the neural network parameter storage module and stored into the first register;
if the type of the current sliding calculation is non-edge calculation, on the rising edge of the first period the data of the first row of the input feature map under the current sliding calculation is read out of the neural network parameter storage module and stored into the first register; on the rising edge of the second period the data in the first register is moved into the second register, and the data of the second row is read out of the neural network parameter storage module and stored into the first register; on the rising edge of the third period the data in the second register is moved into the third register, the data in the first register is moved into the second register, and the data of the third row is read out of the neural network parameter storage module and stored into the first register;
if the type of the current sliding calculation is lower-edge calculation, on the rising edge of the first period the data of the third-to-last row of the input feature map held in the third register is shifted out, the data of the second-to-last row held in the second register is moved into the third register, and the data of the last row held in the first register is moved into the second register;
step 9, for the k-th layer calculation of the binary convolution neural network, the convolution calculation array module takes the data output by the matrix generator module under the current sliding calculation as input data, reads the neural network training parameters for the current sliding calculation from the neural network parameter storage module, and processes according to the type of the current sliding calculation to obtain the XNOR results:
if the type of the current sliding calculation is upper-edge and left-edge calculation, XNOR bits [4:5] of the weight parameter with bits [0:1] of the data given by the second register, and XNOR bits [7:8] of the weight parameter with bits [0:1] of the data given by the first register;
if the type of the current sliding calculation is upper-edge and non-left/right-edge calculation, XNOR bits [3:5] of the weight parameter with bits [a:a+2] of the feature-map data given by the second register, and XNOR bits [6:8] of the weight parameter with bits [a:a+2] of the data given by the first register, where 0 ≤ a ≤ M;
if the type of the current sliding calculation is upper-edge and right-edge calculation, XNOR bits [3:4] of the weight parameter with bits [M-1:M] of the data given by the second register, and XNOR bits [6:7] of the weight parameter with bits [M-1:M] of the data given by the first register;
if the type of the current sliding calculation is non-upper/lower-edge and left-edge calculation, XNOR bits [1:2] of the weight parameter with bits [0:1] of the data given by the third register, XNOR bits [4:5] of the weight parameter with bits [0:1] of the data given by the second register, and XNOR bits [7:8] of the weight parameter with bits [0:1] of the data given by the first register;
if the type of the current sliding calculation is non-upper/lower-edge and non-left/right-edge calculation, XNOR bits [0:2] of the weight parameter with bits [c:c+2] of the data given by the third register, XNOR bits [3:5] of the weight parameter with bits [c:c+2] of the data given by the second register, and XNOR bits [6:8] of the weight parameter with bits [c:c+2] of the data given by the first register, where 0 ≤ c ≤ M;
if the type of the current sliding calculation is non-upper/lower-edge and right-edge calculation, XNOR bits [0:1] of the weight parameter with bits [M-1:M] of the data given by the third register, XNOR bits [3:4] of the weight parameter with bits [M-1:M] of the data given by the second register, and XNOR bits [6:7] of the weight parameter with bits [M-1:M] of the data given by the first register;
if the type of the current sliding calculation is lower-edge and left-edge calculation, XNOR bits [1:2] of the weight parameter with bits [0:1] of the data given by the third register, and XNOR bits [4:5] of the weight parameter with bits [0:1] of the data given by the second register;
if the type of the current sliding calculation is lower-edge and non-left/right-edge calculation, XNOR bits [0:2] of the weight parameter with bits [b:b+2] of the data given by the third register, and XNOR bits [3:5] of the weight parameter with bits [b:b+2] of the data given by the second register, where 0 ≤ b ≤ M;
if the type of the current sliding calculation is lower-edge and right-edge calculation, XNOR bits [0:1] of the weight parameter with bits [M-1:M] of the data given by the third register, and XNOR bits [3:4] of the weight parameter with bits [M-1:M] of the data given by the second register;
step 10, counting the '1's in the XNOR results to obtain a counting result;
step 11, accumulating the counting result with the batch normalization parameter using formula (2) to obtain an accumulated result Y_2, whose sign bit is taken to give the intermediate result;
Y_2 = popcount(X_{0,1}) − C        (2)
in formula (2), X_{0,1} is a single input value in binarized (0/1) form, and popcount(X_{0,1}) denotes the number of '1's in X_{0,1};
step 12, taking the intermediate result as the output feature map of the k-th binarized convolution layer, and pooling that output feature map with the k-th pooling layer to obtain the k-th layer pooling result;
step 13, taking the k-th layer pooling result as the input feature map of the (k+1)-th binarized convolution layer, assigning k+1 to k, and judging whether k > K holds; if so, the output result of the K-th pooling layer is the input feature map of the fully-connected classification output layer;
and step 14, processing the input feature map of the fully-connected classification output layer through steps 7 to 11; the intermediate result so obtained is the hardware accelerator's recognition result for the digit in the binarized grayscale picture.
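Putting claim 2 together, the layer loop of steps 6 to 13 can be restated in a few lines of Python. This is a plain-software sketch for readability only: one weight word and one folded threshold C per layer are assumed inputs, out-of-range window taps are skipped (mirroring the narrowed edge XNORs of step 9, though the per-edge value of C would then differ slightly), and pool2x2_binary reuses the pooling sketch from the description above.

```python
def xnor_conv_layer(fmap, weights, c_threshold):
    # fmap: 2-D list of 0/1 pixels; weights: 9-bit int, bit 3*i+j holding
    # the window tap at row offset i-1, column offset j-1.
    R, C = len(fmap), len(fmap[0])
    out = []
    for r in range(R):
        row = []
        for c in range(C):
            ones = 0
            for i in range(3):
                for j in range(3):
                    rr, cc = r + i - 1, c + j - 1
                    if 0 <= rr < R and 0 <= cc < C:
                        w = (weights >> (3 * i + j)) & 1
                        ones += 1 - (w ^ fmap[rr][cc])  # XNOR of one tap
            row.append(1 if ones - c_threshold >= 0 else 0)  # formula (2)
        out.append(row)
    return out

def run_accelerator(picture, weights_per_layer, thresholds_per_layer, K):
    fmap = picture
    for k in range(K):  # steps 6 and 13: loop over the K layers
        fmap = xnor_conv_layer(fmap, weights_per_layer[k],
                               thresholds_per_layer[k])   # steps 7-11
        fmap = pool2x2_binary(fmap)                       # step 12
    return fmap  # feeds the fully-connected classification output layer
```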
CN201911055498.0A 2019-10-31 2019-10-31 Hardware accelerator applied to binary convolution neural network and data processing method thereof Active CN110780923B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911055498.0A CN110780923B (en) 2019-10-31 2019-10-31 Hardware accelerator applied to binary convolution neural network and data processing method thereof


Publications (2)

Publication Number Publication Date
CN110780923A true CN110780923A (en) 2020-02-11
CN110780923B CN110780923B (en) 2021-09-14

Family

ID=69388303

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911055498.0A Active CN110780923B (en) 2019-10-31 2019-10-31 Hardware accelerator applied to binary convolution neural network and data processing method thereof

Country Status (1)

Country Link
CN (1) CN110780923B (en)



Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180032844A1 (en) * 2015-03-20 2018-02-01 Intel Corporation Object recognition based on boosting binary convolutional neural network features
CN106875011A (en) * 2017-01-12 2017-06-20 南京大学 The hardware structure and its calculation process of two-value weight convolutional neural networks accelerator
CN106909970A (en) * 2017-01-12 2017-06-30 南京大学 A kind of two-value weight convolutional neural networks hardware accelerator computing module based on approximate calculation
CN108197699A (en) * 2018-01-05 2018-06-22 中国人民解放军国防科技大学 Debugging module for convolutional neural network hardware accelerator
CN108765449A (en) * 2018-05-16 2018-11-06 南京信息工程大学 A kind of image background segmentation and recognition methods based on convolutional neural networks
CN109784488A (en) * 2019-01-15 2019-05-21 福州大学 A kind of construction method of the binaryzation convolutional neural networks suitable for embedded platform
CN109784489A (en) * 2019-01-16 2019-05-21 北京大学软件与微电子学院 Convolutional neural networks IP kernel based on FPGA
CN109948774A (en) * 2019-01-25 2019-06-28 中山大学 Neural network accelerator and its implementation based on network layer binding operation
CN110188795A (en) * 2019-04-24 2019-08-30 华为技术有限公司 Image classification method, data processing method and device
CN110263925A (en) * 2019-06-04 2019-09-20 电子科技大学 A kind of hardware-accelerated realization framework of the convolutional neural networks forward prediction based on FPGA
CN110263920A (en) * 2019-06-21 2019-09-20 北京石油化工学院 Convolutional neural networks model and its training method and device, method for inspecting and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LUO CONG: "Research and Design of a Special-Purpose Convolutional Neural Network Inference Accelerator Based on the YOLO Face-Detection Algorithm", China Masters' Theses Full-Text Database, Information Science and Technology *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582451B (en) * 2020-05-08 2022-09-06 中国科学技术大学 Image recognition interlayer parallel pipeline type binary convolution neural network array architecture
CN111582451A (en) * 2020-05-08 2020-08-25 中国科学技术大学 Image recognition interlayer parallel pipeline type binary convolution neural network array architecture
CN111797977A (en) * 2020-07-03 2020-10-20 西安交通大学 Accelerator structure for binarization neural network and cyclic expansion method
CN111860819A (en) * 2020-07-27 2020-10-30 南京大学 Splicing and segmentable full-connection neural network reasoning accelerator and acceleration method thereof
CN111860819B (en) * 2020-07-27 2023-11-07 南京大学 Spliced and sectionable full-connection neural network reasoning accelerator and acceleration method thereof
CN112101537A (en) * 2020-09-17 2020-12-18 广东高云半导体科技股份有限公司 CNN accelerator and electronic device
CN112101537B (en) * 2020-09-17 2021-08-03 广东高云半导体科技股份有限公司 CNN accelerator and electronic device
CN112906886A (en) * 2021-02-08 2021-06-04 合肥工业大学 Result-multiplexing reconfigurable BNN hardware accelerator and image processing method
CN113435590A (en) * 2021-08-27 2021-09-24 之江实验室 Edge calculation-oriented searching method for heavy parameter neural network architecture
CN113435590B (en) * 2021-08-27 2021-12-21 之江实验室 Edge calculation-oriented searching method for heavy parameter neural network architecture
CN114781629A (en) * 2022-04-06 2022-07-22 合肥工业大学 Hardware accelerator of convolutional neural network based on parallel multiplexing and parallel multiplexing method
CN114781629B (en) * 2022-04-06 2024-03-05 合肥工业大学 Hardware accelerator of convolutional neural network based on parallel multiplexing and parallel multiplexing method
CN115660057A (en) * 2022-12-13 2023-01-31 至讯创新科技(无锡)有限公司 Control method for realizing convolution operation of NAND flash memory

Also Published As

Publication number Publication date
CN110780923B (en) 2021-09-14

Similar Documents

Publication Publication Date Title
CN110780923B (en) Hardware accelerator applied to binary convolution neural network and data processing method thereof
US10096134B2 (en) Data compaction and memory bandwidth reduction for sparse neural networks
CN108364064B (en) Method, device and system for operating neural network
CN111860693A (en) Lightweight visual target detection method and system
WO2019060670A1 (en) Compression of sparse deep convolutional network weights
CN111242844B (en) Image processing method, device, server and storage medium
CN110991608B (en) Convolutional neural network quantitative calculation method and system
CN113344179B (en) IP core of binary convolution neural network algorithm based on FPGA
CN109886391B (en) Neural network compression method based on space forward and backward diagonal convolution
CN111814973B (en) Memory computing system suitable for neural ordinary differential equation network computing
CN112257844B (en) Convolutional neural network accelerator based on mixed precision configuration and implementation method thereof
CN111240746B (en) Floating point data inverse quantization and quantization method and equipment
CN109711442B (en) Unsupervised layer-by-layer generation confrontation feature representation learning method
CN113741858B (en) Memory multiply-add computing method, memory multiply-add computing device, chip and computing equipment
CN114973011A (en) High-resolution remote sensing image building extraction method based on deep learning
CN110910434B (en) Method for realizing deep learning parallax estimation algorithm based on FPGA (field programmable Gate array) high energy efficiency
CN113792621B (en) FPGA-based target detection accelerator design method
CN113888505A (en) Natural scene text detection method based on semantic segmentation
CN111914993B (en) Multi-scale deep convolutional neural network model construction method based on non-uniform grouping
WO2023115814A1 (en) Fpga hardware architecture, data processing method therefor and storage medium
CN115640833A (en) Accelerator and acceleration method for sparse convolutional neural network
Liu et al. Low computation and high efficiency Sobel edge detector for robot vision
CN113743593A (en) Neural network quantization method, system, storage medium and terminal
CN114494284A (en) Scene analysis model and method based on explicit supervision area relation
CN114419630A (en) Text recognition method based on neural network search in automatic machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant