CN110543939A - hardware acceleration implementation framework for convolutional neural network backward training based on FPGA - Google Patents

Hardware acceleration implementation framework for convolutional neural network backward training based on FPGA

Info

Publication number
CN110543939A
Authority
CN
China
Prior art keywords
module
error
input
weight
systolic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910504155.1A
Other languages
Chinese (zh)
Other versions
CN110543939B (en)
Inventor
黄圳
何春
李玉柏
王坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201910504155.1A priority Critical patent/CN110543939B/en
Publication of CN110543939A publication Critical patent/CN110543939A/en
Application granted granted Critical
Publication of CN110543939B publication Critical patent/CN110543939B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a hardware acceleration implementation framework for convolutional neural network backward training based on an FPGA. The architecture is built from the basic processing modules of each layer of backward training of the convolutional neural network and comprehensively considers operation processing time and resource consumption. Following the principle of achieving the highest possible parallelism with the lowest possible resource consumption, it realizes the backward training process of the Hcnn convolutional neural network in a parallel pipelined form by means of parallel-serial conversion, data fragmentation, pipelined design and resource reuse. The framework makes full use of the data parallelism and pipeline parallelism of the FPGA; it is simple to implement, its structure is regular, its routing is consistent, the clock frequency is greatly improved, and the acceleration effect is significant. More importantly, the structure balances IO reads and writes against computation with an optimized systolic array structure, improving throughput while consuming less storage bandwidth, and effectively easing the mismatch between data access speed and data processing speed in FPGA implementations of convolutional neural networks.

Description

Hardware acceleration implementation framework for convolutional neural network backward training based on FPGA
Technical Field
The invention relates to the field of deep learning, one of the important development directions of artificial intelligence, and in particular to a hardware acceleration implementation framework for convolutional neural network backward training based on an FPGA.
Background
In recent years, the field of artificial intelligence, especially machine learning, has achieved breakthroughs in both theory and application. Deep learning is one of the most important development directions of machine learning; it can learn features at multiple levels of abstraction and therefore performs excellently on complex, abstract learning problems. However, as problems become more complex and abstract, deep learning network models also become more complex and their training time grows. For example, Google's AlphaGo uses a multi-layer neural network containing thousands of neurons; even computed on a cluster of approximately 20000 processors, the learning process for recognizing complex images takes seven to eight days. The excellent results of deep learning on complex abstract learning problems therefore rest on heavy computation and huge training data, and research on fast, low-power deep learning acceleration has gradually become a trend.
Compared with the CPU, GPU and ASIC, the FPGA offers high speed, low power consumption, stability, extremely low latency, suitability for streaming compute-intensive and communication-intensive tasks, flexibility, short development cycles, low cost and portability in deep learning algorithm acceleration. The FPGA is therefore a good choice for deep learning acceleration, but the specific structures for implementing deep learning algorithms on FPGAs have so far received little study, problems such as insufficient storage bandwidth remain, and there is still considerable room to improve the acceleration effect.
The convolutional neural network is one of the most common and important deep learning algorithms and has achieved breakthrough results in applications such as speech and image recognition. The backward training process is the most important part of convolutional neural network learning.
Disclosure of Invention
The technical problem the invention aims to solve is to overcome the shortcomings of existing convolutional neural network backward training and to provide a backward-training convolutional neural network Hcnn structure suitable for implementation on a small FPGA.
The technical scheme adopted by the invention to solve this problem is a hardware acceleration implementation architecture for convolutional neural network backward training based on an FPGA, composed of error term transfer modules and parameter updating modules; the backward training algorithm of the convolutional neural network is the back propagation algorithm. The fully-connected layers are divided into an output layer and a hidden layer. The error transfer module of hidden-layer training mainly consists of multiply-accumulators and a valid-control module, and the parameter updating module of hidden-layer training mainly consists of an acquisition module, multipliers and an updating module. The pooling layers have only error transfer modules and no parameter updating modules; the error transfer module of a pooling layer consists of multiply-accumulators, a serial-parallel conversion module, a 4× upsampling module and a control module. A systolic array structure is the main component of the error transfer module and the parameter updating module of convolutional-layer training, and the main matrix convolution operations of convolutional-layer training are realized by this systolic array structure.
The invention has the advantages that, starting from the basic processing modules of each layer of convolutional neural network training and comprehensively considering operation processing time and resource consumption, the backward training process of the Hcnn convolutional neural network is realized in a parallel pipelined form by means of parallel-serial conversion, data fragmentation, pipelined design and resource reuse, following the principle of the highest possible parallelism with the lowest possible resource consumption. The structure makes full use of the data parallelism and pipeline parallelism of the FPGA: it is simple to implement, its structure is regular, its routing is consistent, the clock frequency is greatly improved, and the acceleration effect is significant. More importantly, the structure balances IO reads and writes against computation with an optimized systolic array structure, improving throughput while consuming less storage bandwidth, and effectively easing the mismatch between data access speed and data processing speed in FPGA implementations of convolutional neural networks.
Drawings
FIG. 1 is a diagram of a convolutional neural network Hcnn structure;
FIG. 2 is a general block diagram of the hardware architecture of the Hcnn backward training process;
FIG. 3 is a diagram of a fully-connected output layer parameter update hardware architecture;
FIG. 4 is a diagram of a parameter update hardware architecture for a fully-connected hidden layer;
FIG. 5 is a hardware architecture diagram of the convolutional layer error transfer basic operation unit;
FIG. 6 is an internal structure diagram of Systolic_pe;
FIG. 7 is a hardware architecture of the convolution layer CONV2 weight updating process;
FIG. 8 is a hardware architecture of a convolution layer weight update basic operation unit;
FIG. 9 is a Modelsim simulation diagram of the final result of the backward training process;
FIG. 10 is a resource consumption graph in example one;
FIG. 11 is a comparison graph of the speed and power consumption of the FPGA against the CPU and GPU.
Detailed Description
The present invention is described in further detail below with reference to specific implementation architectures. It should be understood that the scope of the claimed subject matter is not limited to the following examples; any technique implemented based on the disclosure of the present invention falls within the scope of the present invention.
The specific structure of the convolutional neural network Hcnn is shown in fig. 1. Hcnn contains 2 convolutional layers, 2 pooling layers and 2 fully-connected layers; the convolution kernels are all 5 × 5, pooling is 2 × 2 max pooling, and valid (no zero) padding is used. The original input feature data of Hcnn is the MNIST handwritten digit data set most commonly used with convolutional neural networks; its original image size is 1 × 28 × 28, and it comprises 60000 training samples and 10000 test samples. Classification labels are finally output after passing through convolutional layer 1, pooling layer 1, convolutional layer 2, pooling layer 2, fully-connected layer 1 and fully-connected layer 2.
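The layer dimensions above can be checked directly. The following minimal sketch (ours, not part of the patent) reproduces the feature-map sizes under 5 × 5 valid-padding convolutions and 2 × 2 max pooling:

```python
# Minimal sketch (ours): feature-map sizes of Hcnn under valid padding.
def conv_out(n, k=5):        # valid padding: output = input - kernel + 1
    return n - k + 1

def pool_out(n, p=2):        # non-overlapping 2x2 max pooling halves each side
    return n // p

n = 28                       # MNIST input: 1 x 28 x 28
n = conv_out(n)              # CONV1 output: 24 x 24
n = pool_out(n)              # POOL1 output: 12 x 12
n = conv_out(n)              # CONV2 output: 8 x 8
n = pool_out(n)              # POOL2 output: 4 x 4
print(n * n * 4)             # 64: with 4 feature maps, the input width
                             # of the fully-connected layers
```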
The back propagation algorithm is the backward training algorithm of the convolutional neural network and comprises the following steps:
a. Determine the loss function of the convolutional neural network as the cross-entropy function:
$E_d = -\sum_i t_i \ln y_i$
where Ed is the loss function, i.e. the error, and yi and ti represent the predicted output value and the correct label, respectively, of output neuron i.
b. Calculate the output value aj of each neuron;
c. Calculate the error term δj of each neuron;
d. Calculate the gradients of each neuron's connection weights ωji and bias bj.
e. Update the parameters with the gradient descent algorithm:
$\omega'_{ji} = \omega_{ji} - \eta \, \partial E_d / \partial \omega_{ji}$
where ω'ji is the updated weight, ωji is the weight before updating, η is the step length (learning rate), and ∂Ed/∂ωji is the gradient of the loss with respect to ωji.
In this way the weights and bias terms are updated step by step toward the values at which the loss function reaches its minimum.
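As a point of reference, steps a to e map onto a few lines of NumPy. The sketch below is illustrative only; the sigmoid output and the array shapes are our assumptions, not the patent's fixed-point design:

```python
import numpy as np

def backward_training_step(x, w, b, t, eta=0.001):
    net = w @ x + b                  # weighted input net_j
    y = 1.0 / (1.0 + np.exp(-net))   # b. output value of each neuron
    Ed = -np.sum(t * np.log(y))      # a. cross-entropy loss E_d
    delta = t - y                    # c. error term of the output layer
    grad_w = -np.outer(delta, x)     # d. gradient of each weight w_ji
    grad_b = -delta                  #    gradient of each bias b_j
    w = w - eta * grad_w             # e. gradient descent update
    b = b - eta * grad_b
    return Ed, delta, w, b
```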
In summary, the backward training process of the convolutional neural network divides into an error transfer process and a parameter update process, so the hardware acceleration implementation architecture of backward training is likewise composed of error term transfer modules and parameter updating modules. The architecture takes the error transfer process as the main line, comprehensively considers operation processing time and resource consumption, controls parallelism by data fragmentation and parallel-serial conversion, saves resources by methods such as resource multiplexing, and realizes the backward training process of the Hcnn convolutional neural network in a parallel pipelined form.
The general block diagram of the hardware architecture of the backward training process, as shown in fig. 2, includes: a fully-connected output layer error transfer module, a fully-connected output layer parameter updating module, a fully-connected hidden layer error transfer module, a fully-connected hidden layer parameter updating module, a pooling layer 2 error transfer module, a convolutional layer 2 error transfer module, a convolutional layer 2 parameter updating module, a pooling layer 1 error transfer module and a convolutional layer 1 parameter updating module;
The error transfer formula of the output layer is:
$\delta_j = -\partial E_d / \partial net_j = t_j - y_j$
where δj is the j-th value of the error term of this layer, netj is the weighted input of node j, tj is the correct label of node j, and yj is the result computed for node j in the forward pass.
The parameter update formula is:
$\omega'_{ji} = \omega_{ji} + \eta \, \delta_j x_i$
where η is the learning rate and xi is the forward input value.
Therefore the hardware structure of the error transfer process of the fully-connected output layer needs only one adder. One input end of the adder receives the label obtained in the forward training process, the other input end receives the negative of the result produced by the last fully-connected layer in the forward pass, and the output end of the adder is connected to the input ends of the fully-connected output layer parameter updating module and the fully-connected hidden layer error transfer module.
The hardware structure of the parameter update process of the fully-connected output layer is shown in fig. 3. The fully-connected output layer parameter updating module comprises a weight gradient calculation module, a weight updating module and a bias term updating module. Error data enter the weight gradient calculation module and the bias term updating module simultaneously; the weight gradient calculation module multiplies the error data δj by the forward input features xi to obtain the weight gradients and outputs them to the weight updating module. The weight updating module left-shifts its input according to the learning rate, adds the weight ωji read from the weight update RAM to obtain the new, updated weight, and stores it back into the weight update RAM. For the bias term updating module, the gradient of the bias term equals the error term δj and is therefore not multiplied by the forward input feature xi; the module left-shifts its input according to the learning rate, adds the original bias term bj read from the bias update RAM to obtain the new, updated bias term, and stores it back into the bias update RAM. In this embodiment the learning-rate multiplication is a 2-bit left shift: with η = 0.001 and the fixed-point format fi(1,18,12), i.e. 12 fractional bits, η quantizes to round(0.001 × 2^12) = 4 in the FPGA implementation, and multiplying by 4 is a left shift by 2 bits.
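The shift-based learning-rate multiplication can be sketched as follows; this illustrates the quantization argument above on integer representations in fi(1,18,12) and is not the patent's RTL:

```python
FRAC_BITS = 12                               # fi(1,18,12): 12 fractional bits
ETA_FIXED = round(0.001 * (1 << FRAC_BITS))  # eta quantizes to 4

def lr_scale(grad_fixed):
    # eta * grad = grad * 4 / 2^12: left shift by 2, then drop the
    # extra 12 fractional bits of the product to stay in fi(1,18,12).
    return (grad_fixed << 2) >> FRAC_BITS

def update_weight(w_fixed, grad_fixed):
    return w_fixed + lr_scale(grad_fixed)    # the weight updating module
```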
In training the weights of the hidden layer of the fully-connected layers, the parameter update formula is the same as for the output layer, but the error transfer formula is:
$\delta_j = f'(net_j) \sum_{k \in Downstream(j)} \omega_{kj} \delta_k$
where k ∈ Downstream(j) ranges over all downstream nodes of node j and f' is the derivative of the activation function.
The fully-connected hidden layer error transfer module comprises 12 multiply-accumulators (N = 12, matching the number of output neurons of the hidden layer), the weight RAM of fully-connected layer 2, a control data RAM and a parallel-serial conversion unit. The errors from the fully-connected output layer error transfer module enter the 12 multiply-accumulators simultaneously; the weight RAM of fully-connected layer 2 is connected to the 12 multiply-accumulators, which multiply-accumulate the input errors with the corresponding weights to obtain 12 paths of data of effective length 1, output in parallel to the valid-control module. The control data RAM is connected to the valid-control module, which fetches the corresponding 12 paths of control data: when the control data netj ≥ 0 the output of that path equals its input, otherwise the invalid output of that path is set to 0. The 12 outputs of the valid-control module, i.e. the error of this layer, are connected to the parallel-serial conversion unit and to the fully-connected hidden layer parameter updating module; the parallel-serial conversion unit merges the 12 error paths into 1 path and outputs it to the pooling layer 2 error transfer module. That is, the error δk of the previous layer enters the 12 multiply-accumulators simultaneously, is multiply-accumulated with the corresponding weight data ωkj into 12 paths of data of effective length 1, and passes through the valid-control module to yield the error δj of this layer.
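Functionally, the 12 multiply-accumulators and the valid-control module compute the following (a NumPy stand-in; the 10-wide output layer is our assumption for MNIST classification):

```python
import numpy as np

def hidden_layer_error(delta_k, w_kj, net_j):
    acc = w_kj.T @ delta_k            # 12 MACs: sum_k w_kj * delta_k
    # valid control: pass the value where net_j >= 0, force 0 otherwise
    # (the stored net_j acts as the ReLU-style derivative f'(net_j))
    return np.where(net_j >= 0, acc, 0.0)

delta_k = np.random.randn(10)         # errors from the output layer
w_kj = np.random.randn(10, 12)        # fully-connected layer 2 weights
net_j = np.random.randn(12)           # control data saved by the forward pass
delta_j = hidden_layer_error(delta_k, w_kj, net_j)   # error of this layer
```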
The hardware structure for updating the parameters of the fully-connected hidden layer is shown in fig. 4 and comprises an upsampling module, a multiplier group, a weight updating module and a bias term updating module. The 12 paths of error data δj enter the 64× upsampling module and the bias term updating module; the bias term updating module reads the bias term bj from the bias update RAM, adds the error data left-shifted according to the learning rate, and stores the updated bias term bj back into the bias update RAM. The 12 paths of error data δj are upsampled 64× (each error value is held for 64 clk) and output to 12 parallel multipliers, where they are multiplied by the 12 groups of forward input features xi of effective length 64 from the pooling layer 2 RAM, giving 12 paths of gradient data of effective length 64 that are output to the weight updating module. The weight updating module left-shifts its input according to the learning rate, adds the original weight ωji read from the weight update RAM to obtain the new, updated weight, and stores it back into the weight update RAM.
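Holding each of the 12 errors for 64 clk while streaming the 64 forward inputs is, in matrix terms, an outer product. A brief sketch of the equivalent computation (floating point for clarity, where the hardware uses the learning-rate shift):

```python
import numpy as np

def fc_hidden_update(w, b, delta_j, x, eta=0.001):
    grad_w = np.outer(delta_j, x)   # 12 x 64: each error row times 64 inputs
    w = w + eta * grad_w            # weight updating module
    b = b + eta * delta_j           # bias gradient equals the error term
    return w, b
```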
In the training of the pooling layers of the convolutional neural network, only the error term needs to be transferred to the previous layer; there is no gradient calculation. In max pooling, the error value δl of the next layer is transferred in full to the position of the maximum value of the corresponding region in the previous layer's feature map, while all other error terms are 0.
The pooling layer 2 error transfer module comprises K1 multiply-accumulators (K1 matches the number of input neurons of the hidden layer; here K1 = 64), a serial-parallel conversion unit, a 4× upsampling module, a control module, the weight RAM of fully-connected layer 1 and the max-valid matrix 1 RAM. The serial error data from the fully-connected hidden layer error transfer module enter the 64 multiply-accumulators and are multiply-accumulated with the 64 groups of upper-layer weights ωji from the weight RAM of fully-connected layer 1 to obtain 64 paths of error data of effective length 1, which are output to the serial-parallel conversion module. The serial-parallel conversion module converts them into 4 paths of error data of effective length 16 and outputs them to the 4× upsampling module, which holds each error value for 4 clk to ease the subsequent operations; the data then enter the control module. To use resources reasonably, the control module aligns the max-valid matrix (4 paths of data of effective length 16) output by the max-valid matrix 1 RAM with the collected error data: where the max-valid matrix data is 1 the error data is kept unchanged, and where it is 0 the error data is set to 0. The control module outputs 4 paths of error data of effective length 64 to the convolutional layer 2 error transfer module and the convolutional layer 2 parameter updating module. Since the output feature image size of pooling layer 2 is 4 × 4 × 4 = 64, the serial error data δj of the upper layer enter all 64 multiply-accumulators simultaneously.
It should be noted that the max-valid matrix is obtained during the forward pass of the network model and has the same size as the input of the pooling layer (the output of the convolutional layer): the position whose value was the maximum of its 2 × 2 pooling matrix, i.e. the output of the pooling operation, is set to 1, and the other three positions are set to 0.
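The combined effect of the 4× upsampling and the max-valid mask is the standard max-pooling routing of errors; a small sketch (the array layout is our assumption):

```python
import numpy as np

def unpool_error(delta, max_valid):
    # repeat each error over its 2x2 window (the 4x upsampling) ...
    up = delta.repeat(2, axis=0).repeat(2, axis=1)
    # ... and keep it only at the forward-pass maximum position
    return up * max_valid

delta = np.arange(4.0).reshape(2, 2)          # next layer's error terms
max_valid = np.zeros((4, 4))
max_valid[0, 1] = max_valid[1, 3] = 1         # exactly one '1'
max_valid[2, 0] = max_valid[3, 2] = 1         # per 2x2 window
print(unpool_error(delta, max_valid))
```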
The convolutional layer 2 error transfer module comprises M2 parallel groups of M1 convolutional layer error transfer basic operation units CONV_Transfer_PE, M2 adders and the weight RAM of convolutional layer 2. The M1 paths of error data from the pooling layer 2 error transfer module enter each group of M1 units; the convolutional layer error transfer basic operation unit CONV_Transfer_PE performs the cross-correlation of the input error data with the weights from the weight RAM of convolutional layer 2. The outputs of the M1 units of the m-th group are connected to the inputs of the m-th adder, m = 1, 2, …, M2, and the error matrices output by the M2 adders are sent to the pooling layer 1 error transfer module. The transfer of the error term in convolutional-layer training surrounds the layer-l error matrix with a ring of zeros and cross-correlates it with the filter rotated by 180°; convolution is equivalent to cross-correlation with the filter rotated by 180°. M1 equals the number of feature maps output by convolutional layer 2 and M2 equals the number of feature maps output by pooling layer 1. The error transfer formula of a convolutional layer whose input depth is D is therefore:
$\delta^{l-1} = \sum_{d=1}^{D} \delta_d^{l} * W_d^{l} \circ f'(net^{l-1})$
where δ^{l-1} is the error term of layer l-1, δd^l is the error term of depth d of layer l, Wd^l is the layer-l weight of depth d after the 180° rotation, f is the activation function, net^{l-1} is the weighted input of the neurons of layer l-1, * denotes convolution and ∘ denotes element-wise multiplication.
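A behavioral sketch of this formula for one depth d (padding by a ring of k − 1 zeros gives the full convolution; the ReLU-style derivative stands in for f', matching the valid-control behavior described for the fully-connected layers):

```python
import numpy as np

def conv_error_transfer(delta_l, w, net_prev):
    k = w.shape[0]                       # 5x5 kernel here
    padded = np.pad(delta_l, k - 1)      # ring of zeros around the errors
    w_rot = np.rot90(w, 2)               # 180-degree rotation of the filter
    n = padded.shape[0] - k + 1
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = np.sum(padded[i:i+k, j:j+k] * w_rot)
    return out * (net_prev >= 0)         # f'(net) gating

delta_l = np.random.randn(8, 8)          # CONV2 error matrix (8x8)
w = np.random.randn(5, 5)
net_prev = np.random.randn(12, 12)       # layer l-1 weighted inputs (12x12)
delta_prev = conv_error_transfer(delta_l, w, net_prev)
# one depth-d term; the module's adders sum such terms over d = 1..D
```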
As shown in fig. 5, the convolutional layer error transfer basic operation unit CONV_Transfer_PE here has a convolution kernel size of 5 × 5 and an input error matrix size L = 8. It is realized with a shift-register-based serial matrix conversion structure and comprises 1 zero-padding module (which realizes the zero padding of the input error matrix with a dual-port RAM that can be read and written simultaneously), a systolic array of 25 systolic processing units Systolic_pe (L1 = 5) realizing the cross-correlation of X and W, L1-1 shift registers of depth matrix_len = L + 8 = 16, 1 adder and 1 valid-control unit. In the systolic array the L1 × L1 processing units are arranged in L1 rows and L1 columns; the processing units Systolic_pe of each row are connected in series, the first and second output ends of the units in columns 1 to L1-1 of each row being connected to the first and second input ends, respectively, of the next processing unit Systolic_pe. The L1-1 shift registers are connected end to end, the output end of each shift register being connected to the second input end of the next; the output end of the l-th shift register is also connected to the second input end of the first-column processing unit Systolic_pe of row l + 1 of the systolic array, l = 1, …, L1-1. According to the size of the received error matrix, the zero-padding module outputs control parameters to the control ends of the L1-1 shift registers to set their depth and, after completing the zero-padding of the error matrix, outputs the padded data to the second input end of the processing unit Systolic_pe of row 1, column 1 and to the input end of the first shift register. The first input end of each processing unit Systolic_pe in column 1 is 0. The first output end of the last-column processing unit Systolic_pe of each row is connected to the input end of the adder, which sums the L1 inputs and outputs the result to the valid-control unit; the valid-control unit eliminates the invalid operation matrices, completing the operation of the convolutional layer error transfer process.
The internal structure of the systolic processing unit Systolic_pe is shown in fig. 6 and comprises 1 adder, 1 multiplier and 2 registers. The first input end of the processing unit Systolic_pe is connected to the input end of the first register and to one input end of the multiplier, whose other input end receives the input weight; one input end of the adder is connected to the output end of the multiplier, the other input end of the adder is connected to the second input end of the Systolic_pe, and the output end of the adder is connected to the input end of the second register. The output end of the first register is connected to the first output end of the processing unit Systolic_pe, and the output end of the second register to its second output end. It should be noted that the weight W fixed inside the Systolic_pe is the weight rearranged after the 180° rotation.
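One step of this PE can be modeled directly from the description (register and port names are ours, a sketch rather than the patent's RTL):

```python
class SystolicPe:
    """Weight-stationary PE: W is fixed inside the unit (already
    rearranged by the 180-degree rotation)."""
    def __init__(self, w):
        self.w = w
        self.data_reg = 0.0        # first register (delays the data word)
        self.sum_reg = 0.0         # second register (holds the partial sum)

    def step(self, data_in, sum_in):
        first_out = self.data_reg                 # first output end
        second_out = self.sum_reg                 # second output end
        self.data_reg = data_in                   # first input -> register
        self.sum_reg = sum_in + self.w * data_in  # multiply, add, register
        return first_out, second_out
```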
The pooling layer 1 error transfer module comprises an upsampling module, a control module and the max-valid matrix 2 RAM. The parallel M2 paths of error data from the convolutional layer 2 error transfer module pass through the upsampling module and are output to the 4 input ends of the control module; the control module aligns the collected error data with the max-valid matrix output by the max-valid matrix 2 RAM, keeping the error data unchanged where the max-valid matrix data is 1 and setting it to 0 where the matrix data is 0. The control module outputs M2 paths of error data to the convolutional layer 1 parameter updating module.
The gradient calculation of the convolutional layer weights uses the error matrix δd,m,n as a convolution kernel and cross-correlates it with the forward input of the convolutional layer:
$\partial E_d / \partial \omega_{d,i,j} = \sum_m \sum_n \delta_{d,m,n} \, a^{l-1}_{i+m,\,j+n}$
The gradient of a convolutional-layer bias term is the sum of all error terms of its error matrix:
$\partial E_d / \partial b_d = \sum_m \sum_n \delta_{d,m,n}$
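Both gradients in one short sketch (shapes follow the CONV2 case: a 12 × 12 forward input, an 8 × 8 error matrix, a 5 × 5 weight gradient):

```python
import numpy as np

def conv_layer_grads(a_prev, delta):
    k = delta.shape[0]                     # error matrix as the kernel (8x8)
    n = a_prev.shape[0] - k + 1            # 12 - 8 + 1 = 5
    grad_w = np.zeros((n, n))              # one entry per 5x5 weight
    for i in range(n):
        for j in range(n):
            grad_w[i, j] = np.sum(a_prev[i:i+k, j:j+k] * delta)
    grad_b = float(delta.sum())            # bias gradient: sum of all errors
    return grad_w, grad_b
```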
The bias-term update process is simple, so the hardware architecture of the weight update process is mainly introduced here. Taking the weight update of convolutional layer CONV2 as an example, the hardware structure shown in fig. 7 consists mainly of convolutional layer weight update basic operation units CONV_update_PE and weight updating modules.
In the weight update of convolutional layer CONV2 there are 4 paths of input feature data a^{l-1}, which must be cross-correlated with the 4 paths of error data δd^l to obtain 16 paths of weight update data, so 16 convolutional layer weight update basic operation units would be needed. Each unit consumes 64 DSPs, so realizing all 16 in parallel would require 1024 DSPs. This consumes excessive resources, and since the error transfer process and the upper-layer parameter updates continue while the CONV2 parameters are updated, the CONV2 parameter update only has to finish before the CONV1 parameter update ends. A resource multiplexing method is therefore adopted: each path of input feature data a^{l-1} shares one basic operation unit and is cross-correlated with the 4 paths of error data δd^l in sequence. This uses only 4 parallel basic operation units and 256 DSPs, reducing resource consumption without increasing the total operation processing time.
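The time-multiplexed schedule can be written down in a few lines (reusing conv_layer_grads from the sketch above): 4 units each make 4 sequential passes, yielding the 16 weight-gradient matrices with a quarter of the DSPs:

```python
def conv2_update_schedule(a_prev_paths, delta_paths):
    grads = {}
    for m, a_prev in enumerate(a_prev_paths):    # 4 parallel CONV_update_PE
        for d, delta in enumerate(delta_paths):  # 4 sequential error paths
            grads[(m, d)], _ = conv_layer_grads(a_prev, delta)
    return grads                                 # 16 gradient matrices
```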
The convolutional layer 2 parameter updating module comprises a convolution weight updating unit and a bias term updating module. The convolution weight updating unit comprises M2 parallel groups, each consisting of M1 convolutional layer weight update basic operation units CONV_update_PE and a weight updating module; the M1 paths of error matrices and the forward input features enter the corresponding group of units, and the convolutional layer weight update basic operation unit CONV_update_PE realizes the matrix convolution of the forward input features with the error matrices.
FIG. 8 shows the convolutional layer weight update basic operation unit CONV_update_PE. It is realized with a shift-register-based serial matrix conversion structure. Since the error matrix δd^l is 8 × 8, CONV_update_PE comprises a systolic array of L2 × L2 (L2 = 8) processing units Systolic_pe, L2-1 = 7 shift registers completing the serial matrix conversion, 1 adder and 1 valid-control unit. In the systolic array the 8 × 8 processing units Systolic_pe are arranged in 8 rows and 8 columns; the units of each row are connected in series, the first and second output ends of the units in columns 1 to 7 of each row being connected to the first and second input ends, respectively, of the next processing unit Systolic_pe. The 7 shift registers are connected end to end, the output end of each being connected to the second input end of the next; the output end of the l-th shift register is also connected to the second input end of the first-column processing unit Systolic_pe of row l + 1 of the systolic array, l = 1, …, 7. The error matrix size is input to the control ends of the 7 shift registers to set their depth, and the forward input features are output to the second input end of the processing unit Systolic_pe of row 1, column 1 and to the input end of the first shift register. The first input end of each processing unit Systolic_pe in column 1 is 0. The first output end of the last-column processing unit Systolic_pe of each row is connected to the input end of the adder, which sums the 8 inputs and outputs the result to the valid-control unit; the valid-control unit eliminates the invalid operation matrices. The unit thus realizes the cross-correlation of a^{l-1} with δd^l to obtain the gradient gradd, which then enters the weight updating module to update the weights.
The systolic processing unit Systolic_pe has the same structure as that of the convolutional layer error transfer basic operation unit CONV_Transfer_PE.
The hardware structure of the parameter update process of convolutional layer CONV1 is similar to that of CONV2, consisting of convolutional layer weight update basic operation units CONV_update_PE and weight updating modules. However, since the error matrix size of this layer is 24 × 24, its CONV_update_PE requires 23 shift registers and 576 processing units Systolic_pe.
The present invention will be further described with reference to the following specific examples.
1. Example one: FPGA simulation and implementation of the convolutional neural network Hcnn backward training process
The simulation platforms used in example one are Matlab R2017b, ISE 14.7 and ModelSim 10.1a, and the implemented architecture is that of fig. 2. First, the Hcnn convolutional neural network fixed-point Matlab simulation of fig. 2 was verified in Matlab R2017b; the accuracy of the model reaches 95.34%. Then the hardware architecture was simulation-verified and implemented in ISE 14.7 and ModelSim 10.1a. In the Matlab fixed-point simulation and the FPGA implementation, most parameters and intermediate register variables use the fixed-point format fi(1,18,12), i.e. 1 sign bit, 5 integer bits and 12 fractional bits.
The Modelsim simulation result of the Hcnn convolutional neural network backward training process in example one is shown in fig. 9. As can be seen from the figure, the data processing time to reach the final output is 821 clk (excluding the time to read the input data). The result is consistent with the Matlab fixed-point simulation result, proving that the designed model functions correctly.
In example one, the Hcnn convolutional neural network backward training process was synthesized for an FPGA of model XC7VX690T-2FFG1157, giving a maximum system clock frequency of 204.50 MHz; the resource consumption is shown in fig. 10. As can be seen from fig. 10, the module's resource consumption is modest and is dominated by 2920 DSP48E1s, since 2920 DSPs are instantiated as multipliers.
2. Example two: speed and power consumption analysis of the implementation model of example one
The simulation platforms used in example two are ISE 14.7 and PyCharm. First, as seen from fig. 9, the model's data processing time is 821 clk (excluding the time to read the input data), so at a clock frequency of 200 MHz the elapsed time is 4105 ns.
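The elapsed time follows directly from the cycle count and the clock period:
$t = 821\ \text{clk} \times \frac{1}{200\ \text{MHz}} = 821 \times 5\ \text{ns} = 4105\ \text{ns}$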
Then, on the simulation platform PyCharm, the computation of the example-one architecture was performed with a CPU of model Intel E3-1230V2@3.30GHz and a GPU of model TitanX; the computation times for processing one sample are 7330 ns on the CPU and 405 ns on the GPU, respectively.
A comparison of the speed and power consumption of the example-one architecture on the FPGA against the CPU and GPU is shown in fig. 11. In terms of speed, the convolutional neural network FPGA architecture achieves roughly a 1.8× improvement over the CPU (4105 ns versus 7330 ns per sample); compared with the GPU there is still a gap, limited by the resources of the FPGA chip, whose maximum parallelism here is only 16. In terms of power consumption, the FPGA is far lower than both the CPU and the GPU.

Claims (4)

1. A hardware acceleration implementation architecture for convolutional neural network backward training based on FPGA, characterized by comprising: a fully-connected output layer error transfer module, a fully-connected output layer parameter updating module, a fully-connected hidden layer error transfer module, a fully-connected hidden layer parameter updating module, a pooling layer 2 error transfer module, a convolutional layer 2 error transfer module, a convolutional layer 2 parameter updating module, a pooling layer 1 error transfer module and a convolutional layer 1 parameter updating module;
The fully-connected output layer error transfer module is 1 adder; one input end of the adder receives the label obtained in the forward training, the other input end receives the negative of the result produced by the last fully-connected layer in the forward training, and the output end of the adder is connected to the input ends of the fully-connected output layer parameter updating module and the fully-connected hidden layer error transfer module;
The fully-connected hidden layer error transfer module comprises N multiply-accumulators, the weight RAM of fully-connected layer 2, a control data RAM and a parallel-serial conversion unit; the errors from the fully-connected output layer error transfer module enter the N multiply-accumulators simultaneously, the weight RAM of fully-connected layer 2 is connected to the N multiply-accumulators, and the N multiply-accumulators multiply-accumulate the input errors with the corresponding weights to obtain N paths of data of effective length 1, output in parallel to a valid-control module; the control data RAM is connected to the valid-control module, which obtains the corresponding N paths of control data from the control data RAM: when the control data is greater than or equal to 0 the output of that path equals its input, otherwise the invalid output of that path is set to 0; the N outputs of the valid-control module, i.e. the error of this layer, are connected to the parallel-serial conversion unit and to the fully-connected hidden layer parameter updating module; the parallel-serial conversion unit merges the N error paths into 1 path and outputs it to the pooling layer 2 error transfer module;
The pooling layer 2 error transfer module comprises K1 multiply-accumulators, a serial-parallel conversion unit, an upsampling module, a control module, the weight RAM of fully-connected layer 1 and a max-valid matrix 1 RAM; the serial error data from the fully-connected hidden layer error transfer module enter the K1 multiply-accumulators and are multiply-accumulated with the K1 groups of weights from the weight RAM of fully-connected layer 1 to obtain K1 paths of error data of effective length 1, which are output to the serial-parallel conversion module; the serial-parallel conversion module converts the input data into M1 paths of error data, which are output to the upsampling module and then enter the control module; the control module aligns the collected error data with the max-valid matrix output by the max-valid matrix 1 RAM, keeping the error data unchanged where the max-valid matrix data is 1 and setting it to 0 where the max-valid data is 0; the control module outputs the M1 paths of error data to the convolutional layer 2 error transfer module and the convolutional layer 2 parameter updating module;
The convolutional layer 2 error transfer module comprises M2 parallel groups of M1 convolutional layer error transfer basic operation units, M2 adders and the weight RAM of convolutional layer 2; the M1 paths of error data from the pooling layer 2 error transfer module enter each group of M1 convolutional layer error transfer basic operation units, the convolutional layer error transfer basic operation units performing the cross-correlation of the input error data with the weights from the weight RAM of convolutional layer 2; the outputs of the M1 units of the m-th group are connected to the inputs of the m-th adder, m = 1, 2, …, M2, and the error matrices output by the M2 adders are sent to the pooling layer 1 error transfer module;
The convolutional layer error transfer basic operation unit is realized with a shift-register-based serial matrix conversion structure and comprises 1 zero-padding module, a systolic array of L1 × L1 processing units Systolic_pe, L1-1 shift registers, 1 adder and 1 valid-control unit; in the systolic array the L1 × L1 processing units are arranged in L1 rows and L1 columns, the processing units Systolic_pe of each row are connected in series, and the first and second output ends of the units in columns 1 to L1-1 of each row are connected to the first and second input ends, respectively, of the next processing unit Systolic_pe; the L1-1 shift registers are connected end to end, the output end of each shift register being connected to the second input end of the next; the output end of the l-th shift register is also connected to the second input end of the first-column processing unit Systolic_pe of row l + 1 of the systolic array, l = 1, …, L1-1; according to the size of the received error matrix, the zero-padding module outputs control parameters to the control ends of the L1-1 shift registers to control their depth and, after completing the zero-padding of the error matrix, outputs the padded data to the second input end of the processing unit Systolic_pe of row 1, column 1 and to the input end of the first shift register; the first input end of each processing unit Systolic_pe in column 1 is 0; the first output end of the last-column processing unit Systolic_pe of each row is connected to the input end of the adder, which sums the L1 inputs and outputs the result to the valid-control unit; the valid-control unit eliminates the invalid operation matrices;
The processing unit Systolic_pe in the convolutional layer error transfer basic operation unit comprises 1 adder, 1 multiplier and 2 registers; the first input end of the processing unit Systolic_pe is connected to the input end of the first register and to one input end of the multiplier, the other input end of the multiplier receiving the input weight; one input end of the adder is connected to the output end of the multiplier, the other input end of the adder is connected to the second input end of the Systolic_pe, and the output end of the adder is connected to the input end of the second register; the output end of the first register is connected to the first output end of the processing unit Systolic_pe, and the output end of the second register to the second output end of the processing unit Systolic_pe;
The pooling layer 1 error transfer module comprises an upsampling module, a control module and a max-valid matrix 2 RAM; the parallel M2 paths of error data from the convolutional layer 2 error transfer module pass through the upsampling module and are output to the 4 input ends of the control module; the control module aligns the collected error data with the max-valid matrix output by the max-valid matrix 2 RAM, keeping the error data unchanged where the max-valid matrix data is 1 and setting it to 0 where the matrix data is 0; the control module outputs M2 paths of error data to the convolutional layer 1 parameter updating module.
2. The architecture of claim 1, wherein the fully-connected output layer parameter updating module comprises a weight gradient calculation module, a weight updating module and a bias term updating module; the error data enter the weight gradient calculation module and the bias term updating module simultaneously; the weight gradient calculation module multiplies the error data by the forward input features to obtain the weight gradients and outputs them to the weight updating module; the weight updating module left-shifts its input according to the learning rate, adds the weight ωji read from the weight update RAM to obtain the new, updated weight, and stores it in the weight update RAM; the bias term updating module left-shifts its input according to the learning rate, adds the original bias term bj read from the bias update RAM to obtain the new, updated bias term, and stores it in the bias update RAM.
3. The architecture for hardware-accelerated implementation of backward training of a convolutional neural network based on an FPGA of claim 1, wherein the fully-connected hidden layer parameter updating module comprises an upsampling module, a multiplier group, a weight updating module and a bias term updating module; N paths of error data enter the upsampling module and the bias term updating module; the bias term updating module reads the bias term bj from the bias update RAM, adds the error data left-shifted according to the learning rate, obtains the updated bias term bj and stores it in the bias update RAM; the N paths of error data pass through the upsampling module to N parallel multipliers, where they are multiplied by the corresponding N groups of forward input features from the pooling layer 2 RAM to obtain N paths of gradient data output to the weight updating module; the weight updating module left-shifts its input according to the learning rate, adds the original weight ωji read from the weight update RAM to obtain the new, updated weight, and stores it in the weight update RAM.
4. The architecture of claim 1, wherein the convolutional layer 2 parameter updating module comprises a convolution weight updating unit and a bias term updating module, the convolution weight updating unit comprising M2 parallel groups each consisting of M1 convolutional layer weight update basic operation units CONV_update_PE and a weight updating module; the M1 paths of error matrices and the forward input features enter the corresponding group of convolutional layer weight update basic operation units, and the convolutional layer weight update basic operation unit CONV_update_PE realizes the matrix convolution of the forward input features with the error matrices;
The convolutional layer weight update basic operation unit CONV_update_PE is realized with a shift-register-based serial matrix conversion structure and comprises a systolic array of L2 × L2 processing units Systolic_pe, L2-1 shift registers, 1 adder and 1 valid-control unit, where L2 × L2 is the input error matrix size; in the systolic array the L2 × L2 processing units Systolic_pe are arranged in L2 rows and L2 columns, the processing units Systolic_pe of each row are connected in series, and the first and second output ends of the units in columns 1 to L2-1 of each row are connected to the first and second input ends, respectively, of the next processing unit Systolic_pe; the L2-1 shift registers are connected end to end, the output end of each shift register being connected to the second input end of the next; the output end of the l-th shift register is also connected to the second input end of the first-column processing unit Systolic_pe of row l + 1 of the systolic array, l = 1, …, L2-1; the error matrix size is input to the control ends of the L2-1 shift registers to control their depth, and the forward input features are output to the second input end of the processing unit Systolic_pe of row 1, column 1 and to the input end of the first shift register; the first input end of each processing unit Systolic_pe in column 1 is 0; the first output end of the last-column processing unit Systolic_pe of each row is connected to the input end of the adder, which sums the L2 inputs and outputs the result to the valid-control unit; the valid-control unit eliminates the invalid operation matrices;
The processing unit Systolic_pe in the convolutional layer weight update basic operation unit comprises 1 adder, 1 multiplier and 2 registers; the first input end of the processing unit Systolic_pe is connected to the input end of the first register and to one input end of the multiplier, the other input end of the multiplier receiving the input weight; one input end of the adder is connected to the output end of the multiplier, the other input end of the adder is connected to the second input end of the Systolic_pe, and the output end of the adder is connected to the input end of the second register; the output end of the first register is connected to the first output end of the processing unit Systolic_pe, and the output end of the second register to the second output end of the processing unit Systolic_pe.
CN201910504155.1A 2019-06-12 2019-06-12 Hardware acceleration realization device for convolutional neural network backward training based on FPGA Active CN110543939B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910504155.1A CN110543939B (en) 2019-06-12 2019-06-12 Hardware acceleration realization device for convolutional neural network backward training based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910504155.1A CN110543939B (en) 2019-06-12 2019-06-12 Hardware acceleration realization device for convolutional neural network backward training based on FPGA

Publications (2)

Publication Number Publication Date
CN110543939A true CN110543939A (en) 2019-12-06
CN110543939B CN110543939B (en) 2022-05-03

Family

ID=68709592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910504155.1A Active CN110543939B (en) 2019-06-12 2019-06-12 Hardware acceleration realization device for convolutional neural network backward training based on FPGA

Country Status (1)

Country Link
CN (1) CN110543939B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6768735B1 (en) * 2000-03-31 2004-07-27 Alcatel Method and apparatus for controlling signaling links in a telecommunications system
CN102681815A (en) * 2012-05-11 2012-09-19 深圳市清友能源技术有限公司 Signed multiply-accumulate algorithm method using adder tree structure
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator
CN206727986U (en) * 2017-04-19 2017-12-08 吉林大学 seismic data compression device based on FPGA
CN107656899A (en) * 2017-09-27 2018-02-02 深圳大学 A kind of mask convolution method and system based on FPGA
CN108090565A (en) * 2018-01-16 2018-05-29 电子科技大学 Accelerated method is trained in a kind of convolutional neural networks parallelization
CN108470190A (en) * 2018-03-09 2018-08-31 北京大学 The image-recognizing method of impulsive neural networks is customized based on FPGA
CN109784489A (en) * 2019-01-16 2019-05-21 北京大学软件与微电子学院 Convolutional neural networks IP kernel based on FPGA

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
XUSHEN HAN et al.: "CNN-MERP: An FPGA-based memory-efficient reconfigurable processor for forward and backward propagation of convolutional neural networks", 2016 IEEE 34TH INTERNATIONAL CONFERENCE ON COMPUTER DESIGN (ICCD) *
余奇: "Design and Implementation of a Deep Learning Accelerator Based on FPGA", China Masters' Theses Full-text Database, Information Science and Technology *
马玉奇: "Research on an SoPC Image Processing System Architecture Based on Nios II and a Hardware Acceleration Core", China Masters' Theses Full-text Database, Information Science and Technology *
魏小淞: "Research and Implementation of FPGA-Accelerated Convolutional Neural Network Training", China Masters' Theses Full-text Database, Information Science and Technology *
黄圳: "Research and Implementation of FPGA Hardware Acceleration of Deep Learning Algorithms", China Masters' Theses Full-text Database, Information Science and Technology *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461340A (en) * 2020-03-10 2020-07-28 北京百度网讯科技有限公司 Weight matrix updating method and device and electronic equipment
CN111461340B (en) * 2020-03-10 2023-03-31 北京百度网讯科技有限公司 Weight matrix updating method and device and electronic equipment
CN111680794A (en) * 2020-06-09 2020-09-18 北京环境特性研究所 Text generation device and method based on FPGA and electronic equipment
CN111832720A (en) * 2020-09-21 2020-10-27 电子科技大学 Configurable neural network reasoning and online learning fusion calculation circuit
CN112183668B (en) * 2020-11-03 2022-07-22 支付宝(杭州)信息技术有限公司 Method and device for training service models in parallel
CN112183668A (en) * 2020-11-03 2021-01-05 支付宝(杭州)信息技术有限公司 Method and device for training service models in parallel
CN112926733A (en) * 2021-03-10 2021-06-08 之江实验室 Special chip for voice keyword detection
CN113344179A (en) * 2021-05-31 2021-09-03 哈尔滨理工大学 IP core of binary convolution neural network algorithm based on FPGA
CN113344179B (en) * 2021-05-31 2022-06-14 哈尔滨理工大学 IP core of binary convolution neural network algorithm based on FPGA
CN113298237A (en) * 2021-06-23 2021-08-24 东南大学 Convolutional neural network on-chip training accelerator based on FPGA
CN113298237B (en) * 2021-06-23 2024-05-14 东南大学 Convolutional neural network on-chip training accelerator based on FPGA
CN114615112A (en) * 2022-02-25 2022-06-10 中国人民解放军国防科技大学 FPGA-based channel equalizer, network interface and network equipment
CN114615112B (en) * 2022-02-25 2023-09-01 中国人民解放军国防科技大学 Channel equalizer, network interface and network equipment based on FPGA

Also Published As

Publication number Publication date
CN110543939B (en) 2022-05-03

Similar Documents

Publication Publication Date Title
CN110543939B (en) Hardware acceleration realization device for convolutional neural network backward training based on FPGA
CN110263925B (en) Hardware acceleration implementation device for convolutional neural network forward prediction based on FPGA
Ardakani et al. An architecture to accelerate convolution in deep neural networks
CN111459877B (en) Winograd YOLOv2 target detection model method based on FPGA acceleration
Gupta et al. Deep learning with limited numerical precision
CN110321997B (en) High-parallelism computing platform, system and computing implementation method
CN107704916A (en) A kind of hardware accelerator and method that RNN neutral nets are realized based on FPGA
CN110705703B (en) Sparse neural network processor based on systolic array
Choi et al. An energy-efficient deep convolutional neural network training accelerator for in situ personalization on smart devices
CN110851779B (en) Systolic array architecture for sparse matrix operations
US20220350662A1 (en) Mixed-signal acceleration of deep neural networks
Xiao et al. FPGA implementation of CNN for handwritten digit recognition
Li et al. An efficient hardware architecture for activation function in deep learning processor
Sommer et al. Efficient hardware acceleration of sparsely active convolutional spiking neural networks
CN114519425A (en) Convolution neural network acceleration system with expandable scale
CN111582465A (en) Convolutional neural network acceleration processing system and method based on FPGA and terminal
CN111275167A (en) High-energy-efficiency pulse array framework for binary convolutional neural network
KR20200020117A (en) Deep learning apparatus for ANN with pipeline architecture
Wu et al. A 3.89-GOPS/mW scalable recurrent neural network processor with improved efficiency on memory and computation
CN110766136B (en) Compression method of sparse matrix and vector
CN113052299A (en) Neural network memory computing device based on lower communication bound and acceleration method
CN109978143B (en) Stack type self-encoder based on SIMD architecture and encoding method
CN112836793B (en) Floating point separable convolution calculation accelerating device, system and image processing method
Lu et al. SparseNN: A performance-efficient accelerator for large-scale sparse neural networks
Niknia et al. Nanoscale Accelerators for Artificial Neural Networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant