CN106228240B - Deep convolution neural network implementation method based on FPGA - Google Patents

Deep convolution neural network implementation method based on FPGA

Info

Publication number
CN106228240B
CN106228240B (application CN201610615714.2A)
Authority
CN
China
Prior art keywords
convolution
calculation
matrix
fpga
floating point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610615714.2A
Other languages
Chinese (zh)
Other versions
CN106228240A (en)
Inventor
王展雄
周光朕
冯瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN201610615714.2A priority Critical patent/CN106228240B/en
Publication of CN106228240A publication Critical patent/CN106228240A/en
Application granted granted Critical
Publication of CN106228240B publication Critical patent/CN106228240B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Neurology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of digital image processing and pattern recognition, and in particular relates to an FPGA-based implementation method for deep convolutional neural networks. The hardware platform of the invention is a Xilinx ZYNQ-7030 programmable system-on-chip (SoC), which integrates an FPGA and an ARM Cortex-A9 processor. First, the trained network model parameters are loaded to the FPGA end; the input data is then preprocessed at the ARM end and the result transmitted to the FPGA end, where the convolution and down-sampling computations of the deep convolutional neural network are performed; the resulting data feature vectors are transmitted back to the ARM end, which completes the feature classification computation. By exploiting the FPGA's fast parallel processing and highly efficient, extremely low-power computation, the invention implements the convolution computation, the most complex part of the deep convolutional neural network model, greatly improving algorithm efficiency and reducing power consumption while preserving algorithm accuracy.

Description

Deep convolution neural network implementation method based on FPGA
Technical Field
The invention belongs to the technical field of digital image processing and pattern recognition, and particularly relates to a method for realizing a deep convolutional neural network model on an FPGA hardware platform.
Background
With the rapid development of computer technology and the Internet, data volumes have grown explosively, and the intelligent analysis and processing of massive data has become the key to exploiting the value of that data. Artificial intelligence is an effective means of extracting valuable information from massive data, and in recent years it has made breakthrough progress in application fields such as computer vision, speech recognition and natural language processing. A representative example is the deep learning algorithm model based on deep convolutional neural networks.
Convolutional Neural Networks (CNNs) were inspired by neuroscience research. Over more than 20 years of evolution, they have achieved remarkable theoretical and practical results in fields such as pattern recognition and human-machine game playing; in a famous human-machine Go match, the artificial intelligence system AlphaGo, based on a CNN combined with Monte Carlo tree search, defeated the world Go champion Lee Sedol by a score of 4:1. A typical CNN algorithm model consists of two parts: a feature extractor and a classifier. The feature extractor generates low-dimensional feature vectors from the input data and is robust to variations in the data. These vectors are then fed as input to the classifier (usually based on a traditional artificial neural network), which produces the classification result for the input data.
In implementations of the convolutional neural network algorithm model, convolution accounts for about 90% of the computation of the whole model [1]. Efficient computation of the convolutional layers is therefore the key to greatly improving the computational efficiency of a CNN algorithm model, and hardware acceleration of the convolution computation is an effective way to achieve this.
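As a rough, back-of-envelope illustration of why convolution dominates, the sketch below counts the multiply-accumulate (MAC) operations of a single valid-convolution layer; the layer sizes in the example are hypothetical and are not taken from this patent.

```python
def conv_macs(T, S, m, n):
    # T input maps of size m x m, S output maps, n x n kernels,
    # stride 1, valid convolution (output side m-n+1)
    out = m - n + 1
    return T * S * out * out * n * n

# Hypothetical layer: 3 input maps, 16 output maps, 32x32 inputs, 5x5 kernels
print(conv_macs(3, 16, 32, 5))  # 940800 MACs for this one layer
```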
At present, industry generally uses GPU clusters to implement deep learning algorithm models: deep neural network models are realized through large-scale parallel computing, with remarkably high efficiency and performance. However, the high power consumption of GPUs constrains their large-scale application and has become a bottleneck for the practical deployment of deep convolutional neural network algorithm models. FPGAs combine high-performance parallel computation with ultra-low power consumption, so implementing deep learning algorithm models on FPGAs is an inevitable direction of development in this field.
At present, there are three main schemes for implementing CNN by using FPGA:
(1) a soft-core CPU implements the control part and works with the FPGA to accelerate the algorithm;
(2) an ARM Cortex-A9 hard-core CPU embedded in an SoC implements the control part and works with the FPGA (field-programmable gate array) to accelerate the algorithm;
(3) a cloud server works with the FPGA to accelerate the algorithm.
Each of the three schemes has its advantages and disadvantages, and the acceleration scheme can be selected according to the application.
In a deep convolutional neural network, the convolutional layers account for more than 90% of the computation and are the key link in the whole network model; their computational efficiency directly determines the performance of the model implementation. However, implementing the convolution computation on an FPGA is difficult, mainly for the following reasons:
(1) deep learning algorithm models are still largely at the academic research stage, and large-scale industrial application still requires considerable algorithm and model optimization work; the algorithm model must therefore be continually optimized to suit different application scenarios, which requires a deep understanding of deep learning theory and algorithms;
(2) FPGA development is based on low-level hardware description languages and suits situations where the algorithm model is relatively stable; the constantly changing deep learning algorithm models make their implementation on FPGAs very difficult;
(3) implementing deep convolutional neural networks on FPGAs requires extensive FPGA engineering experience. The FPGA's operating clock frequency and the output delay (latency) of modules such as multipliers are in tension: the higher the clock frequency, the longer the module output latency, and the lower the clock frequency, the shorter the latency. Reasonably balanced parameters must be found through manual experimentation guided by engineering experience.
Disclosure of Invention
The invention aims to provide a method for realizing a deep convolutional neural network model with high efficiency and low power consumption, so as to solve the problems of high power consumption and low efficiency of the current deep learning model based on a GPU or a CPU.
The invention optimizes the FPGA hardware design, effectively reduces the resource consumption and can realize the deep convolution neural network model on a low-end FPGA hardware platform.
The method for implementing a deep convolutional neural network model provided by the invention runs on a Xilinx ZYNQ-7030 programmable SoC hardware platform, which integrates an FPGA and an ARM Cortex-A9 processor. First, the trained network model parameters are loaded to the FPGA end; the input data is then preprocessed at the ARM end and the result transmitted to the FPGA end, where the convolution and down-sampling computations of the deep convolutional neural network are performed; the resulting data feature vectors are transmitted back to the ARM end to complete the feature classification computation. The method comprises the following four processes: model parameter loading, input data preprocessing, convolution and down-sampling computation, and classification computation:
1. the model parameter loading process comprises the following steps:
(1) training a deep convolutional neural network model offline;
(2) loading training model parameters at the ARM end;
(3) transmitting the model parameters to the FPGA;
2. the input data preprocessing operation process comprises the following steps:
(1) normalization processing;
(2) transmitting the processing result to the FPGA;
(3) storing the data to a Block RAM at an FPGA end;
3. the convolution and downsampling calculation process is as follows:
(1) initializing the convolution pipeline;
(2) performing convolution calculation;
(3) performing pooling downsampling calculation;
(4) reinitializing the convolution pipeline and performing the multi-layer convolution and down-sampling computations;
4. the classification calculation process comprises the following steps:
(1) transmitting the feature vector back to the ARM end;
(2) calculating through a classification model;
(3) and outputting a classification result.
These processes are described in detail as follows:
step 1, loading training model parameters
(1) Loading parameters of a deep convolutional neural network model trained offline at an ARM end;
(2) transmitting the parameters of the training model to an FPGA end;
(3) the FPGA end buffers the parameters through a FIFO and then stores them in Block RAM (block random access memory);
step 2, preprocessing a deep convolution neural network model
(1) Normalizing the input data to meet the requirement of model convolution operation;
(2) transmitting the ARM end normalized data to the FPGA end by using an APB bus;
(3) the FPGA end stores the normalized data into Block RAM after FIFO buffering;
step 3, convolution and down-sampling calculation
A deep pipeline implementation is designed for the convolutional layer computation, which carries the largest computational load in the deep convolutional neural network model. Suppose the network model has H convolutional layers and H pooling layers. The input of the h-th (h = 1,2,…,H) convolutional layer is T m×m 32-bit floating-point matrices, its output is S (m-n+1)×(m-n+1) 32-bit floating-point matrices, and its convolution kernels are K n×n 32-bit floating-point matrices (n ≤ m); the input-data sliding window is n×n, with horizontal sliding stride 1 and vertical sliding stride 1.
(1) Initializing a convolution operation pipeline
Define n+1 data cache registers P_0, P_1, …, P_(n-1), P_n, each holding m data values. Of these, n registers (P_((i-1)%(n+1)+0), P_((i-1)%(n+1)+1), …, P_((i-1)%(n+1)+n-1)) store the data of the i-th (i = 1,2,…,m-n+1) n×m sub-matrix of the t-th (t = 1,2,…,T) input data matrix, where % denotes the remainder operation; if (i-1)%(n+1)+x > n, the indices wrap around, i.e. (i-1)%(n+1)+x = 0, (i-1)%(n+1)+x+1 = 1, …, where x = 0,1,…,n-1. If n < m, register P_((i-1)%(n+1)+n) stores the (i+n)-th row of the input data matrix; it is initialized in parallel with the convolution computation, which reduces FPGA idle cycles and improves computational efficiency.
Define 1 convolution kernel matrix cache register W, which stores the weight data of the k-th (k = 1,2,…,K) n×n convolution kernel matrix.
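To make the index arithmetic above concrete, here is a minimal software sketch of the (n+1)-register ring buffer, under the assumption that the wrap-around rule described above is equivalent to taking register indices modulo n+1.

```python
# Ring-buffer indexing for the n+1 row registers P_0 .. P_n (a sketch, not RTL).
def row_registers(i, n):
    """Registers holding the n rows of sub-matrix i (input rows i .. i+n-1)."""
    return [(i - 1 + x) % (n + 1) for x in range(n)]

def prefetch_register(i, n):
    """Register refilled with input row i+n while sub-matrix i is convolved."""
    return (i - 1 + n) % (n + 1)

n = 5
for i in (1, 2, 3):
    print(i, row_registers(i, n), prefetch_register(i, n))
# i=1: [0, 1, 2, 3, 4] 5  -> P_5 is loaded with row 6 during the convolution
# i=2: [1, 2, 3, 4, 5] 0  -> P_0 is recycled to hold row 7
# i=3: [2, 3, 4, 5, 0] 1
```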
(2) h-th convolutional layer computation
The convolution of the t-th input data matrix with the k-th convolution kernel of the h-th convolutional layer of the network is completed, and the result is activated with the Sigmoid function.
Specifically, while each convolution computation is performed, the data cache register P_((i-1)%(n+1)+n), holding the (i+n)-th row, is initialized and serves as the buffered input data for the convolution of the (i+1)-th sub-matrix, realizing the cyclic convolution.
The Sigmoid activation of the convolution results is realized at the FPGA end by constructing the Sigmoid function from Floating-point IP cores; the expression of the Sigmoid function is:
f(x) = 1 / (1 + e^(-x))
The specific steps are as follows:
as described above, the input data is an m × m floating point matrix, the convolution kernel is an n × n floating point matrix, the sliding window scale is n × n, the horizontal sliding step is 1, and the vertical sliding step is 1, then the convolution result is an (m-n +1) x (m-n +1) floating point matrix, offset b11 (offline training model parameter) is added to each element of the matrix, and after activation by using a Sigmoid function, the result is an (m-n +1) x (m-n +1) floating point matrix, which is stored in the Block RAM.
After each convolution computation, the convolution kernel matrix cache register W is re-initialized and the next convolution is performed; repeating this cyclic convolution yields S (m-n+1)×(m-n+1) floating-point matrices, which are stored in Block RAM.
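The convolution-bias-activation step described above can be modeled in software as follows; this is a minimal NumPy sketch of the arithmetic (valid convolution, unit strides), not the FPGA pipeline itself, and the argument names are placeholders.

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)), the activation built from Floating-point IP cores
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_valid(x, w):
    """Valid convolution of an m x m input with an n x n kernel, stride 1."""
    m, n = x.shape[0], w.shape[0]
    out = np.empty((m - n + 1, m - n + 1), dtype=np.float32)
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(x[r:r + n, c:c + n] * w)
    return out

def conv_layer(x, w, b):
    # Convolve, add the offline-trained offset b, then apply the Sigmoid
    return sigmoid(conv2d_valid(x, w) + b)
```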
(3) h-th pooling layer computation
The pooling of the h-th convolutional layer results is computed; the result is S [(m-n+1)/2]×[(m-n+1)/2] floating-point matrices, which are stored in Block RAM. The specific steps are as follows: the sliding window over the convolution results is 2×2 with stride 2, and pooling uses average down-sampling, i.e. the elements of each 2×2 block are summed and averaged, yielding S [(m-n+1)/2]×[(m-n+1)/2] floating-point matrices that serve as the input matrices for the (h+1)-th convolutional layer computation.
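The average down-sampling just described reduces each 2×2 block to its mean; a minimal NumPy sketch, assuming the input sides are even:

```python
import numpy as np

def avg_pool_2x2(x):
    """2 x 2 average down-sampling with stride 2."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

# A 24 x 24 convolution result pools down to 12 x 12
y = avg_pool_2x2(np.zeros((24, 24), dtype=np.float32))
print(y.shape)  # (12, 12)
```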
Step 4, classification calculation
The convolution and pooling results are transmitted back to the ARM end for the classification operation. The specific steps are as follows: the FPGA end transmits the convolution-pooling result matrices in Block RAM to the ARM end through a FIFO buffer and the APB bus; the ARM end completes the data classification computation using a Softmax operation, obtaining and outputting the classification result for the input data.
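A software model of the ARM-end classification step is sketched below. The patent only states that a Softmax operation completes the classification; the fully-connected weights W_fc and b_fc here are hypothetical placeholders for the trained classifier parameters.

```python
import numpy as np

def softmax(z):
    # Numerically stable Softmax
    e = np.exp(z - np.max(z))
    return e / e.sum()

def classify(feature_maps, W_fc, b_fc):
    # Flatten the feature matrices returned by the FPGA end and classify
    v = np.concatenate([f.ravel() for f in feature_maps])
    return int(np.argmax(softmax(W_fc @ v + b_fc)))
```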
The method of the invention has the following main characteristics:
(1) a deep convolutional neural network model is realized on a low-end FPGA;
(2) the convolution calculation in the deep convolution neural network model is accelerated by utilizing a pipeline calculation mode;
(3) the control is implemented with the SoC's embedded ARM processor, which is compact, low-power and efficient and can be widely applied in embedded systems.
By exploiting the FPGA's fast parallel processing and highly efficient, extremely low-power computation, the invention implements the convolution computation, the most complex part of the deep convolutional neural network model, and greatly improves algorithm efficiency while preserving algorithm accuracy. Compared with traditional CPU- or GPU-based implementations of deep convolutional neural networks, the method effectively increases the computation speed while greatly reducing power consumption, solving the problems of long run times and high power consumption that arise when a CPU or GPU is used to implement a deep convolutional neural network.
Drawings
FIG. 1 is a flow diagram of an FPGA-based deep convolutional neural network implementation.
FIG. 2 shows a portion of the MNIST database.
Fig. 3 is a schematic diagram of matrix transposition.
FIG. 4 is a schematic diagram of a pipeline computation.
FIG. 5 is a schematic diagram of convolution calculations.
FIG. 6 is a diagram of a deep convolutional neural network architecture.
Fig. 7 is a schematic view of the downsampling calculation.
FIG. 8 shows simulation results of a deep convolutional neural network model based on FPGA.
FIG. 9 shows the measured classification result for the digit "7" (MNIST database).
Detailed Description
The following describes, with reference to the drawings, a concrete implementation of a handwritten-character recognition algorithm using a deep convolutional neural network model on an FPGA hardware platform according to the method of the invention. (The deep convolutional neural network model consists of an input layer I, a first convolutional layer C1, a first down-sampling layer S1, a second convolutional layer C2, a second down-sampling layer S2, and a fully-connected Softmax layer. The input picture size is 28×28; the first convolutional layer contains 1 convolution kernel of size 5×5, and the second convolutional layer contains 3 convolution kernels of size 5×5.)
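The layer sizes implied by this architecture can be verified arithmetically; a short sketch (valid convolution with stride 1, 2×2 average pooling with stride 2):

```python
# Shape walkthrough for the I -> C1 -> S1 -> C2 -> S2 -> Softmax network above
m = 28                             # input picture size
c1 = m - 5 + 1;  assert c1 == 24   # C1: one 5x5 kernel -> one 24x24 map
s1 = c1 // 2;    assert s1 == 12   # S1: 2x2 average pooling -> 12x12
c2 = s1 - 5 + 1; assert c2 == 8    # C2: three 5x5 kernels -> three 8x8 maps
s2 = c2 // 2;    assert s2 == 4    # S2: 2x2 average pooling -> three 4x4 maps
print(3 * s2 * s2)                 # 48 features enter the Softmax layer
```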
The specific steps of implementing the handwritten-character recognition algorithm of the deep convolutional neural network model on the FPGA are shown in FIG. 1.
1. Loading trained model parameters
First, the CNN functions in DeepLearnToolbox-master are taken as a reference and modified to some extent (the convolution function is rewritten; the neural network is changed to 5 layers: one input layer, two convolutional layers and two down-sampling layers; the first convolutional layer has 1 convolution kernel of size 5×5 and the second convolutional layer has 3 convolution kernels of size 5×5; the sliding stride of both down-sampling layers is 2 with a 2×2 sliding window; and the number of training passes is set to 10), and the deep convolutional neural network is trained with Matlab. The trained weight and offset parameters are then loaded at the ARM end and finally transmitted to the FPGA end, where the model parameters are buffered through a FIFO and stored in Block RAM.
2. Preprocessing
The MNIST handwriting image shown in FIG. 2 is read into memory, normalized by dividing each pixel by 255, and transposed as shown in FIG. 3.
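This preprocessing amounts to a scale to [0, 1] followed by a transpose; a minimal sketch (the random image is a stand-in for the MNIST data of FIG. 2):

```python
import numpy as np

def preprocess(img):
    """Normalize a 28 x 28 uint8 image by 255 and transpose it (cf. FIG. 3)."""
    return (img.astype(np.float32) / 255.0).T

img = np.random.randint(0, 256, (28, 28), dtype=np.uint8)  # stand-in input
x = preprocess(img)
print(x.shape, float(x.min()), float(x.max()))
```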
3. Transmitting the pre-processing result to the FPGA
The preprocessing result is transmitted to the FPGA end through the APB bus on the ZYNQ-7030 SoC and, after FIFO buffering, stored in Block RAM.
4. Initializing a convolution operation pipeline
As shown in FIG. 4, 6 data cache registers P_0, P_1, P_2, P_3, P_4, P_5 are defined, each holding 28 floating-point values. Of these, 5 registers (P_((i-1)%(5+1)+0), P_((i-1)%(5+1)+1), …, P_((i-1)%(5+1)+4)) store the data of the i-th (i = 1,2,…,24) 5×28 sub-matrix of the input image matrix, where % denotes the remainder operation; if (i-1)%(5+1)+x > 5, the indices wrap around, i.e. (i-1)%(5+1)+x = 0, (i-1)%(5+1)+x+1 = 1, …, where x = 0,1,…,4. Register P_((i-1)%(5+1)+5) stores the (i+5)-th row of the input image matrix.
Define 1 convolution kernel matrix cache register W, which stores the weight data of the 1st 5×5 convolution kernel matrix of the 1st convolutional layer.
5. Performing the 1st convolutional layer calculation
The convolution of the network's 1st convolutional layer input image matrix with the 1st convolution kernel of the 1st convolutional layer is completed, and the result is activated with the Sigmoid function.
While the convolution computation is performed, the data cache register P_((i-1)%(5+1)+5), holding the (i+5)-th row, is initialized and serves as the buffered input data for the convolution of the (i+1)-th sub-matrix, realizing the cyclic convolution, as shown in FIG. 5.
The Sigmoid activation of the convolution results is realized at the FPGA end by constructing the Sigmoid function from Floating-point IP cores. The Sigmoid function is expressed as:
f(x) = 1 / (1 + e^(-x))
The specific steps are as follows:
as described above, the input image is a 28 × 28 floating point matrix, the convolution kernel is a 5 × 5 floating point matrix, the sliding window scale is 5 × 5, the horizontal sliding step is 1, and the vertical sliding step is 1, so that the convolution result is a 24 × 24 floating point matrix, each element of the matrix is added with an offset b11 (offline training model parameter), and after activation by using a Sigmoid function, the result is a 24 × 24 floating point matrix, and the floating point matrix is stored in the Block RAM.
After 1 convolution calculation, the calculation result is 1 matrix of 24 × 24 floating point numbers, and the matrix is stored in the Block RAM.
6. Performing the 1st pooling layer calculation
The pooling of the 1st convolutional layer result is computed as shown in FIG. 6; the result is 1 matrix of 12×12 floating-point numbers, stored in Block RAM. The specific steps are as follows: the sliding window over the convolution result is 2×2 with stride 2, and pooling uses average down-sampling, i.e. the elements of each 2×2 block are summed and averaged, yielding 1 matrix of 12×12 floating-point numbers that serves as the input matrix for the 2nd convolutional layer computation, as shown in FIG. 7.
7. Reinitializing a convolution pipeline
As shown in FIG. 4, the 6 data cache registers P_0, P_1, P_2, P_3, P_4, P_5 are re-initialized, each now holding 12 floating-point values. Of these, 5 registers (P_((i-1)%(5+1)+0), P_((i-1)%(5+1)+1), …, P_((i-1)%(5+1)+4)) store the data of the i-th (i = 1,2,…,8) 5×12 sub-matrix of the input matrix, where % denotes the remainder operation; if (i-1)%(5+1)+x > 5, the indices wrap around, i.e. (i-1)%(5+1)+x = 0, (i-1)%(5+1)+x+1 = 1, …, where x = 0,1,…,4. Register P_((i-1)%(5+1)+5) stores the (i+5)-th row of the input matrix.
The convolution kernel matrix cache register W is re-initialized to store the weight data of the 1st 5×5 convolution kernel matrix of the 2nd convolutional layer.
8. Performing the 2nd convolutional layer calculation
The convolution of the network's 2nd convolutional layer input data matrix with the 1st convolution kernel of the 2nd convolutional layer is completed, and the result is activated with the Sigmoid function.
The convolution kernel matrix cache register W is then re-initialized to store the weight data of the 2nd 5×5 convolution kernel matrix of the 2nd convolutional layer; the convolution of the input data matrix with the 2nd convolution kernel of the 2nd convolutional layer is completed, and the result is activated with the Sigmoid function.
The register W is re-initialized again to store the weight data of the 3rd 5×5 convolution kernel matrix of the 2nd convolutional layer; the convolution of the input data matrix with the 3rd convolution kernel of the 2nd convolutional layer is completed, and the result is activated with the Sigmoid function.
While each convolution computation is performed, the data cache register P_((i-1)%(5+1)+5), holding the (i+5)-th row, is initialized and serves as the buffered input data for the convolution of the (i+1)-th sub-matrix, realizing the cyclic convolution, as shown in FIG. 5.
The specific steps are as follows: as described above, the input is a 12×12 floating-point matrix and the convolution kernels are 3 matrices of 5×5 floating-point numbers; the sliding window is 5×5 with horizontal sliding stride 1 and vertical sliding stride 1, so the convolution results are 3 matrices of 8×8 floating-point numbers. The offsets b21, b22 and b23 (offline-trained model parameters) are added element-wise to the 3 matrices respectively, and after activation with the Sigmoid function the results are 3 matrices of 8×8 floating-point numbers, which are stored in Block RAM.
After the 2nd convolutional layer's convolution computations, the result is 3 matrices of 8×8 floating-point numbers, stored in Block RAM.
9. Performing the 2nd pooling layer calculation
The pooling of the 2nd convolutional layer results is computed as shown in FIG. 6; the result is 3 matrices of 4×4 floating-point numbers, stored in Block RAM. The specific steps are as follows: the sliding window over the convolution results is 2×2 with stride 2, and pooling uses average down-sampling, i.e. the elements of each 2×2 block are summed and averaged, yielding 3 matrices of 4×4 floating-point numbers that serve as the input to the Softmax layer, as shown in FIG. 7.
10. Classification calculation
The convolution and pooling results are transmitted back to the ARM end for the classification operation. The specific steps are as follows: the FPGA end transmits the convolution-pooling result matrices in Block RAM to the ARM end through a FIFO buffer and the APB bus; the ARM end completes the data classification computation using a Softmax operation, obtaining and outputting the classification result for the input picture.
The simulation results of processing the digit picture "7" from the MNIST database with the above method are shown in FIG. 8.
The measured classification results of processing the digit picture "7" from the MNIST database with the above method are shown in FIG. 9.
References
[1] Cong J, Xiao B. Minimizing Computation in Convolutional Neural Networks[M]// Artificial Neural Networks and Machine Learning – ICANN 2014. Springer International Publishing, 2014: 33-7.
[2] Farabet C, Poulet C, Han J Y, et al. CNP: An FPGA-based processor for Convolutional Networks[J]. International Conference on Field Programmable Logic & Applications, 2009: 32-37.
[3] Gokhale V, Jin J, Dundar A, et al. A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks[C]// IEEE Embedded Vision Workshop, 2014: 696-701.
[4] Zhang C, Li P, Sun G, et al. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks[C]// ACM/SIGDA International Symposium, 2015: 161-170.
[5] Krizhevsky A, Sutskever I, Hinton G E. ImageNet Classification with Deep Convolutional Neural Networks[J]. Advances in Neural Information Processing Systems, 2012, 25(2): 2012.
[6] Farabet C, Martini B, Corda B, et al. NeuFlow: A runtime reconfigurable dataflow processor for vision[J]. 2011, 9(6): 109-116.
[7] Matai J, Irturk A, Kastner R. Design and Implementation of an FPGA-Based Real-Time Face Recognition System[C]// IEEE International Symposium on Field-Programmable Custom Computing Machines, 2011: 97-100.
[8] Sankaradas M, Jakkula V, Cadambi S, et al. A Massively Parallel Coprocessor for Convolutional Neural Networks[C]// IEEE International Conference on Application-Specific Systems, Architectures and Processors. IEEE Computer Society, 2009: 53-60.

Claims (1)

1. A deep convolution neural network implementation method based on FPGA is characterized by comprising the following specific steps:
step 1, loading training model parameters
(1) Loading parameters of a deep convolutional neural network model trained offline at an ARM end;
(2) transmitting the parameters of the training model to an FPGA end;
(3) the FPGA end buffers the parameters through a FIFO and then stores them in the block random access memory;
step 2, preprocessing a deep convolution neural network model
(1) Normalizing the input data to meet the requirement of model convolution operation;
(2) transmitting the ARM end normalized data to the FPGA end by using an APB bus;
(3) the FPGA end stores the normalized data into the block random access memory after FIFO buffering;
step 3, convolution and down-sampling calculation
setting the network model to have H convolutional layers and H pooling layers, wherein the input of the h-th convolutional layer is T m×m 32-bit floating-point matrices, h = 1,2,…,H; the output is S (m-n+1)×(m-n+1) 32-bit floating-point matrices; the convolution kernels are K n×n 32-bit floating-point matrices, n ≤ m; the input-data sliding window is n×n, the horizontal sliding stride is 1, and the vertical sliding stride is 1;
(1) initializing a convolution operation pipeline
defining n+1 data cache registers P_0, P_1, …, P_(n-1), P_n, each register storing m data values, wherein n registers P_((i-1)%(n+1)+0), P_((i-1)%(n+1)+1), …, P_((i-1)%(n+1)+n-1) store the n×m data of the i-th sub-matrix of the t-th input data matrix, t = 1,2,…,T, i = 1,2,…,m-n+1; % denotes the remainder operation; if (i-1)%(n+1)+x > n, then (i-1)%(n+1)+x = 0, (i-1)%(n+1)+x+1 = 1, …, where x = 0,1,…,n-1; if n < m, register P_((i-1)%(n+1)+n) stores the (i+n)-th row of the input data matrix and is initialized in parallel during the convolution computation, reducing FPGA idle cycles and improving computational efficiency;
defining 1 convolution kernel matrix cache register W, which stores the weight data of the k-th n×n convolution kernel matrix, k = 1,2,…,K;
(2) h-th convolutional layer computation
completing the convolution of the t-th input data matrix with the k-th convolution kernel of the h-th convolutional layer of the network, and activating the result with the Sigmoid function;
while each convolution computation is performed, initializing the data cache register P_((i-1)%(n+1)+n), which serves as the buffered input data for the convolution of the (i+1)-th sub-matrix, realizing the cyclic convolution;
constructing the Sigmoid function at the FPGA end from floating-point IP cores to realize the activation of the convolution results, the expression of the Sigmoid function being:
f(x) = 1 / (1 + e^(-x))
the method comprises the following specific steps:
as described above, the input data is an m×m floating-point matrix, the convolution kernel is an n×n floating-point matrix, the sliding window is n×n, the horizontal sliding stride is 1, and the vertical sliding stride is 1, so the convolution result is an (m-n+1)×(m-n+1) floating-point matrix; the offset b11, an offline-trained model parameter, is added to each element of the matrix, and after activation with the Sigmoid function the result is an (m-n+1)×(m-n+1) floating-point matrix, which is stored in Block RAM;
after each convolution computation, re-initializing the convolution kernel matrix cache register W and performing the next convolution computation; repeating this cyclic convolution yields S (m-n+1)×(m-n+1) floating-point matrices, which are stored in Block RAM;
(3) h-th pooling layer computation
realizing the pooling of the h-th convolutional layer results, the result being S [(m-n+1)/2]×[(m-n+1)/2] floating-point matrices, which are stored in Block RAM; the specific steps are as follows: the sliding window over the convolution results is 2×2 with stride 2, and pooling uses average down-sampling, i.e. the elements of each 2×2 block are summed and averaged, yielding S [(m-n+1)/2]×[(m-n+1)/2] floating-point matrices as the input matrices for the (h+1)-th convolutional layer computation;
step 4, classified calculation
transmitting the convolution and pooling results back to the ARM end for the classification operation; the specific steps are as follows: the FPGA end transmits the convolution-pooling result matrices in Block RAM to the ARM end through a FIFO buffer and the APB bus, and the ARM end completes the data classification computation using a Softmax operation, obtaining and outputting the classification result for the input data.
CN201610615714.2A 2016-07-30 2016-07-30 Deep convolution neural network implementation method based on FPGA Active CN106228240B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610615714.2A CN106228240B (en) 2016-07-30 2016-07-30 Deep convolution neural network implementation method based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610615714.2A CN106228240B (en) 2016-07-30 2016-07-30 Deep convolution neural network implementation method based on FPGA

Publications (2)

Publication Number Publication Date
CN106228240A CN106228240A (en) 2016-12-14
CN106228240B true CN106228240B (en) 2020-09-01

Family

ID=57536621

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610615714.2A Active CN106228240B (en) 2016-07-30 2016-07-30 Deep convolution neural network implementation method based on FPGA

Country Status (1)

Country Link
CN (1) CN106228240B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11836608B2 (en) 2020-06-23 2023-12-05 Stmicroelectronics S.R.L. Convolution acceleration with embedded vector decompression
US11880759B2 (en) 2020-02-18 2024-01-23 Stmicroelectronics S.R.L. Vector quantization decoding hardware unit for real-time dynamic decompression for parameters of neural networks

Families Citing this family (100)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018060268A (en) * 2016-10-03 2018-04-12 株式会社日立製作所 Recognition device and learning system
KR20180073314A (en) * 2016-12-22 2018-07-02 삼성전자주식회사 Convolutional neural network system and operation method thererof
CN106650691A (en) * 2016-12-30 2017-05-10 北京旷视科技有限公司 Image processing method and image processing device
CN106529517B (en) * 2016-12-30 2019-11-01 北京旷视科技有限公司 Image processing method and image processing equipment
US20180189229A1 (en) 2017-01-04 2018-07-05 Stmicroelectronics S.R.L. Deep convolutional network heterogeneous architecture
CN108269224B (en) 2017-01-04 2022-04-01 意法半导体股份有限公司 Reconfigurable interconnect
CN106909970B (en) * 2017-01-12 2020-04-21 南京风兴科技有限公司 Approximate calculation-based binary weight convolution neural network hardware accelerator calculation device
CN106875011B (en) * 2017-01-12 2020-04-17 南京风兴科技有限公司 Hardware architecture of binary weight convolution neural network accelerator and calculation flow thereof
CN106682702A (en) * 2017-01-12 2017-05-17 张亮 Deep learning method and system
CN108304922B (en) * 2017-01-13 2020-12-15 华为技术有限公司 Computing device and computing method for neural network computing
WO2018137177A1 (en) * 2017-01-25 2018-08-02 北京大学 Method for convolution operation based on nor flash array
CN106779060B (en) * 2017-02-09 2019-03-08 武汉魅瞳科技有限公司 A kind of calculation method for the depth convolutional neural networks realized suitable for hardware design
CN106875012B (en) * 2017-02-09 2019-09-20 武汉魅瞳科技有限公司 A kind of streamlined acceleration system of the depth convolutional neural networks based on FPGA
TWI607389B (en) * 2017-02-10 2017-12-01 耐能股份有限公司 Pooling operation device and method for convolutional neural network
CN106991474B (en) * 2017-03-28 2019-09-24 华中科技大学 The parallel full articulamentum method for interchanging data of deep neural network model and system
CN106991999B (en) * 2017-03-29 2020-06-02 北京小米移动软件有限公司 Voice recognition method and device
CN108804974B (en) * 2017-04-27 2021-07-02 深圳鲲云信息科技有限公司 Method and system for estimating and configuring resources of hardware architecture of target detection algorithm
CN108229645B (en) * 2017-04-28 2021-08-06 北京市商汤科技开发有限公司 Convolution acceleration and calculation processing method and device, electronic equipment and storage medium
CN107229969A (en) * 2017-06-21 2017-10-03 郑州云海信息技术有限公司 A kind of convolutional neural networks implementation method and device based on FPGA
CN107451654B (en) * 2017-07-05 2021-05-18 深圳市自行科技有限公司 Acceleration operation method of convolutional neural network, server and storage medium
CN107451653A (en) * 2017-07-05 2017-12-08 深圳市自行科技有限公司 Computational methods, device and the readable storage medium storing program for executing of deep neural network
CN107451659B (en) * 2017-07-27 2020-04-10 清华大学 Neural network accelerator for bit width partition and implementation method thereof
CN107622305A (en) * 2017-08-24 2018-01-23 中国科学院计算技术研究所 Processor and processing method for neutral net
CN107689223A (en) * 2017-08-30 2018-02-13 北京嘉楠捷思信息技术有限公司 Audio identification method and device
CN110245751B (en) * 2017-08-31 2020-10-09 中科寒武纪科技股份有限公司 GEMM operation method and device
US10839286B2 (en) * 2017-09-14 2020-11-17 Xilinx, Inc. System and method for implementing neural networks in integrated circuits
CN107564522A (en) * 2017-09-18 2018-01-09 郑州云海信息技术有限公司 A kind of intelligent control method and device
CN107656899A (en) * 2017-09-27 2018-02-02 深圳大学 A kind of mask convolution method and system based on FPGA
CN107749044A (en) * 2017-10-19 2018-03-02 珠海格力电器股份有限公司 The pond method and device of image information
CN109754062B (en) * 2017-11-07 2024-05-14 上海寒武纪信息科技有限公司 Execution method of convolution expansion instruction and related product
CN107844833A (en) * 2017-11-28 2018-03-27 郑州云海信息技术有限公司 A kind of data processing method of convolutional neural networks, device and medium
CN108009631A (en) * 2017-11-30 2018-05-08 睿视智觉(深圳)算法技术有限公司 A kind of VGG-16 general purpose processing blocks and its control method based on FPGA
CN110574371B (en) * 2017-12-08 2021-12-21 百度时代网络技术(北京)有限公司 Stereo camera depth determination using hardware accelerators
CN109961133B (en) * 2017-12-14 2020-04-24 中科寒武纪科技股份有限公司 Integrated circuit chip device and related product
CN108416422B (en) * 2017-12-29 2024-03-01 国民技术股份有限公司 FPGA-based convolutional neural network implementation method and device
CN108388943B (en) * 2018-01-08 2020-12-29 中国科学院计算技术研究所 Pooling device and method suitable for neural network
CN108154229B (en) * 2018-01-10 2022-04-08 西安电子科技大学 Image processing method based on FPGA (field programmable Gate array) accelerated convolutional neural network framework
CN108362628A (en) * 2018-01-11 2018-08-03 天津大学 The n cell flow-sorting methods of flow cytometer are imaged based on polarizing diffraction
CN109643336A (en) * 2018-01-15 2019-04-16 深圳鲲云信息科技有限公司 Artificial intelligence process device designs a model method for building up, system, storage medium, terminal
CN109313723B (en) * 2018-01-15 2022-03-15 深圳鲲云信息科技有限公司 Artificial intelligence convolution processing method and device, readable storage medium and terminal
WO2019136764A1 (en) * 2018-01-15 2019-07-18 深圳鲲云信息科技有限公司 Convolutor and artificial intelligent processing device applied thereto
CN110178146B (en) * 2018-01-15 2023-05-12 深圳鲲云信息科技有限公司 Deconvolutor and artificial intelligence processing device applied by deconvolutor
US11874898B2 (en) 2018-01-15 2024-01-16 Shenzhen Corerain Technologies Co., Ltd. Streaming-based artificial intelligence convolution processing method and apparatus, readable storage medium and terminal
US11568232B2 (en) * 2018-02-08 2023-01-31 Quanta Computer Inc. Deep learning FPGA converter
CN108108809B (en) * 2018-03-05 2021-03-02 山东领能电子科技有限公司 Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
CN108537330B (en) * 2018-03-09 2020-09-01 中国科学院自动化研究所 Convolution computing device and method applied to neural network
CN108256636A (en) * 2018-03-16 2018-07-06 成都理工大学 A kind of convolutional neural networks algorithm design implementation method based on Heterogeneous Computing
CN108710892B (en) * 2018-04-04 2020-09-01 浙江工业大学 Cooperative immune defense method for multiple anti-picture attacks
CN108615076B (en) * 2018-04-08 2020-09-11 瑞芯微电子股份有限公司 Deep learning chip-based data storage optimization method and device
CN108470211B (en) * 2018-04-09 2022-07-12 郑州云海信息技术有限公司 Method and device for realizing convolution calculation and computer storage medium
CN108520300A (en) * 2018-04-09 2018-09-11 郑州云海信息技术有限公司 A kind of implementation method and device of deep learning network
CN110399976B (en) * 2018-04-25 2022-04-05 华为技术有限公司 Computing device and computing method
CN108549935B (en) * 2018-05-03 2021-09-10 山东浪潮科学研究院有限公司 Device and method for realizing neural network model
CN108595379A (en) * 2018-05-08 2018-09-28 济南浪潮高新科技投资发展有限公司 A kind of parallelization convolution algorithm method and system based on multi-level buffer
CN108805270B (en) * 2018-05-08 2021-02-12 华中科技大学 Convolutional neural network system based on memory
CN108805267B (en) * 2018-05-28 2021-09-10 重庆大学 Data processing method for hardware acceleration of convolutional neural network
CN108764182B (en) * 2018-06-01 2020-12-08 阿依瓦(北京)技术有限公司 Optimized acceleration method and device for artificial intelligence
CN108711429B (en) * 2018-06-08 2021-04-02 Oppo广东移动通信有限公司 Electronic device and device control method
CN109086879B (en) * 2018-07-05 2020-06-16 东南大学 Method for realizing dense connection neural network based on FPGA
CN109032781A (en) * 2018-07-13 2018-12-18 重庆邮电大学 A kind of FPGA parallel system of convolutional neural networks algorithm
CN109117949A (en) * 2018-08-01 2019-01-01 南京天数智芯科技有限公司 Flexible data stream handle and processing method for artificial intelligence equipment
CN109036459B (en) * 2018-08-22 2019-12-27 百度在线网络技术(北京)有限公司 Voice endpoint detection method and device, computer equipment and computer storage medium
CN109102070B (en) * 2018-08-22 2020-11-24 地平线(上海)人工智能技术有限公司 Preprocessing method and device for convolutional neural network data
CN109214506B (en) * 2018-09-13 2022-04-15 深思考人工智能机器人科技(北京)有限公司 Convolutional neural network establishing device and method based on pixels
US20200090046A1 (en) * 2018-09-14 2020-03-19 Huawei Technologies Co., Ltd. System and method for cascaded dynamic max pooling in neural networks
US20200090023A1 (en) * 2018-09-14 2020-03-19 Huawei Technologies Co., Ltd. System and method for cascaded max pooling in neural networks
CN109359732B (en) * 2018-09-30 2020-06-09 阿里巴巴集团控股有限公司 Chip and data processing method based on chip
CN109376843B (en) * 2018-10-12 2021-01-08 山东师范大学 FPGA-based electroencephalogram signal rapid classification method, implementation method and device
CN109146067B (en) * 2018-11-19 2021-11-05 东北大学 Policy convolution neural network accelerator based on FPGA
CN109670578A (en) * 2018-12-14 2019-04-23 北京中科寒武纪科技有限公司 Neural network first floor convolution layer data processing method, device and computer equipment
CN109711539B (en) * 2018-12-17 2020-05-29 中科寒武纪科技股份有限公司 Operation method, device and related product
CN109800867B (en) * 2018-12-17 2020-09-29 北京理工大学 Data calling method based on FPGA off-chip memory
CN109740748B (en) * 2019-01-08 2021-01-08 西安邮电大学 Convolutional neural network accelerator based on FPGA
CN109784483B (en) * 2019-01-24 2022-09-09 电子科技大学 FD-SOI (field-programmable gate array-silicon on insulator) process-based binary convolution neural network in-memory computing accelerator
CN109871939B (en) * 2019-01-29 2021-06-15 深兰人工智能芯片研究院(江苏)有限公司 Image processing method and image processing device
CN109615067B (en) * 2019-03-05 2019-05-21 深兰人工智能芯片研究院(江苏)有限公司 A kind of data dispatching method and device of convolutional neural networks
TWI696129B (en) * 2019-03-15 2020-06-11 華邦電子股份有限公司 Memory chip capable of performing artificial intelligence operation and operation method thereof
CN110032374B (en) * 2019-03-21 2023-04-07 深兰科技(上海)有限公司 Parameter extraction method, device, equipment and medium
CN110084363B (en) * 2019-05-15 2023-04-25 电科瑞达(成都)科技有限公司 Deep learning model acceleration method based on FPGA platform
CN110223687B (en) * 2019-06-03 2021-09-28 Oppo广东移动通信有限公司 Instruction execution method and device, storage medium and electronic equipment
CN110209627A (en) * 2019-06-03 2019-09-06 山东浪潮人工智能研究院有限公司 A kind of hardware-accelerated method of SSD towards intelligent terminal
CN110727634B (en) * 2019-07-05 2021-10-29 中国科学院计算技术研究所 Embedded intelligent computer system for object-side data processing
CN110458279B (en) * 2019-07-15 2022-05-20 武汉魅瞳科技有限公司 FPGA-based binary neural network acceleration method and system
CN110472442A (en) * 2019-08-20 2019-11-19 厦门理工学院 A kind of automatic detection hardware Trojan horse IP kernel
TWI724515B (en) * 2019-08-27 2021-04-11 聯智科創有限公司 Machine learning service delivery method
CN110619387B (en) * 2019-09-12 2023-06-20 复旦大学 Channel expansion method based on convolutional neural network
CN110689088A (en) * 2019-10-09 2020-01-14 山东大学 CNN-based LIBS ore spectral data classification method and device
CN110910434B (en) * 2019-11-05 2023-05-12 东南大学 Method for realizing deep learning parallax estimation algorithm based on FPGA (field programmable Gate array) high energy efficiency
CN110991632B (en) * 2019-11-29 2023-05-23 电子科技大学 Heterogeneous neural network calculation accelerator design method based on FPGA
CN110880038B (en) * 2019-11-29 2022-07-01 中国科学院自动化研究所 System for accelerating convolution calculation based on FPGA and convolution neural network
CN111008629A (en) * 2019-12-07 2020-04-14 怀化学院 Cortex-M3-based method for identifying number of tip
CN111310921B (en) * 2020-03-27 2022-04-19 西安电子科技大学 FPGA implementation method of lightweight deep convolutional neural network
CN111667053B (en) * 2020-06-01 2023-05-09 重庆邮电大学 Forward propagation calculation acceleration method of convolutional neural network accelerator
CN111832718B (en) * 2020-06-24 2021-08-03 上海西井信息科技有限公司 Chip architecture
CN111860773B (en) * 2020-06-30 2023-07-28 北京百度网讯科技有限公司 Processing apparatus and method for information processing
CN112508184B (en) * 2020-12-16 2022-04-29 重庆邮电大学 Design method of fast image recognition accelerator based on convolutional neural network
CN113012689B (en) * 2021-04-15 2023-04-07 成都爱旗科技有限公司 Electronic equipment and deep learning hardware acceleration method
CN113762491B (en) * 2021-08-10 2023-06-30 南京工业大学 Convolutional neural network accelerator based on FPGA
CN114546484A (en) * 2022-02-21 2022-05-27 山东浪潮科学研究院有限公司 Deep convolution optimization method, system and device based on micro-architecture processor
CN116718894B (en) * 2023-06-19 2024-03-29 上饶市广强电子科技有限公司 Circuit stability test method and system for corn lamp

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7882164B1 (en) * 2004-09-24 2011-02-01 University Of Southern California Image convolution engine optimized for use in programmable gate arrays
CN104035750A (en) * 2014-06-11 2014-09-10 西安电子科技大学 Field programmable gate array (FPGA)-based real-time template convolution implementing method
CN105046681A (en) * 2015-05-14 2015-11-11 江南大学 Image salient region detecting method based on SoC
CN105469039A (en) * 2015-11-19 2016-04-06 天津大学 Target identification system based on AER image sensor
CN105491269A (en) * 2015-11-24 2016-04-13 长春乙天科技有限公司 High-fidelity video amplification method based on deconvolution image restoration
CN105678379A (en) * 2016-01-12 2016-06-15 腾讯科技(深圳)有限公司 CNN processing method and device
CN105678378A (en) * 2014-12-04 2016-06-15 辉达公司 Indirectly accessing sample data to perform multi-convolution operations in parallel processing system
CN105740773A (en) * 2016-01-25 2016-07-06 重庆理工大学 Deep learning and multi-scale information based behavior identification method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9622041B2 (en) * 2013-03-15 2017-04-11 DGS Global Systems, Inc. Systems, methods, and devices for electronic spectrum management

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7882164B1 (en) * 2004-09-24 2011-02-01 University Of Southern California Image convolution engine optimized for use in programmable gate arrays
CN104035750A (en) * 2014-06-11 2014-09-10 西安电子科技大学 Field programmable gate array (FPGA)-based real-time template convolution implementing method
CN105678378A (en) * 2014-12-04 2016-06-15 辉达公司 Indirectly accessing sample data to perform multi-convolution operations in parallel processing system
CN105046681A (en) * 2015-05-14 2015-11-11 江南大学 Image salient region detecting method based on SoC
CN105469039A (en) * 2015-11-19 2016-04-06 天津大学 Target identification system based on AER image sensor
CN105491269A (en) * 2015-11-24 2016-04-13 长春乙天科技有限公司 High-fidelity video amplification method based on deconvolution image restoration
CN105678379A (en) * 2016-01-12 2016-06-15 腾讯科技(深圳)有限公司 CNN processing method and device
CN105740773A (en) * 2016-01-25 2016-07-06 重庆理工大学 Deep learning and multi-scale information based behavior identification method

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
A Multistage Dataflow Implementation of a Deep Convolutional Neural Network Based on FPGA For High-Speed Object Recognition; Li, Ning et al.; 2016 IEEE Southwest Symposium on Image Analysis and Interpretation; 20160308; pp. 165-168 *
Design space exploration of FPGA-based Deep Convolutional Neural Networks; Mohammad Motamedi et al.; 2016 21st Asia and South Pacific Design Automation Conference; 20160310; full text *
Fast Pipeline 128×128 pixel spiking convolution core for event-driven vision processing in FPGAs; Yousefzadeh, A et al.; 2015 First International Conference on Event-Based Control, Communication and Signal Processing; 20151231; full text *
FPGA implementation of a novel 2-D convolver; Sang Hongshi et al.; Microelectronics & Computer; 20110930; full text *
Design and implementation of an FPGA-based image convolution IP core; Zhu Xueliang et al.; Microelectronics & Computer; 20160630; pp. 188-192 *
A new FPGA implementation method for spatial template convolution filtering algorithms; Li Ming et al.; Computer Applications and Software; 20100831; pp. 17-18 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11880759B2 (en) 2020-02-18 2024-01-23 Stmicroelectronics S.R.L. Vector quantization decoding hardware unit for real-time dynamic decompression for parameters of neural networks
US11836608B2 (en) 2020-06-23 2023-12-05 Stmicroelectronics S.R.L. Convolution acceleration with embedded vector decompression

Also Published As

Publication number Publication date
CN106228240A (en) 2016-12-14

Similar Documents

Publication Publication Date Title
CN106228240B (en) Deep convolution neural network implementation method based on FPGA
US11574195B2 (en) Operation method
Chaurasia et al. Linknet: Exploiting encoder representations for efficient semantic segmentation
EP3407266B1 (en) Artificial neural network calculating device and method for sparse connection
CN109934331B (en) Apparatus and method for performing artificial neural network forward operations
CN111459877B (en) Winograd YOLOv2 target detection model method based on FPGA acceleration
CN106991477B (en) Artificial neural network compression coding device and method
CN107340993B (en) Arithmetic device and method
CN108108809B (en) Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
US20180018555A1 (en) System and method for building artificial neural network architectures
CN108763159A (en) To arithmetic accelerator before a kind of LSTM based on FPGA
CN110991631A (en) Neural network acceleration system based on FPGA
US11775832B2 (en) Device and method for artificial neural network operation
Gupta et al. FPGA implementation of simplified spiking neural network
Liau et al. Fire SSD: Wide fire modules based single shot detector on edge device
CN111583094A (en) Image pulse coding method and system based on FPGA
CN113762493A (en) Neural network model compression method and device, acceleration unit and computing system
CN109145107A (en) Subject distillation method, apparatus, medium and equipment based on convolutional neural networks
CN109685208B (en) Method and device for thinning and combing acceleration of data of neural network processor
Sommer et al. Efficient hardware acceleration of sparsely active convolutional spiking neural networks
WO2022028232A1 (en) Device and method for executing lstm neural network operation
Al Maashri et al. A hardware architecture for accelerating neuromorphic vision algorithms
Nathan et al. Skeletonnetv2: A dense channel attention blocks for skeleton extraction
CN112988229B (en) Convolutional neural network resource optimization configuration method based on heterogeneous computation
JP2022541712A (en) Neural network training method, video recognition method and apparatus

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant