CN108154229B: Image processing method based on FPGA (field-programmable gate array) accelerated convolutional neural network framework


Info

Publication number: CN108154229B (granted patent; earlier publication CN108154229A)
Application number: CN201810022870.7A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 王坚灿, 董刚, 杨银堂
Applicant and assignee: Xidian University
Legal status: Active (granted)
Prior art keywords: picture, resource, block ram, size, data


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation using electronic means
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods


Abstract

The invention discloses an image processing method based on an FPGA (field-programmable gate array) accelerated convolutional neural network framework, which mainly addresses the low resource utilization and low speed of the prior art. The scheme is as follows: 1) calculate a picture segmentation fixed value from the designed picture parameters and the FPGA resource parameters; 2) determine the number of DDR3 memories from the picture segmentation fixed value and allocate block ram resources; 3) construct a convolutional neural network framework according to 1) and 2), comprising a picture storage module, a picture data distribution module, a convolution module, a pooling module, a module for storing pictures back to DDR3, and an instruction register group; 4) all modules obtain control instructions from the instruction register group through handshake signals, cooperate with one another, and process the image data according to the control instructions. By accelerating the convolutional neural network framework on the FPGA, the invention improves resource utilization and the acceleration effect, and can be used for image classification, target recognition, speech recognition and natural language processing.

Description

Image processing method based on FPGA (field programmable Gate array) accelerated convolutional neural network framework
Technical Field
The invention belongs to the technical field of computer design, and particularly relates to a convolutional neural network implementation method which can be used for image classification, target recognition, voice recognition and natural language processing.
Background
With the progress of integrated circuit design and manufacturing processes, field programmable gate arrays with high-speed, high-density programmable logic resources have developed rapidly, and single-chip integration keeps increasing. To further improve FPGA performance, mainstream chip manufacturers integrate customized DSP computing units with high-speed digital signal processing capability inside the chip; a DSP hard core can implement fixed-point operations efficiently and at low cost, so FPGAs are widely used in application fields such as video and image processing, network communication and information security, and bioinformatics.
The convolutional neural network (CNN) is an artificial neural network structure widely applied in image classification, target recognition, speech recognition, natural language processing and other fields. In recent years, with the great improvement of computing capability and the development of neural network structures, the performance and accuracy of CNNs have improved substantially, but the demand on the parallel computing capability of the operation units keeps growing, so GPUs (graphics processing units) and FPGAs with parallel computing capability have become the mainstream direction.
A configurable computing architecture based on an FPGA can exploit the parallelism of the artificial neural network and change the weights and topology of the convolutional neural network through configuration. An artificial neural network realized on an FPGA has the flexibility of software design while approaching an application-specific integrated circuit (ASIC) in computing performance; meanwhile, the on-chip programmable routing resources enable efficient interconnection, so the FPGA is an important choice for hardware implementation of artificial neural networks.
Current patents and research directions mostly build on the OpenCL programming language, with the aim of reducing the time needed to convert a convolutional neural network algorithm into a hardware description language, but they do not accelerate the hardware description language code of the FPGA algorithm itself. Moreover, OpenCL is not the language that actually runs on the FPGA, so the actual running speed of the FPGA is not ideal. The prior art based on OpenCL programming focuses mainly on accelerating the DSP module in the FPGA; it neither implements the convolutional neural network algorithm as a whole nor optimizes the underlying hardware description language, so the FPGA computing resources cannot be fully utilized, the computation time increases, and the acceleration effect is not obvious.
Disclosure of Invention
The object of the invention is to provide a convolutional neural network implementation method based on FPGA acceleration, which implements the convolutional neural network entirely in a hardware description language, optimizes the underlying hardware description language, makes full use of the FPGA operation resources, and maximizes the FPGA acceleration effect.
In order to achieve the purpose, the technical scheme of the invention comprises the following steps:
(1) parameter processing:
1a) reading the picture and FPGA board resource parameters input by the user, wherein the resource parameters comprise: the picture size N, the total block ram resource S_sum, the number P of DDR3 synchronous dynamic random access memories, and the number A of DSP computing units;
1b) designing the FPGA operation frequency f, the convolution kernel size m, the number of convolution layers J, the number of channels T, the number of pooling layers C, the number of activation function layers E, the number of multi-classification softmax layers G, the softmax layer input number I_in, the softmax layer output number I_out, the number of fully connected layers Q, a pooling function and an activation function;
1c) calculating a size value set X of each layer of pictures, a maximum convolution parallelizable number L, a theoretical operation speed bandwidth D and a theoretical data transmission bandwidth Z according to the data read in the step 1a) and the parameters designed in the step 1 b);
(2) fixed values for picture segmentation are calculated:
2a) calculating a common divisor M of the size of each layer of picture according to the value set X of the size of each layer of picture obtained by calculation in the step (1);
2b) according to the common divisor obtained in 2a) and the total block ram resource S_sum read in step (1), calculating the picture common divisor C satisfying the FPGA block ram resource limit;
2c) according to the resource-limited common divisors obtained in 2b) and the DSP resources read in step (1), calculating the greatest common divisor satisfying the DSP resource limit as the picture segmentation fixed value n;
(3) determining the number of DDR3:
calculating an actual data transmission bandwidth H according to the picture segmentation fixed value n, and comparing the actual data transmission bandwidth H with a theoretical data transmission bandwidth Z:
if H > Z, the number B of DDR3 is determined as 2 or 1 + 2j, where j is an integer ≥ 1;
if H ≤ Z, the number B of DDR3 is determined as 3 or 1 + 4i, where i is an integer ≥ 1 and i ≠ j;
(4) Resource allocation is carried out on the block ram on the FPGA:
4a) calculating the picture storage block ram resource S_pic according to the picture segmentation fixed value n determined in step (2) and the number of channels T in step (1);
4b) according to the picture storage block ram resource S_pic of 4a) and the total block ram resource S_sum of (1), calculating the remaining block ram storage resource S_last and the largest parameter storage block ram resource S_ne, and comparing their sizes: if S_last ≥ S_ne, take S_ne as the parameter storage block ram resource S_par; if S_last < S_ne, subtract 0.5 Mbit from S_last as the parameter storage block ram resource S_par;
(5) constructing a convolutional neural network framework and processing the input picture by combining the parameters of 1a), 1b), 2c), (3), 4a) and 4b):
5a) setting a picture storage module which, according to the picture segmentation fixed value n of 2c), the number of convolution layers J and channels T of 1b), the picture storage block ram resource S_pic of 4a) and the DDR3 number B of (3), fetches the pixel points of the input picture from DDR3 and stores them;
5b) setting a picture data distribution module which, according to the picture segmentation fixed value n of 2c), the picture storage block ram resource S_pic of 4a) and the parameter storage block ram resource S_par of 4b), distributes the picture data stored in 5a);
5c) setting a convolution module which, according to the picture segmentation fixed value n of 2c), performs convolution calculation on the picture data distributed in 5b);
5d) setting a pooling module which, according to the pooling function of 1b), pools the picture data after the convolution calculation of 5c);
5e) setting a picture store-back module which, according to the DDR3 number B of (3) and the picture segmentation fixed value n of 2c), stores the pooled picture data of 5d) back into DDR3;
5f) setting an instruction register group module which, according to the picture size N of 1a), the convolution kernel size m, number of convolution layers J, number of pooling layers C, number of activation function layers E, number of softmax layers G, softmax layer input number I_in, softmax layer output number I_out and fully connected layer output value Q of 1b), and the picture segmentation size n of 2c), constructs control instructions and distributes them to the modules set in 5a), 5b), 5c), 5d) and 5e).
Compared with the prior art, the invention has the following advantages:
1. The invention realizes an FPGA-accelerated convolutional neural network framework in a hardware description language.
2. Through the picture segmentation fixed value n of the parameter processing and the pipeline structure of the convolution module, the invention keeps the largest possible number of DSP resources busy with uninterrupted convolution calculation; this uninterrupted calculation maximizes DSP utilization and transmission efficiency, realizing the acceleration effect of the convolutional neural network framework.
3. By segmenting the picture in the picture storage module, the invention keeps the DDR3 transmission bandwidth at its maximum value, realizing maximum transmission efficiency.
4. By changing the design parameters of the parameter processing through the instruction register group module, convolutional neural networks with different picture sizes N and different convolution layer numbers J can be realized.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
fig. 2 is a diagram of simulation results of an embodiment of the present invention.
Detailed Description
The embodiments and effects of the present invention will be described in detail below with reference to the accompanying drawings;
Step 1, parameter processing.
1.1) reading the pictures and FPGA board resource parameters input by the user, wherein the FPGA parameters comprise: the picture size N, the total block ram resource S_sum, the number P of DDR3 synchronous dynamic random access memories, and the number A of DSP computing units;
1.2) design parameters, including: the FPGA operation frequency f, the convolution kernel size m, the number of convolution layers J, the number of channels T, the number of pooling layers C, the number of activation function layers E, the number of multi-classification softmax layers G, the softmax layer input number I_in, the softmax layer output number I_out, the number of fully connected layers Q, a pooling function and an activation function;
1.3) the computer calculates, from the read parameter values, the set X of per-layer picture size values, the maximum convolution parallelizable number L, the theoretical operation speed bandwidth, and the theoretical data transmission bandwidth:
1.3a) solving the set of per-layer picture size values X by the following formula:
X = N/2^i + 2, i = 0, 1, 2, …
wherein N is the picture size of 1.1), and X and i are integers;
1.3b) finding the maximum parallelizable convolution number L by the following formula:
L = ⌊A/m²⌋,
wherein A is the DSP resource number of 1.1), and m is the convolution kernel size of 1.2);
1.3c) solving the maximum operation speed bandwidth D by the following formula:
D = f × m² × 32 × L,
wherein f is the FPGA operation frequency of 1.2), m is the convolution kernel size of 1.2), and L is the maximum parallel number of 1.3 b);
1.3d) solving the data transmission bandwidth Z by the following formula:
Z=4×(P-1),
wherein P is the number of DDR3 of 1.1);
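As a sketch, the step-1 calculations can be written out with the example values of the simulation section (N = 224, A = 424, m = 3, f = 150 MHz, P = 3). The floor in L is an assumption consistent with the operation-bandwidth formula D = f × m² × 32 × L; the patent gives the formula for L only as an embedded image.

```python
def derive_parameters(N, A, m, f, P):
    # 1.3a) per-layer picture sizes X = N / 2**i + 2, integer values only
    X, i = [], 0
    while N % (2 ** i) == 0:
        X.append(N // (2 ** i) + 2)
        i += 1
    # 1.3b) maximum parallelizable convolution count: m*m DSPs per kernel
    L = A // (m * m)
    # 1.3c) theoretical operation speed bandwidth for 32-bit data (bit/s)
    D = f * m * m * 32 * L
    # 1.3d) theoretical data transmission bandwidth
    Z = 4 * (P - 1)
    return X, L, D, Z

# example values from the simulation section of this patent
X, L, D, Z = derive_parameters(N=224, A=424, m=3, f=150_000_000, P=3)
```

With these inputs the sketch yields six per-layer sizes (226 down to 9) and L = 47 parallel kernels.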
Step 2, calculating the picture segmentation fixed value.
2.1) solving the common divisor M of the size of each layer of picture by the following formula:
M=GCD(X)
wherein X is the set of per-layer picture size values of 1.3a), and GCD() denotes the set of common divisors of all elements of X;
2.2) solving the picture common divisor C satisfying the block ram resource limit by the following formula:
C = max(M)
s.t. 2 × M × m × T × 32 ≤ S_sum
wherein M is the set of common divisors of the per-layer picture sizes of 2.1), T is the number of channels of 1.2), m is the convolution kernel size of 1.2),
S_sum is the total block ram resource of 1.1), and max() takes the maximum value;
2.3) solving the picture segmentation fixed value n satisfying the DSP resource limit by the following formula:
n = max(C), subject to n < L,
wherein C is the picture common divisor of 2.2), and L is the maximum parallelizable number of 1.3b).
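Step 2 can be sketched as a search over the common divisors of the picture sizes. The block ram inequality used below (2 × c × m × T × 32 ≤ S_sum) is an assumption, since the source shows that constraint only as an embedded image; the set of common divisors of X is computed as the divisors of gcd(X).

```python
from functools import reduce
from math import gcd

def segmentation_fixed_value(X, S_sum, T, m, L):
    # 2.1) the common divisors of all sizes in X are the divisors of gcd(X)
    M = reduce(gcd, X)
    divisors = [c for c in range(1, M + 1) if M % c == 0]
    # 2.2) keep divisors that fit the block ram limit (assumed inequality)
    C = [c for c in divisors if 2 * c * m * T * 32 <= S_sum]
    # 2.3) keep divisors below the DSP limit L and take the maximum
    return max(c for c in C if c < L)
```

The function raises an error when no divisor satisfies both limits, which in hardware would mean the design parameters must be revised.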
Step 3, determining the number of DDR3.
3.1) finding the actual data transmission bandwidth H by the following formula:
H = n² × 32 × max(T),
wherein n is a fixed value for picture segmentation of 2.3), and T is the number of channels of 1.2);
3.2) comparing the actual data transmission bandwidth H with the theoretical data transmission bandwidth Z, and calculating the number B of DDR 3:
if H > Z, determining the number B of DDR3 to be 2 or 1+2j, j being an integer greater than or equal to 1;
if H ≤ Z, the number B of DDR3 is determined as 3 or 1 + 4i, where i is an integer ≥ 1 and i ≠ j, and Z is the theoretical data transmission bandwidth of 1.3d).
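A sketch of the step-3 decision: since j and i are free integer parameters in the patent, the smallest admissible B is returned in each branch.

```python
def ddr3_count(n, T_max, Z):
    # 3.1) actual data transmission bandwidth for n*n 32-bit pixels
    H = n * n * 32 * T_max
    # 3.2) compare against the theoretical bandwidth Z
    if H > Z:
        return 2    # smallest member of {2} and {1 + 2j, j >= 1}
    return 3        # smallest member of {3} and {1 + 4i, i >= 1}
```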
Step 4, allocating block ram resources on the FPGA.
4.1) solving the picture storage block ram resource S_pic by the following formula:
S_pic = max(M) × max(T) × 32,
wherein M is the set of common divisors of the per-layer picture sizes of 2.1), and T is the number of channels of 1.2);
4.2) solving the remaining block ram storage resource S_last by the following formula:
S_last = S_sum − 2 × S_pic,
wherein S_pic is the picture storage block ram resource of 4.1), and S_sum is the total FPGA block ram resource of 1.1);
4.3) obtaining the parameter storage block ram resource S_ne:
4.3a) solving the intermediate variable U from the picture segmentation fixed value n of 2.3), the set X of per-layer picture size values of 1.3a), and the number of channels T of 1.2), where max() takes the maximum value (the formula for U appears only as an embedded image in the original);
4.3b) solving the parameter storage block ram resource S_ne from the intermediate variable U of 4.3a), the total block ram resource S_sum of 1.1), the picture storage block ram resource S_pic of 4.1), and the remaining block ram storage resource S_last of 4.2) (this formula likewise appears only as an embedded image in the original);
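The step-4 allocation can be sketched as follows. Because the patent's formulas for U and S_ne survive only as embedded images, S_ne is treated here as a precomputed input; all sizes are in bits, and 0.5 Mbit is assumed to mean 5 × 10⁵ bits.

```python
def allocate_block_ram(M_max, T_max, S_sum, S_ne):
    S_pic = M_max * T_max * 32      # 4.1) picture storage block ram
    S_last = S_sum - 2 * S_pic      # 4.2) remainder after double buffering
    if S_last >= S_ne:              # 4b) parameter storage block ram S_par
        S_par = S_ne
    else:
        S_par = S_last - 500_000    # subtract 0.5 Mbit
    return S_pic, S_last, S_par
```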
and 5, setting a picture storage module.
5.1) dividing the B DDR3 into two parts: B−1 DDR3 store the picture pixel points and the remaining 1 DDR3 stores parameters, where B is the DDR3 number of 3.2);
5.2) from each of the B−1 DDR3, taking picture pixel points in a matrix of length n and a width given by a formula that appears only as an embedded image in the original, T times in total; the pixel start address begins at 0, increases by n−1 after every T fetches, and returns to 0 once the picture has been fetched completely, where T is the number of channels of 1.2) and n is the picture segmentation fixed value of 2.3);
5.3) storing the picture pixel points fetched from DDR3 into block ram resources of size S_pic, with the store address increasing by 1 from 0, where S_pic is the picture storage block ram resource of 4.1);
5.4) repeating steps 5.2) to 5.3) J times, where J is the number of convolution layers of 1.2).
Step 6, setting the picture data distribution module.
6.1) constructing a register group of m × (n+1): the first m × n registers form the calculation group and the last m × 1 column forms the cache group, where n is the picture segmentation fixed value of 2.3) and m is the convolution kernel size of 1.2);
6.2) taking picture data in a matrix of length m and width n from the picture storage block ram resource and storing it in the calculation group constructed in 6.1); the data start address begins at 0 and increases by m after each fetch, where m is the convolution kernel size of 1.2) and n is the picture segmentation fixed value of 2.3).
6.3) each time, the calculation group outputs picture data of length and width n to the convolution module, while picture data of length m and width 1 is fetched from the picture storage block ram into the cache group, the address starting from 0 and incrementing by 1 each time; after the calculation group has produced m−1 outputs, the register data of its first row is discarded, the second row is assigned to the first row, the third row to the second row, and so on, each remaining row being assigned to the row above it, where m is the convolution kernel size of 1.2).
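The row-shift behaviour of the step-6 register group can be sketched in software, with Python lists standing in for the hardware registers (the cache-group addressing details are simplified):

```python
def shift_register_group(calc_group, new_row):
    # calc_group: m rows of n pixel values (the m-by-n calculation group);
    # new_row: the freshly buffered row coming from the cache group.
    # Discard row 1, move every row up one place, append the new row.
    return calc_group[1:] + [new_row]
```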
Step 7, setting the convolution module.
Input the matrix picture data of length and width n from step 6 into n² DSPs, multiply two by two and, using a pipeline structure, add adjacent products two by two to complete the convolution calculation; input the convolution result into the pooling module, where n is the picture segmentation fixed value of 2.3);
The pipeline structure means that, while the system processes data, a new data-processing instruction is accepted on every clock pulse.
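The multiply-then-add-adjacent-pairs datapath of step 7 can be sketched as an adder tree over one m × m window (here flattened to a list); each pass of the while loop corresponds to one pipeline stage.

```python
def conv_window(pixels, weights):
    # pixels, weights: the m*m window values, flattened to equal-length lists
    products = [p * w for p, w in zip(pixels, weights)]   # one multiply per DSP
    # add adjacent values two by two until a single sum remains
    while len(products) > 1:
        if len(products) % 2:
            products.append(0)      # pad an odd stage with a zero term
        products = [products[i] + products[i + 1]
                    for i in range(0, len(products), 2)]
    return products[0]
```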
Step 8, setting the pooling module.
Acquire the picture data input in step 7 and, in input order, subtract each pair within every group of 4 picture data, obtaining 6 results; judge whether the highest bit of each of the 6 results is 1:
if it is 1, the minuend is eliminated;
if it is 0, the subtrahend is eliminated. After the 6 results are processed in sequence, the remaining picture datum is the maximum of the 4 picture data, and it is passed to the module that stores pictures back to DDR3.
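The subtraction-and-sign-bit pooling rule of step 8 can be sketched as follows. A 32-bit two's-complement word width is an assumption; both subtraction orders are checked, which is equivalent to the patent's 6 pairwise results.

```python
BITS = 32  # assumed pixel word width (two's complement)

def max_of_four(p):
    # p: the 4 picture data of one pooling window, in input order
    alive = [True] * 4
    for i in range(4):
        for j in range(4):
            if i == j:
                continue
            # hardware-style wrapped difference; the highest bit is the sign
            d = (p[i] - p[j]) & ((1 << BITS) - 1)
            if d >> (BITS - 1):     # p[i] - p[j] < 0: p[i] is not the maximum
                alive[i] = False
    # only the maxima survive the eliminations; return the first survivor
    return [v for v, keep in zip(p, alive) if keep][0]
```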
Step 9, setting the module that stores pictures back to DDR3.
Store the picture data of step 8 in the block ram resource S_pic; from S_pic, take picture data of length n and a width given by a formula that appears only as an embedded image in the original, and store it back into DDR3, where the picture data address starts at 0 and increments by 1 each time, the DDR3 store address starts at 0 and increments by 8 each time, n is the picture segmentation fixed value of 2.3), and B is the DDR3 number of 3.2).
Step 10, setting an instruction register group.
10.1) constructing a register group of length 128 and width J + C + G + Q + 1 to store control instructions, where J is the number of convolution layers of 1.2), C is the number of pooling layers of 1.2), G is the number of softmax layers of 1.2), and Q is the fully connected layer output value of 1.2);
10.2) constructing a 128-bit binary control instruction with the following fields in order from high to low: a 10-bit input picture size N, an 8-bit picture segmentation size n, a 4-bit convolution kernel size m, a 6-bit number of convolution layers J, a 6-bit number of pooling layers C, a 4-bit number of activation function layers E, a 4-bit number of softmax layers G, a 16-bit softmax layer input number I_in, a 16-bit softmax layer output number I_out, and a 54-bit fully connected layer output value Q, where N is the input picture size of 1.1), n is the picture segmentation fixed value of 2.3), m is the convolution kernel size of 1.2), J is the number of convolution layers of 1.2), C is the number of pooling layers of 1.2), E is the number of activation function layers of 1.2), G is the number of softmax layers of 1.2), I_in is the softmax layer input number of 1.2), I_out is the softmax layer output number of 1.2), and Q is the fully connected layer output value of 1.2);
10.3) transmitting control instructions to the modules arranged in the steps 5-9 at the same time through handshake signals.
A handshake signal means that before two modules communicate, they must acknowledge each other's enable signals; only then can they transmit data to each other.
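The 128-bit instruction layout of step 10.2 can be sketched as a bit-packing routine; the field widths sum to exactly 128 bits (10+8+4+6+6+4+4+16+16+54). The example values loosely follow the simulation section, with n = 2 chosen purely for illustration.

```python
# field name and bit width, most significant field first (step 10.2)
FIELDS = [("N", 10), ("n", 8), ("m", 4), ("J", 6), ("C", 6),
          ("E", 4), ("G", 4), ("I_in", 16), ("I_out", 16), ("Q", 54)]

def pack_instruction(values):
    word = 0
    for name, width in FIELDS:
        v = values[name]
        assert 0 <= v < (1 << width), f"{name} does not fit in {width} bits"
        word = (word << width) | v
    return word     # one 128-bit control instruction as a Python integer

instr = pack_instruction({"N": 224, "n": 2, "m": 3, "J": 8, "C": 8, "E": 8,
                          "G": 2, "I_in": 5120, "I_out": 1024, "Q": 100})
```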
The effects of the present invention can be further illustrated by the following simulations.
1. Simulation conditions
The simulation uses an FPGA platform from Unigroup Tongchuang (紫光同创, Pango Microsystems), model PGT180H;
reading FPGA resource parameters and design parameters input by a user by a computer:
the FPGA resource parameters comprise: picture size N224, total block ram resource Ssum9.2M, the number P of the sdram DDR3 is 3 and the number a of the computing function chips DSP is 424.
The design parameters comprise: FPGA operation frequency f = 150 MHz, convolution kernel size m = 3, number of convolution layers J = 8, number of channels T = 524, number of pooling layers C = 8, number of activation function layers E = 8, number of multi-classification softmax layers G = 2, softmax layer input number I_in = 5120, softmax layer output number I_out = 1024, number of fully connected layers Q = 100; the pooling function is max pooling, and the activation function is the rectified linear unit (ReLU);
2. Simulation content
Simulation 1: using ModelSim software and the above parameters, the method of the invention processes an input image of length and width N = 224 and pixel size 1 at a clock frequency of 150 MHz, giving the convolution module output shown in Fig. 2.
As can be seen from Fig. 2, the convolution module output values are 9, 36, 81 and 144, which match the result of the convolutional neural network algorithm on a CPU with the same input image and design parameters, verifying that the method correctly implements the convolutional neural network structure.
Simulation 2: using ModelSim software and the above parameters, the method of the invention processes an input picture of length and width N = 224 at a clock frequency of 150 MHz, giving the FPGA simulation time and resource utilization shown in Table 1.
As can be seen from Table 1 below, the simulation time is 0.4 s, the maximum clock frequency is 187 MHz, and the DDR3 bandwidth is 3.73 Gbit/s, maximizing transmission efficiency; 390 DSPs are used, a resource utilization of 87%; 9.1 Mbit of block ram is used, a resource utilization of 98.9%; the final computation speed is 6.2 Gbit/s, realizing the acceleration effect of the convolutional neural network framework.
TABLE 1

Simulation time:            0.4 s
Maximum clock frequency:    187 MHz
DDR3 bandwidth:             3.73 Gbit/s
DSPs used:                  390 (87% utilization)
Block ram used:             9.1 Mbit (98.9% utilization)
Computation speed:          6.2 Gbit/s

Claims (13)

1. The image processing method based on the FPGA accelerated convolutional neural network framework comprises the following steps:
(1) parameter processing:
1a) reading the picture and FPGA board resource parameters input by the user, wherein the resource parameters comprise: the picture size N, the total block ram resource S_sum, the number P of DDR3 synchronous dynamic random access memories, and the number A of DSP computing units;
1b) designing the FPGA operation frequency f, the convolution kernel size m, the number of convolution layers J, the number of channels T, the number of pooling layers D, the number of activation function layers E, the number of multi-classification softmax layers G, the softmax input number I_in, the softmax layer output number I_out, the number of fully connected layers Q, a pooling function and an activation function;
1c) calculating a size value set X of each layer of pictures, a maximum convolution parallelizable number L, a theoretical operation speed bandwidth D and a theoretical data transmission bandwidth Z according to the data read in the step 1a) and the parameters designed in the step 1 b);
(2) fixed values for picture segmentation are calculated:
2a) calculating a common divisor M of the size of each layer of picture according to the value set X of the size of each layer of picture obtained by calculation in the step (1);
2b) according to the common divisor obtained in 2a) and the total block ram resource S_sum read in step (1), calculating the picture common divisor C satisfying the FPGA block ram resource limit;
2c) calculating the maximum common divisor meeting the DSP resource limit as a picture segmentation fixed value n according to the resource limit common divisor obtained in the step 2b) and the DSP resource read in the step (1);
(3) determining the number of DDR3:
calculating an actual data transmission bandwidth H according to the picture segmentation fixed value n, and comparing the actual data transmission bandwidth H with a theoretical data transmission bandwidth Z:
if H > Z, the number B of DDR3 is determined as 2 or 1 + 2j, where j is an integer ≥ 1;
if H ≤ Z, the number B of DDR3 is determined as 3 or 1 + 4i, where i is an integer ≥ 1 and i ≠ j;
(4) Resource allocation is carried out on the block ram on the FPGA:
4a) calculating the picture storage block ram resource S_pic according to the picture segmentation fixed value n determined in step (2) and the number of channels T in step (1);
4b) according to the picture storage block ram resource S_pic of 4a) and the total block ram resource S_sum of (1), calculating the remaining block ram storage resource S_last and the largest parameter storage block ram resource S_ne, and comparing their sizes: if S_last ≥ S_ne, take S_ne as the parameter storage block ram resource S_par; if S_last < S_ne, subtract 0.5 Mbit from S_last as the parameter storage block ram resource S_par;
(5) constructing a convolutional neural network framework and processing the input picture by combining the parameters of 1a), 1b), 2c), (3), 4a) and 4b):
5a) setting a picture storage module which, according to the picture segmentation fixed value n of 2c), the number of convolution layers J and channels T of 1b), the picture storage block ram resource S_pic of 4a) and the DDR3 number B of (3), fetches the pixel points of the input picture from DDR3 and stores them;
5b) setting a picture data distribution module which, according to the picture segmentation fixed value n of 2c), the picture storage block ram resource S_pic of 4a) and the parameter storage block ram resource S_par of 4b), distributes the picture data stored in 5a);
5c) setting a convolution module which, according to the picture segmentation fixed value n of 2c), performs convolution calculation on the picture data distributed in 5b);
5d) setting a pooling module which, according to the pooling function of 1b), pools the picture data after the convolution calculation of 5c);
5e) setting a picture store-back module which, according to the DDR3 number B of (3) and the picture segmentation fixed value n of 2c), stores the pooled picture data of 5d) back into DDR3;
5f) setting an instruction register group module which, according to the picture size N of 1a), the convolution kernel size m, number of convolution layers J, number of pooling layers D, number of activation function layers E, number of softmax layers G, softmax layer input number I_in, softmax layer output number I_out and fully connected layer output value Q of 1b), and the picture segmentation size n of 2c), constructs control instructions and distributes them to the modules set in 5a), 5b), 5c), 5d) and 5e).
2. The method according to claim 1, wherein in step 1c), the set X of per-layer picture size values, the maximum convolution parallelizable number L, the theoretical operation speed bandwidth D and the theoretical data transmission bandwidth Z are calculated, according to the data read in 1a) and the parameters designed in 1b), by the following formulas:
X = N/2^i + 2, i = 0, 1, 2, ...
L = ⌊A/m²⌋
D = f×m²×32×L
Z = 4×(P−1)
wherein N is the picture size, L is the maximum parallelizable number, A is the number of DSP resources, m is the convolution kernel size, f is the FPGA operating frequency, P is the number of DDR3, and X and i are integers.
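The per-layer sizes and bandwidths of claim 2 can be sketched in software. This is a minimal illustration, not the patented implementation: the equation image for L is missing from the page and is assumed here to be L = ⌊A/m²⌋, i.e. one group of m×m DSP multipliers per parallel convolution window.

```python
def layer_sizes(N):
    """Set X of per-layer picture sizes: X = N/2**i + 2 for i = 0, 1, 2, ...
    while N/2**i stays an integer (the halving models 2x2 pooling; the +2
    is assumed to be the convolution border)."""
    sizes, i = [], 0
    while N % (2 ** i) == 0:
        sizes.append(N // (2 ** i) + 2)
        i += 1
    return sizes

def bandwidths(A, m, f, P):
    """L, D, Z from claim 2; L = A // m**2 is a reconstruction of the
    missing equation image (m*m DSP multipliers per convolution window)."""
    L = A // (m * m)
    D = f * m * m * 32 * L      # theoretical operation speed bandwidth, bit/s
    Z = 4 * (P - 1)             # theoretical data transmission bandwidth
    return L, D, Z
```

For a 224-pixel input, `layer_sizes(224)` yields six integer layer sizes before N/2^i stops being an integer.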
3. The method according to claim 1, wherein step 2a) calculates the common divisors M of the per-layer picture sizes according to the set of per-layer picture size values calculated in (1):
M=GCD(X)
wherein X is the set of per-layer picture size values and GCD() denotes taking the common divisors.
4. The method according to claim 1, wherein, from the picture common divisors M obtained in step 2a) and the total block RAM resource S_sum read in (1), the picture common divisor C meeting the block RAM resource limit of the FPGA is calculated:
C=max(M)
subject to: 2×M×T×32 + m²×32 ≤ S_sum
wherein M is a common divisor of the per-layer picture sizes, T is the number of channels, m is the convolution kernel size, S_sum is the block RAM size of the FPGA, and max() takes the maximum value.
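The constraint in claim 4 survives only as an image reference, so the exact inequality is an assumption: the sketch below supposes that two (double-buffered) picture segments of 32-bit pixels plus one m×m kernel must fit in the block RAM.

```python
def block_ram_divisor(M, T, m, S_sum):
    """C = max(M) subject to the block RAM limit. The constraint is a
    reconstruction (assumed): 2*M*T*32 + m*m*32 <= S_sum, i.e. a
    double-buffered segment per channel plus one m x m kernel."""
    feasible = [d for d in M if 2 * d * T * 32 + m * m * 32 <= S_sum]
    return max(feasible)
```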
5. The method according to claim 1, wherein, from the picture common divisors M limited by the block RAM resource in (2) and the maximum parallelizable number L in (1), the greatest common divisor satisfying the DSP resource limit is calculated as the picture segmentation fixed value n:
n = max(M), subject to M < L
wherein M is a picture common divisor limited by the block RAM resource, L is the maximum parallelizable number, and max() takes the maximum value.
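Claim 5 writes the bound as a strict inequality, n = max(M) < L; a one-line sketch of that selection (strictness is kept as written, though max(M) ≤ L may equally be intended):

```python
def segmentation_value(M, L):
    """Picture segmentation fixed value n: the largest common divisor
    strictly below the maximum parallelizable number L (per claim 5)."""
    return max(d for d in M if d < L)
```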
6. The method of claim 1, wherein step 4a) calculates the picture storage block RAM resource S_pic from the picture segmentation fixed value n determined in (2) and the number of channels T in (1):
S_pic = max(M)×max(T)×32
wherein M is a picture common divisor limited by the block RAM resource, T is the number of input channels, and max() takes the maximum value.
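Claim 6 is a direct product of three quantities; a trivial sketch, using the claim's 32-bit pixel width:

```python
def picture_block_ram(M, T):
    """S_pic = max(M) * max(T) * 32 bits (claim 6): the largest picture
    segment times the largest channel count, in 32-bit words."""
    return max(M) * max(T) * 32
```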
7. The method according to claim 1, wherein step 4b) calculates, from the picture storage block RAM resource S_pic in 4a) and the total block RAM resource S_sum in (1), the remaining block RAM storage resource S_last and the parameter storage block RAM resource S_ne:
S_last = S_sum − 2×S_pic
S_ne = u×32
wherein
u = max(X×T)
u is an intermediate variable, S_sum is the block RAM size of the FPGA, X is the set of per-layer picture size values, T is the number of input channels, and max() takes the maximum value.
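The S_last computation of claim 7 and the S_par decision of claim 1 step 4b) combine into a few lines. In this sketch S_ne (the parameter storage requirement) is simply an input, and the 0.5 Mbit reserve follows the claim text:

```python
def parameter_block_ram(S_sum, S_pic, S_ne):
    """S_last = S_sum - 2*S_pic (claim 7), then the step-4b) choice:
    take S_ne if it fits in the remainder, otherwise reserve 0.5 Mbit
    out of S_last and use the rest for parameters."""
    S_last = S_sum - 2 * S_pic
    S_par = S_ne if S_last >= S_ne else S_last - 512 * 1024
    return S_last, S_par
```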
8. The method according to claim 1, wherein in step 5a), according to the picture segmentation fixed value n in 2c), the number of convolution layers J, the number of channels T, the picture storage block RAM resource S_pic in 4a), and the number B of DDR3 in (3), the pixel points of the input picture are taken out of DDR3 and stored, as follows:
5a1) dividing the B DDR3 into two parts: B−1 DDR3 store picture pixel points, and the remaining 1 DDR3 stores parameters;
5a2) each of the B−1 DDR3 takes out picture pixel points in a matrix of length n and width n/(B−1), T times in total, wherein the start address of the picture pixel points begins at 0, is increased by n−1 after the picture has been taken T times, and returns to 0 after the whole picture has been taken;
5a3) storing the picture pixel points taken out of DDR3 in a block RAM resource of size S_pic, with the storage address increasing by 1 from 0;
5a4) repeating steps 5a2) -5a3) J times.
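The 5a2) addressing text is garbled on this page, so the schedule below is one plausible reading, offered only as an illustration: the start address is held for the T channel fetches of a group, then advanced by n−1 for the next group.

```python
def fetch_start_addresses(n, T, groups):
    """Assumed reading of claim 8's 5a2): the start address begins at 0,
    repeats for each of the T channel fetches, and advances by n-1 after
    every group of T fetches (wrapping to 0 is omitted here)."""
    addrs, addr = [], 0
    for _ in range(groups):
        addrs.extend([addr] * T)    # same start address for all T channels
        addr += n - 1
    return addrs
```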
9. The method of claim 1, wherein in step 5b), according to the convolution kernel size m in 1a), the picture storage block RAM resource S_pic in 4a), and the picture segmentation fixed value n in 2c), the picture data stored in 5a) are distributed as follows:
5b1) constructing an m×(n+1) register group, wherein the first m×n registers serve as a calculation group and the last m×1 registers serve as a cache group;
5b2) taking picture data of length m and width n from the picture storage block RAM and storing them in the matrix constructed in 5b1), wherein the start address of the picture data begins at 0 and is increased by m after each fetch;
5b3) each time, the calculation group outputs picture data of length and width n to the convolution module, while picture data of length m and width 1 are taken from the picture storage block RAM and stored into the cache group, the address beginning at 0 and increasing by 1 each time; after the calculation group has output m−1 times, the register data of its first row are discarded, the second-row data are assigned to the first-row registers, the third-row data to the second-row registers, and so on, each remaining row being assigned to the row above it.
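The row shift at the end of 5b3) is a classic line-buffer rotation; in list form (one sublist per register row, a hypothetical software stand-in for the hardware registers):

```python
def shift_window(window, cached_row):
    """Claim 9's 5b3) shift after m-1 outputs: discard the first row, move
    every remaining row up one slot, and append the cached row at the end."""
    return window[1:] + [cached_row]
```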
10. The method according to claim 1, wherein in step 5c), according to the picture segmentation fixed value n in 2c), convolution calculation is performed on the picture data distributed in 5b): the matrix picture data of length and width n input from 5b) are fed into n² DSPs and multiplied pairwise; a pipeline structure is adopted, and the products are added pairwise to complete the convolution calculation;
the pipeline structure means that, while processing data, the system accepts the next data-processing instruction on every clock pulse.
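In software terms, claim 10's multiply-then-add-pairwise scheme is a product step followed by a balanced addition tree, which is what the pipeline's add stages compute. A minimal sketch (flattened tile and kernel, one product per DSP):

```python
def convolve_tile(pixels, weights):
    """Claim 10 as a tree reduction: pairwise products (one per DSP),
    then repeated pairwise sums until a single accumulated value remains."""
    sums = [p * w for p, w in zip(pixels, weights)]
    while len(sums) > 1:
        if len(sums) % 2:
            sums.append(0)                 # pad an odd count with a zero term
        sums = [sums[i] + sums[i + 1] for i in range(0, len(sums), 2)]
    return sums[0]
```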
11. The method according to claim 1, wherein step 5d) pools the picture data after the convolution calculation of 5c) according to the pooling function in 1b), as follows:
5d1) acquiring the picture data input from 5c), and subtracting every two of each group of 4 picture data to obtain 6 results;
5d2) judging whether the highest bit of each of the 6 results from 5d1) is 1:
if it is 1, the minuend is discarded; if it is 0, the subtrahend is discarded; the 6 results are processed in sequence, and the last remaining picture datum is the maximum of the 4 picture data.
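The trick here is that the sign (highest) bit of a two's-complement difference identifies the smaller operand, so max pooling needs no comparator. A software sketch of the tournament over 4 values (bit width is a parameter; the claim's datapath is 32-bit):

```python
def pool_max4(a, b, c, d, bits=32):
    """Claim 11's pooling: take all 6 pairwise differences, test the sign
    bit of each, and discard the smaller operand; the survivor is the max.
    Assumes |x - y| < 2**(bits-1) so the sign bit is meaningful."""
    alive = [a, b, c, d]
    for x, y in ((a, b), (a, c), (a, d), (b, c), (b, d), (c, d)):
        diff = (x - y) & ((1 << bits) - 1)      # two's-complement difference
        loser = x if diff >> (bits - 1) else y  # sign bit 1 means x < y
        if len(alive) > 1 and loser in alive:
            alive.remove(loser)
    return alive[0]
```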
12. The method as claimed in claim 1, wherein in step 5e), according to the number B of DDR3 in (3) and the picture segmentation fixed value n in 2c), the picture data pooled in 5d) are stored back into DDR3: the pooled picture data of step 5d) are stored in the block RAM resource S_pic, and picture data of length n and width n/(B−1) are taken from the block RAM resource S_pic and stored back into DDR3, wherein the read address of the picture data begins at 0 and increases by 1 each time, and the DDR3 store address begins at 0 and increases by 8 each time.
13. The method according to claim 1, wherein in step 5f), according to the picture size N in 1a), the convolution kernel size m, the number of convolution layers J, the number of pooling layers D, the number of activation function layers E, the number of softmax layers G, the softmax layer input number I_in and output number I_out in 1b), the full-connection layer output value Q, and the picture segmentation size n in 2c), control instructions are constructed and distributed to the modules set in 5a), 5b), 5c), 5d) and 5e), as follows:
5f1) constructing a register group of length 128 and width J + C + G + Q + 1 to store the instructions;
5f2) the instruction fields are composed, from top to bottom, as follows: the input picture size N, 10 bits; the picture segmentation size n, 8 bits; the convolution kernel size m, 4 bits; the number of convolution layers J, 6 bits; the number of pooling layers D, 6 bits; the number of activation function layers E, 4 bits; the number of softmax layers G, 4 bits; the softmax layer input number I_in, 16 bits; the softmax layer output number I_out, 16 bits; the full-connection layer output value Q, 54 bits;
5f3) the instructions are transmitted simultaneously to the modules set in 5a), 5b), 5c), 5d) and 5e) through handshake signals;
the handshake signals mean that, before two modules communicate, they must acknowledge each other's enable signals; only then can they transmit data to each other.
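The field widths listed in 5f2) sum to exactly 128 bits (10+8+4+6+6+4+4+16+16+54), so the instruction packs into one 128-bit word. A sketch of that packing, top field in the most significant bits:

```python
def pack_instruction(N, n, m, J, D, E, G, I_in, I_out, Q):
    """Claim 13's 128-bit instruction word, fields packed top to bottom
    with the widths of 5f2); each field is range-checked first."""
    fields = ((N, 10), (n, 8), (m, 4), (J, 6), (D, 6),
              (E, 4), (G, 4), (I_in, 16), (I_out, 16), (Q, 54))
    word = 0
    for value, width in fields:
        assert 0 <= value < (1 << width), "field overflows its width"
        word = (word << width) | value
    return word
```

Shifting the word right by 118 bits (the total width of the nine lower fields) recovers N; masking the low 54 bits recovers Q.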
CN201810022870.7A 2018-01-10 2018-01-10 Image processing method based on FPGA (field programmable Gate array) accelerated convolutional neural network framework Active CN108154229B (en)


Publications (2)

Publication Number Publication Date
CN108154229A CN108154229A (en) 2018-06-12
CN108154229B true CN108154229B (en) 2022-04-08

Family

ID=62461260


Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109086867B (en) * 2018-07-02 2021-06-08 武汉魅瞳科技有限公司 Convolutional neural network acceleration system based on FPGA
CN109214506B (en) * 2018-09-13 2022-04-15 深思考人工智能机器人科技(北京)有限公司 Convolutional neural network establishing device and method based on pixels
CN111667046A (en) * 2019-03-08 2020-09-15 富泰华工业(深圳)有限公司 Deep learning acceleration method and user terminal
CN109978161B (en) * 2019-03-08 2022-03-04 吉林大学 Universal convolution-pooling synchronous processing convolution kernel system
CN110175670B (en) * 2019-04-09 2020-12-08 华中科技大学 Method and system for realizing YOLOv2 detection network based on FPGA
CN110413539B (en) * 2019-06-19 2021-09-14 深圳云天励飞技术有限公司 Data processing method and device
CN110399883A (en) * 2019-06-28 2019-11-01 苏州浪潮智能科技有限公司 Image characteristic extracting method, device, equipment and computer readable storage medium
CN110516800B (en) * 2019-07-08 2022-03-04 山东师范大学 Deep learning network application distributed self-assembly instruction processor core, processor, circuit and processing method
CN110390392B (en) * 2019-08-01 2021-02-19 上海安路信息科技有限公司 Convolution parameter accelerating device based on FPGA and data reading and writing method
CN114365148A (en) * 2019-10-22 2022-04-15 深圳鲲云信息科技有限公司 Neural network operation system and method

Citations (6)

Publication number Priority date Publication date Assignee Title
CN102118289A (en) * 2010-12-02 2011-07-06 西北工业大学 Real-time image segmentation processing system and high-speed intelligent unified bus interface method based on Institute of Electrical and Electronic Engineers (IEEE) 1394 interface
CN102420931A (en) * 2011-07-26 2012-04-18 西安费斯达自动化工程有限公司 Full-frame-rate image processing method based on FPGA (Field Programmable Gate Array)
CN106228240A (en) * 2016-07-30 2016-12-14 复旦大学 Degree of depth convolutional neural networks implementation method based on FPGA
CN106355244A (en) * 2016-08-30 2017-01-25 深圳市诺比邻科技有限公司 CNN (convolutional neural network) construction method and system
CN106611216A (en) * 2016-12-29 2017-05-03 北京旷视科技有限公司 Computing method and device based on neural network
CN107103113A (en) * 2017-03-23 2017-08-29 中国科学院计算技术研究所 Towards the Automation Design method, device and the optimization method of neural network processor

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US10572824B2 (en) * 2003-05-23 2020-02-25 Ip Reservoir, Llc System and method for low latency multi-functional pipeline with correlation logic and selectively activated/deactivated pipelined data processing engines
EP4235646A3 (en) * 2016-03-23 2023-09-06 Google LLC Adaptive audio enhancement for multichannel speech recognition

Non-Patent Citations (3)

Title
Musical Notes Classification with Neuromorphic Auditory System Using FPGA and a Convolutional Spiking Network; E. Cerezuela-Escudero et al.; 2015 International Joint Conference on Neural Networks (IJCNN); 2015-10-01; 1-7 *
Research on Parallel Architecture of Convolutional Neural Networks Based on FPGA; Lu Zhijian; China Doctoral Dissertations Full-text Database, Information Science and Technology; 2014-04-15 (No. 4); I140-12 *
Research on Deep Learning and Its Application in Medical Image Analysis; Wang Yuanyuan et al.; Video Engineering; 2016-10-17; Vol. 40, No. 10; 118-126 *


Similar Documents

Publication Publication Date Title
CN108154229B (en) Image processing method based on FPGA (field programmable Gate array) accelerated convolutional neural network framework
CN108805266B (en) Reconfigurable CNN high-concurrency convolution accelerator
CN112214726B (en) Operation accelerator
CN109409511B (en) Convolution operation data flow scheduling method for dynamic reconfigurable array
CN110210610B (en) Convolution calculation accelerator, convolution calculation method and convolution calculation device
CN111667051A (en) Neural network accelerator suitable for edge equipment and neural network acceleration calculation method
CN108229671B (en) System and method for reducing storage bandwidth requirement of external data of accelerator
CN109063825A (en) Convolutional neural networks accelerator
CN110543939B (en) Hardware acceleration realization device for convolutional neural network backward training based on FPGA
WO2022037257A1 (en) Convolution calculation engine, artificial intelligence chip, and data processing method
CN108170640B (en) Neural network operation device and operation method using same
US20220083857A1 (en) Convolutional neural network operation method and device
CN112668708B (en) Convolution operation device for improving data utilization rate
CN109146065B (en) Convolution operation method and device for two-dimensional data
CN110555516A (en) FPGA-based YOLOv2-tiny neural network low-delay hardware accelerator implementation method
CN113792621B (en) FPGA-based target detection accelerator design method
CN112836813A (en) Reconfigurable pulsation array system for mixed precision neural network calculation
CN111768458A (en) Sparse image processing method based on convolutional neural network
CN113033794A (en) Lightweight neural network hardware accelerator based on deep separable convolution
CN111340198A (en) Neural network accelerator with highly-multiplexed data based on FPGA (field programmable Gate array)
CN116720549A (en) FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache
US20200293863A1 (en) System and method for efficient utilization of multipliers in neural-network computations
CN107783935B (en) Approximate calculation reconfigurable array based on dynamic precision configurable operation
CN116888591A (en) Matrix multiplier, matrix calculation method and related equipment
CN116167425B (en) Neural network acceleration method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant