CN108154229B: Image processing method based on FPGA (field-programmable gate array) accelerated convolutional neural network framework


Info

Publication number: CN108154229B (granted patent; earlier publication CN108154229A)
Application number: CN201810022870.7A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 王坚灿, 董刚, 杨银堂
Applicant and assignee: Xidian University
Legal status: Active (granted)
Prior art keywords: picture, resource, block ram, size, data


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation using electronic means
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods


Abstract

The invention discloses an image processing method based on an FPGA (field-programmable gate array) accelerated convolutional neural network framework, which mainly addresses the low resource utilization and low speed of the prior art. The scheme is as follows: 1) calculate a picture segmentation fixed value from the designed picture parameters and the FPGA resource parameters; 2) determine the number of DDR3 memories from the picture segmentation fixed value and allocate block ram resources; 3) construct a convolutional neural network framework according to 1) and 2), comprising a picture storage module, a picture data distribution module, a convolution module, a pooling module, a module for storing pictures back to DDR3, and an instruction register group; 4) all modules obtain control instructions from the instruction register group through handshake signals, cooperate with one another, and process the image data according to the control instructions. By accelerating the convolutional neural network framework on the FPGA, the invention improves resource utilization and the acceleration effect, and can be used for image classification, target recognition, speech recognition and natural language processing.

Description

Image processing method based on FPGA (field programmable Gate array) accelerated convolutional neural network framework
Technical Field
The invention belongs to the technical field of computer design, and particularly relates to a convolutional neural network implementation method which can be used for image classification, target recognition, voice recognition and natural language processing.
Background
With the progress of integrated circuit design and manufacturing processes, field programmable gate arrays with high-speed, high-density programmable logic resources have developed rapidly, and single-chip integration keeps increasing. To further improve FPGA performance, mainstream chip manufacturers integrate customized DSP computing units with high-speed digital signal processing capability inside the chip; a DSP hard core can implement fixed-point operations efficiently and at low cost, so FPGAs are widely used in application fields such as video and image processing, network communication and information security, and bioinformatics.
The convolutional neural network (CNN) is an artificial neural network structure widely applied in image classification, target recognition, speech recognition, natural language processing and other fields. In recent years, with the great improvement of computing capability and the development of neural network structures, the performance and accuracy of CNNs have improved substantially, but the demand on the parallel computing capability of the operation units keeps growing, so GPUs (graphics processing units) and FPGAs with parallel computing capability have become the mainstream direction.
A configurable computing architecture based on an FPGA can exploit the parallelism of the artificial neural network and change the weights and topology of the convolutional neural network through configuration. An artificial neural network realized on an FPGA has the flexibility of software design while approaching an application-specific integrated circuit (ASIC) in computing performance; meanwhile, the on-chip programmable routing resources enable efficient interconnection, so the FPGA is an important choice for hardware implementation of artificial neural networks.
Current patents and research directions mostly build on the OpenCL programming language, with the aim of reducing the time needed to convert a convolutional neural network algorithm into a hardware description language, but they do not accelerate the hardware description language code of the FPGA algorithm itself. Moreover, OpenCL is not the language that actually runs on the FPGA, so the actual running speed of the FPGA is not ideal. The prior art based on OpenCL programming focuses mainly on accelerating the DSP module in the FPGA; it neither implements the convolutional neural network algorithm as a whole nor optimizes the underlying hardware description language, so the FPGA computing resources cannot be fully utilized, the computation time increases, and the acceleration effect is not obvious.
Disclosure of Invention
The object of the invention is to provide a convolutional neural network implementation method based on FPGA acceleration, which implements the convolutional neural network entirely in a hardware description language, optimizes the underlying hardware description language, makes full use of the FPGA operation resources, and maximizes the FPGA acceleration effect.
In order to achieve the purpose, the technical scheme of the invention comprises the following steps:
(1) parameter processing:
1a) reading the picture and FPGA board resource parameters input by the user, wherein the resource parameters comprise: the picture size N, the total block ram resource S_sum, the number P of DDR3 synchronous dynamic random access memories, and the number A of DSP computing units;
1b) designing the FPGA operation frequency f, the convolution kernel size m, the number of convolution layers J, the number of channels T, the number of pooling layers C, the number of activation function layers E, the number of multi-classification softmax layers G, the softmax layer input number I_in, the softmax layer output number I_out, the number of fully connected layers Q, a pooling function and an activation function;
1c) calculating a size value set X of each layer of pictures, a maximum convolution parallelizable number L, a theoretical operation speed bandwidth D and a theoretical data transmission bandwidth Z according to the data read in the step 1a) and the parameters designed in the step 1 b);
(2) fixed values for picture segmentation are calculated:
2a) calculating a common divisor M of the size of each layer of picture according to the value set X of the size of each layer of picture obtained by calculation in the step (1);
2b) according to the common divisor obtained in 2a) and the total block ram resource S_sum read in step (1), calculating the picture common divisor C satisfying the FPGA block ram resource limit;
2c) according to the resource-limited common divisors obtained in 2b) and the DSP resources read in step (1), calculating the greatest common divisor satisfying the DSP resource limit as the picture segmentation fixed value n;
(3) determining the number of DDR3:
calculating an actual data transmission bandwidth H according to the picture segmentation fixed value n, and comparing the actual data transmission bandwidth H with a theoretical data transmission bandwidth Z:
if H > Z, the number B of DDR3 is determined as 2 or 1 + 2j, where j is an integer ≥ 1;
if H ≤ Z, the number B of DDR3 is determined as 3 or 1 + 4i, where i is an integer ≥ 1 and i ≠ j;
(4) Resource allocation is carried out on the block ram on the FPGA:
4a) calculating the picture storage block ram resource S_pic according to the picture segmentation fixed value n determined in step (2) and the number of channels T in step (1);
4b) according to the picture storage block ram resource S_pic of 4a) and the total block ram resource S_sum of (1), calculating the remaining block ram storage resource S_last and the largest parameter storage block ram resource S_ne, and comparing their sizes: if S_last ≥ S_ne, take S_ne as the parameter storage block ram resource S_par; if S_last < S_ne, subtract 0.5 Mbit from S_last as the parameter storage block ram resource S_par;
(5) constructing a convolutional neural network framework and processing the input picture by combining the parameters of 1a), 1b), 2c), (3), 4a) and 4b):
5a) setting a picture storage module which, according to the picture segmentation fixed value n of 2c), the number of convolution layers J and channels T of 1b), the picture storage block ram resource S_pic of 4a) and the DDR3 number B of (3), fetches the pixel points of the input picture from DDR3 and stores them;
5b) setting a picture data distribution module which, according to the picture segmentation fixed value n of 2c), the picture storage block ram resource S_pic of 4a) and the parameter storage block ram resource S_par of 4b), distributes the picture data stored in 5a);
5c) setting a convolution module which, according to the picture segmentation fixed value n of 2c), performs convolution calculation on the picture data distributed in 5b);
5d) setting a pooling module which, according to the pooling function of 1b), pools the picture data after the convolution calculation of 5c);
5e) setting a picture store-back module which, according to the DDR3 number B of (3) and the picture segmentation fixed value n of 2c), stores the pooled picture data of 5d) back into DDR3;
5f) setting an instruction register group module which, according to the picture size N of 1a), the convolution kernel size m, number of convolution layers J, number of pooling layers C, number of activation function layers E, number of softmax layers G, softmax layer input number I_in, softmax layer output number I_out and fully connected layer output value Q of 1b), and the picture segmentation size n of 2c), constructs control instructions and distributes them to the modules set in 5a), 5b), 5c), 5d) and 5e).
Compared with the prior art, the invention has the following advantages:
1. The invention realizes an FPGA-accelerated convolutional neural network framework in a hardware description language.
2. Through the picture segmentation fixed value n of the parameter processing and the pipeline structure of the convolution module, the invention keeps the largest possible number of DSP resources busy with uninterrupted convolution calculation; this uninterrupted calculation maximizes DSP utilization and transmission efficiency, realizing the acceleration effect of the convolutional neural network framework.
3. By segmenting the picture in the picture storage module, the invention keeps the DDR3 transmission bandwidth at its maximum value, realizing maximum transmission efficiency.
4. By changing the design parameters of the parameter processing through the instruction register group module, convolutional neural networks with different picture sizes N and different convolution layer numbers J can be realized.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
fig. 2 is a diagram of simulation results of an embodiment of the present invention.
Detailed Description
The embodiments and effects of the present invention will be described in detail below with reference to the accompanying drawings;
Step 1, parameter processing.
1.1) reading the pictures and FPGA board resource parameters input by the user, wherein the FPGA parameters comprise: the picture size N, the total block ram resource S_sum, the number P of DDR3 synchronous dynamic random access memories, and the number A of DSP computing units;
1.2) design parameters, including: the FPGA operation frequency f, the convolution kernel size m, the number of convolution layers J, the number of channels T, the number of pooling layers C, the number of activation function layers E, the number of multi-classification softmax layers G, the softmax layer input number I_in, the softmax layer output number I_out, the number of fully connected layers Q, a pooling function and an activation function;
1.3) the computer calculates, from the read parameter values, the set X of per-layer picture size values, the maximum convolution parallelizable number L, the theoretical operation speed bandwidth, and the theoretical data transmission bandwidth:
1.3a) solving the set of per-layer picture size values X by the following formula:
X = N/2^i + 2, i = 0, 1, 2, …
wherein N is the picture size of 1.1), and X and i are integers;
1.3b) finding the maximum parallelizable convolution number L by the following formula:
L = ⌊A/m²⌋,
wherein A is the DSP resource number of 1.1), and m is the convolution kernel size of 1.2);
1.3c) solving the maximum operation speed bandwidth D by the following formula:
D = f × m² × 32 × L,
wherein f is the FPGA operation frequency of 1.2), m is the convolution kernel size of 1.2), and L is the maximum parallel number of 1.3 b);
1.3d) solving the data transmission bandwidth Z by the following formula:
Z=4×(P-1),
wherein P is the number of DDR3 of 1.1);
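As a sketch, the step-1 calculations can be written out with the example values of the simulation section (N = 224, A = 424, m = 3, f = 150 MHz, P = 3). The floor in L is an assumption consistent with the operation-bandwidth formula D = f × m² × 32 × L; the patent gives the formula for L only as an embedded image.

```python
def derive_parameters(N, A, m, f, P):
    # 1.3a) per-layer picture sizes X = N / 2**i + 2, integer values only
    X, i = [], 0
    while N % (2 ** i) == 0:
        X.append(N // (2 ** i) + 2)
        i += 1
    # 1.3b) maximum parallelizable convolution count: m*m DSPs per kernel
    L = A // (m * m)
    # 1.3c) theoretical operation speed bandwidth for 32-bit data (bit/s)
    D = f * m * m * 32 * L
    # 1.3d) theoretical data transmission bandwidth
    Z = 4 * (P - 1)
    return X, L, D, Z

# example values from the simulation section of this patent
X, L, D, Z = derive_parameters(N=224, A=424, m=3, f=150_000_000, P=3)
```

With these inputs the sketch yields six per-layer sizes (226 down to 9) and L = 47 parallel kernels.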
Step 2, calculating the picture segmentation fixed value.
2.1) solving the common divisor M of the size of each layer of picture by the following formula:
M=GCD(X)
wherein X is the set of per-layer picture size values of 1.3a), and GCD() denotes the set of common divisors of all elements of X;
2.2) solving the picture common divisor C satisfying the block ram resource limit by the following formula:
C = max(M)
s.t. 2 × M × m × T × 32 ≤ S_sum
wherein M is the set of common divisors of the per-layer picture sizes of 2.1), T is the number of channels of 1.2), m is the convolution kernel size of 1.2),
S_sum is the total block ram resource of 1.1), and max() takes the maximum value;
2.3) solving the picture segmentation fixed value n satisfying the DSP resource limit by the following formula:
n = max(C), subject to n < L,
wherein C is the picture common divisor of 2.2), and L is the maximum parallelizable number of 1.3b).
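Step 2 can be sketched as a search over the common divisors of the picture sizes. The block ram inequality used below (2 × c × m × T × 32 ≤ S_sum) is an assumption, since the source shows that constraint only as an embedded image; the set of common divisors of X is computed as the divisors of gcd(X).

```python
from functools import reduce
from math import gcd

def segmentation_fixed_value(X, S_sum, T, m, L):
    # 2.1) the common divisors of all sizes in X are the divisors of gcd(X)
    M = reduce(gcd, X)
    divisors = [c for c in range(1, M + 1) if M % c == 0]
    # 2.2) keep divisors that fit the block ram limit (assumed inequality)
    C = [c for c in divisors if 2 * c * m * T * 32 <= S_sum]
    # 2.3) keep divisors below the DSP limit L and take the maximum
    return max(c for c in C if c < L)
```

The function raises an error when no divisor satisfies both limits, which in hardware would mean the design parameters must be revised.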
Step 3, determining the number of DDR3.
3.1) finding the actual data transmission bandwidth H by the following formula:
H = n² × 32 × max(T),
wherein n is a fixed value for picture segmentation of 2.3), and T is the number of channels of 1.2);
3.2) comparing the actual data transmission bandwidth H with the theoretical data transmission bandwidth Z, and calculating the number B of DDR 3:
if H > Z, determining the number B of DDR3 to be 2 or 1+2j, j being an integer greater than or equal to 1;
if H ≤ Z, the number B of DDR3 is determined as 3 or 1 + 4i, where i is an integer ≥ 1 and i ≠ j, and Z is the theoretical data transmission bandwidth of 1.3d).
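A sketch of the step-3 decision: since j and i are free integer parameters in the patent, the smallest admissible B is returned in each branch.

```python
def ddr3_count(n, T_max, Z):
    # 3.1) actual data transmission bandwidth for n*n 32-bit pixels
    H = n * n * 32 * T_max
    # 3.2) compare against the theoretical bandwidth Z
    if H > Z:
        return 2    # smallest member of {2} and {1 + 2j, j >= 1}
    return 3        # smallest member of {3} and {1 + 4i, i >= 1}
```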
Step 4, allocating block ram resources on the FPGA.
4.1) solving the picture storage block ram resource S_pic by the following formula:
S_pic = max(M) × max(T) × 32,
wherein M is the set of common divisors of the per-layer picture sizes of 2.1), and T is the number of channels of 1.2);
4.2) solving the remaining block ram storage resource S_last by the following formula:
S_last = S_sum − 2 × S_pic,
wherein S_pic is the picture storage block ram resource of 4.1), and S_sum is the total FPGA block ram resource of 1.1);
4.3) obtaining the parameter storage block ram resource S_ne:
4.3a) solving the intermediate variable U from the picture segmentation fixed value n of 2.3), the set X of per-layer picture size values of 1.3a), and the number of channels T of 1.2), where max() takes the maximum value (the formula for U appears only as an embedded image in the original);
4.3b) solving the parameter storage block ram resource S_ne from the intermediate variable U of 4.3a), the total block ram resource S_sum of 1.1), the picture storage block ram resource S_pic of 4.1), and the remaining block ram storage resource S_last of 4.2) (this formula likewise appears only as an embedded image in the original);
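The step-4 allocation can be sketched as follows. Because the patent's formulas for U and S_ne survive only as embedded images, S_ne is treated here as a precomputed input; all sizes are in bits, and 0.5 Mbit is assumed to mean 5 × 10⁵ bits.

```python
def allocate_block_ram(M_max, T_max, S_sum, S_ne):
    S_pic = M_max * T_max * 32      # 4.1) picture storage block ram
    S_last = S_sum - 2 * S_pic      # 4.2) remainder after double buffering
    if S_last >= S_ne:              # 4b) parameter storage block ram S_par
        S_par = S_ne
    else:
        S_par = S_last - 500_000    # subtract 0.5 Mbit
    return S_pic, S_last, S_par
```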
and 5, setting a picture storage module.
5.1) dividing the B DDR3 into two parts: B−1 DDR3 store the picture pixel points and the remaining 1 DDR3 stores parameters, where B is the DDR3 number of 3.2);
5.2) from each of the B−1 DDR3, taking picture pixel points in a matrix of length n and a width given by a formula that appears only as an embedded image in the original, T times in total; the pixel start address begins at 0, increases by n−1 after every T fetches, and returns to 0 once the picture has been fetched completely, where T is the number of channels of 1.2) and n is the picture segmentation fixed value of 2.3);
5.3) storing the picture pixel points fetched from DDR3 into block ram resources of size S_pic, with the store address increasing by 1 from 0, where S_pic is the picture storage block ram resource of 4.1);
5.4) repeating steps 5.2) to 5.3) J times, where J is the number of convolution layers of 1.2).
Step 6, setting the picture data distribution module.
6.1) constructing a register group of m × (n+1): the first m × n registers form the calculation group and the last m × 1 column forms the cache group, where n is the picture segmentation fixed value of 2.3) and m is the convolution kernel size of 1.2);
6.2) taking picture data in a matrix of length m and width n from the picture storage block ram resource and storing it in the calculation group constructed in 6.1); the data start address begins at 0 and increases by m after each fetch, where m is the convolution kernel size of 1.2) and n is the picture segmentation fixed value of 2.3).
6.3) each time, the calculation group outputs picture data of length and width n to the convolution module, while picture data of length m and width 1 is fetched from the picture storage block ram into the cache group, the address starting from 0 and incrementing by 1 each time; after the calculation group has produced m−1 outputs, the register data of its first row is discarded, the second row is assigned to the first row, the third row to the second row, and so on, each remaining row being assigned to the row above it, where m is the convolution kernel size of 1.2).
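The row-shift behaviour of the step-6 register group can be sketched in software, with Python lists standing in for the hardware registers (the cache-group addressing details are simplified):

```python
def shift_register_group(calc_group, new_row):
    # calc_group: m rows of n pixel values (the m-by-n calculation group);
    # new_row: the freshly buffered row coming from the cache group.
    # Discard row 1, move every row up one place, append the new row.
    return calc_group[1:] + [new_row]
```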
Step 7, setting the convolution module.
Input the matrix picture data of length and width n from step 6 into n² DSPs, multiply two by two and, using a pipeline structure, add adjacent products two by two to complete the convolution calculation; input the convolution result into the pooling module, where n is the picture segmentation fixed value of 2.3);
The pipeline structure means that, while the system processes data, a new data-processing instruction is accepted on every clock pulse.
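The multiply-then-add-adjacent-pairs datapath of step 7 can be sketched as an adder tree over one m × m window (here flattened to a list); each pass of the while loop corresponds to one pipeline stage.

```python
def conv_window(pixels, weights):
    # pixels, weights: the m*m window values, flattened to equal-length lists
    products = [p * w for p, w in zip(pixels, weights)]   # one multiply per DSP
    # add adjacent values two by two until a single sum remains
    while len(products) > 1:
        if len(products) % 2:
            products.append(0)      # pad an odd stage with a zero term
        products = [products[i] + products[i + 1]
                    for i in range(0, len(products), 2)]
    return products[0]
```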
Step 8, setting the pooling module.
Acquire the picture data input in step 7 and, in input order, subtract each pair within every group of 4 picture data, obtaining 6 results; judge whether the highest bit of each of the 6 results is 1:
if it is 1, the minuend is eliminated;
if it is 0, the subtrahend is eliminated. After the 6 results are processed in sequence, the remaining picture datum is the maximum of the 4 picture data, and it is passed to the module that stores pictures back to DDR3.
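The subtraction-and-sign-bit pooling rule of step 8 can be sketched as follows. A 32-bit two's-complement word width is an assumption; both subtraction orders are checked, which is equivalent to the patent's 6 pairwise results.

```python
BITS = 32  # assumed pixel word width (two's complement)

def max_of_four(p):
    # p: the 4 picture data of one pooling window, in input order
    alive = [True] * 4
    for i in range(4):
        for j in range(4):
            if i == j:
                continue
            # hardware-style wrapped difference; the highest bit is the sign
            d = (p[i] - p[j]) & ((1 << BITS) - 1)
            if d >> (BITS - 1):     # p[i] - p[j] < 0: p[i] is not the maximum
                alive[i] = False
    # only the maxima survive the eliminations; return the first survivor
    return [v for v, keep in zip(p, alive) if keep][0]
```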
Step 9, setting the module that stores pictures back to DDR3.
Store the picture data of step 8 in the block ram resource S_pic; from S_pic, take picture data of length n and a width given by a formula that appears only as an embedded image in the original, and store it back into DDR3, where the picture data address starts at 0 and increments by 1 each time, the DDR3 store address starts at 0 and increments by 8 each time, n is the picture segmentation fixed value of 2.3), and B is the DDR3 number of 3.2).
Step 10, setting an instruction register group.
10.1) constructing a register group of length 128 and width J + C + G + Q + 1 to store control instructions, where J is the number of convolution layers of 1.2), C is the number of pooling layers of 1.2), G is the number of softmax layers of 1.2), and Q is the fully connected layer output value of 1.2);
10.2) constructing a 128-bit binary control instruction with the following fields in order from high to low: a 10-bit input picture size N, an 8-bit picture segmentation size n, a 4-bit convolution kernel size m, a 6-bit number of convolution layers J, a 6-bit number of pooling layers C, a 4-bit number of activation function layers E, a 4-bit number of softmax layers G, a 16-bit softmax layer input number I_in, a 16-bit softmax layer output number I_out, and a 54-bit fully connected layer output value Q, where N is the input picture size of 1.1), n is the picture segmentation fixed value of 2.3), m is the convolution kernel size of 1.2), J is the number of convolution layers of 1.2), C is the number of pooling layers of 1.2), E is the number of activation function layers of 1.2), G is the number of softmax layers of 1.2), I_in is the softmax layer input number of 1.2), I_out is the softmax layer output number of 1.2), and Q is the fully connected layer output value of 1.2);
10.3) transmitting control instructions to the modules arranged in the steps 5-9 at the same time through handshake signals.
A handshake signal means that before two modules communicate, they must acknowledge each other's enable signals; only then can they transmit data to each other.
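The 128-bit instruction layout of step 10.2 can be sketched as a bit-packing routine; the field widths sum to exactly 128 bits (10+8+4+6+6+4+4+16+16+54). The example values loosely follow the simulation section, with n = 2 chosen purely for illustration.

```python
# field name and bit width, most significant field first (step 10.2)
FIELDS = [("N", 10), ("n", 8), ("m", 4), ("J", 6), ("C", 6),
          ("E", 4), ("G", 4), ("I_in", 16), ("I_out", 16), ("Q", 54)]

def pack_instruction(values):
    word = 0
    for name, width in FIELDS:
        v = values[name]
        assert 0 <= v < (1 << width), f"{name} does not fit in {width} bits"
        word = (word << width) | v
    return word     # one 128-bit control instruction as a Python integer

instr = pack_instruction({"N": 224, "n": 2, "m": 3, "J": 8, "C": 8, "E": 8,
                          "G": 2, "I_in": 5120, "I_out": 1024, "Q": 100})
```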
The effects of the present invention can be further illustrated by the following simulations.
1. Simulation conditions
The simulation uses an FPGA platform from Unigroup Tongchuang (紫光同创, Pango Microsystems), model PGT180H;
reading FPGA resource parameters and design parameters input by a user by a computer:
the FPGA resource parameters comprise: picture size N224, total block ram resource Ssum9.2M, the number P of the sdram DDR3 is 3 and the number a of the computing function chips DSP is 424.
The design parameters comprise: FPGA operation frequency f = 150 MHz, convolution kernel size m = 3, number of convolution layers J = 8, number of channels T = 524, number of pooling layers C = 8, number of activation function layers E = 8, number of multi-classification softmax layers G = 2, softmax layer input number I_in = 5120, softmax layer output number I_out = 1024, number of fully connected layers Q = 100; the pooling function is max pooling, and the activation function is the rectified linear unit (ReLU);
2. Simulation content
Simulation 1: using ModelSim software and the above parameters, the method of the invention processes an input image of length and width N = 224 and pixel size 1 at a clock frequency of 150 MHz, giving the convolution module output shown in Fig. 2.
As can be seen from Fig. 2, the convolution module output values are 9, 36, 81 and 144, which match the result of the convolutional neural network algorithm on a CPU with the same input image and design parameters, verifying that the method correctly implements the convolutional neural network structure.
Simulation 2: using ModelSim software and the above parameters, the method of the invention processes an input picture of length and width N = 224 at a clock frequency of 150 MHz, giving the FPGA simulation time and resource utilization shown in Table 1.
As can be seen from Table 1 below, the simulation time is 0.4 s, the maximum clock frequency is 187 MHz, and the DDR3 bandwidth is 3.73 Gbit/s, maximizing transmission efficiency; 390 DSPs are used, a resource utilization of 87%; 9.1 Mbit of block ram is used, a resource utilization of 98.9%; the final computation speed is 6.2 Gbit/s, realizing the acceleration effect of the convolutional neural network framework.
TABLE 1

Simulation time:            0.4 s
Maximum clock frequency:    187 MHz
DDR3 bandwidth:             3.73 Gbit/s
DSPs used:                  390 (87% utilization)
Block ram used:             9.1 Mbit (98.9% utilization)
Computation speed:          6.2 Gbit/s

Claims (13)

1. The image processing method based on the FPGA accelerated convolutional neural network framework comprises the following steps:
(1) parameter processing:
1a) reading the picture and FPGA board resource parameters input by the user, wherein the resource parameters comprise: the picture size N, the total block ram resource S_sum, the number P of DDR3 synchronous dynamic random access memories, and the number A of DSP computing units;
1b) designing the FPGA operation frequency f, the convolution kernel size m, the number of convolution layers J, the number of channels T, the number of pooling layers D, the number of activation function layers E, the number of multi-classification softmax layers G, the softmax input number I_in, the softmax layer output number I_out, the number of fully connected layers Q, a pooling function and an activation function;
1c) calculating a size value set X of each layer of pictures, a maximum convolution parallelizable number L, a theoretical operation speed bandwidth D and a theoretical data transmission bandwidth Z according to the data read in the step 1a) and the parameters designed in the step 1 b);
(2) fixed values for picture segmentation are calculated:
2a) calculating a common divisor M of the size of each layer of picture according to the value set X of the size of each layer of picture obtained by calculation in the step (1);
2b) according to the common divisor obtained in 2a) and the total block ram resource S_sum read in step (1), calculating the picture common divisor C satisfying the FPGA block ram resource limit;
2c) calculating the maximum common divisor meeting the DSP resource limit as a picture segmentation fixed value n according to the resource limit common divisor obtained in the step 2b) and the DSP resource read in the step (1);
(3) determining the number of DDR3:
calculating an actual data transmission bandwidth H according to the picture segmentation fixed value n, and comparing the actual data transmission bandwidth H with a theoretical data transmission bandwidth Z:
if H > Z, the number B of DDR3 is determined as 2 or 1 + 2j, where j is an integer ≥ 1;
if H ≤ Z, the number B of DDR3 is determined as 3 or 1 + 4i, where i is an integer ≥ 1 and i ≠ j;
(4) Resource allocation is carried out on the block ram on the FPGA:
4a) calculating the picture storage block ram resource S_pic according to the picture segmentation fixed value n determined in step (2) and the number of channels T in step (1);
4b) according to the picture storage block ram resource S_pic of 4a) and the total block ram resource S_sum of (1), calculating the remaining block ram storage resource S_last and the largest parameter storage block ram resource S_ne, and comparing their sizes: if S_last ≥ S_ne, take S_ne as the parameter storage block ram resource S_par; if S_last < S_ne, subtract 0.5 Mbit from S_last as the parameter storage block ram resource S_par;
(5) constructing a convolutional neural network framework and processing the input picture by combining the parameters of 1a), 1b), 2c), (3), 4a) and 4b):
5a) setting a picture storage module which, according to the picture segmentation fixed value n of 2c), the number of convolution layers J and channels T of 1b), the picture storage block ram resource S_pic of 4a) and the DDR3 number B of (3), fetches the pixel points of the input picture from DDR3 and stores them;
5b) setting a picture data distribution module which, according to the picture segmentation fixed value n of 2c), the picture storage block ram resource S_pic of 4a) and the parameter storage block ram resource S_par of 4b), distributes the picture data stored in 5a);
5c) setting a convolution module which, according to the picture segmentation fixed value n of 2c), performs convolution calculation on the picture data distributed in 5b);
5d) setting a pooling module which, according to the pooling function of 1b), pools the picture data after the convolution calculation of 5c);
5e) setting a picture store-back module which, according to the DDR3 number B of (3) and the picture segmentation fixed value n of 2c), stores the pooled picture data of 5d) back into DDR3;
5f) setting an instruction register group module which, according to the picture size N of 1a), the convolution kernel size m, number of convolution layers J, number of pooling layers D, number of activation function layers E, number of softmax layers G, softmax layer input number I_in, softmax layer output number I_out and fully connected layer output value Q of 1b), and the picture segmentation size n of 2c), constructs control instructions and distributes them to the modules set in 5a), 5b), 5c), 5d) and 5e).
2. The method according to claim 1, wherein in step 1c), the set X of per-layer picture size values, the maximum convolution parallelizable number L, the theoretical operation speed bandwidth D and the theoretical data transmission bandwidth Z are calculated, according to the data read in 1a) and the parameters designed in 1b), by the following formulas:
X = N/2^i + 2, i = 0, 1, 2, ...
L = ⌊A/m²⌋
D = f×m²×32×L
Z = 4×(P−1)
wherein N is the picture size, L is the maximum parallelizable number, A is the number of DSP resources, m is the convolution kernel size, f is the FPGA operating frequency, P is the number of DDR3, and X and i are integers.
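The per-layer sizes and bandwidths of claim 2 can be sketched in software. This is a minimal illustration, not the patented implementation: the equation image for L is missing from the page and is assumed here to be L = ⌊A/m²⌋, i.e. one group of m×m DSP multipliers per parallel convolution window.

```python
def layer_sizes(N):
    """Set X of per-layer picture sizes: X = N/2**i + 2 for i = 0, 1, 2, ...
    while N/2**i stays an integer (the halving models 2x2 pooling; the +2
    is assumed to be the convolution border)."""
    sizes, i = [], 0
    while N % (2 ** i) == 0:
        sizes.append(N // (2 ** i) + 2)
        i += 1
    return sizes

def bandwidths(A, m, f, P):
    """L, D, Z from claim 2; L = A // m**2 is a reconstruction of the
    missing equation image (m*m DSP multipliers per convolution window)."""
    L = A // (m * m)
    D = f * m * m * 32 * L      # theoretical operation speed bandwidth, bit/s
    Z = 4 * (P - 1)             # theoretical data transmission bandwidth
    return L, D, Z
```

For a 224-pixel input, `layer_sizes(224)` yields six integer layer sizes before N/2^i stops being an integer.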
3. The method according to claim 1, wherein step 2a) calculates the common divisors M of the per-layer picture sizes according to the set of per-layer picture size values calculated in (1):
M=GCD(X)
wherein X is the set of per-layer picture size values and GCD() denotes taking the common divisors.
4. The method according to claim 1, wherein, from the picture common divisors M obtained in step 2a) and the total block RAM resource S_sum read in (1), the picture common divisor C meeting the block RAM resource limit of the FPGA is calculated:
C=max(M)
subject to: 2×M×T×32 + m²×32 ≤ S_sum
wherein M is a common divisor of the per-layer picture sizes, T is the number of channels, m is the convolution kernel size, S_sum is the block RAM size of the FPGA, and max() takes the maximum value.
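The constraint in claim 4 survives only as an image reference, so the exact inequality is an assumption: the sketch below supposes that two (double-buffered) picture segments of 32-bit pixels plus one m×m kernel must fit in the block RAM.

```python
def block_ram_divisor(M, T, m, S_sum):
    """C = max(M) subject to the block RAM limit. The constraint is a
    reconstruction (assumed): 2*M*T*32 + m*m*32 <= S_sum, i.e. a
    double-buffered segment per channel plus one m x m kernel."""
    feasible = [d for d in M if 2 * d * T * 32 + m * m * 32 <= S_sum]
    return max(feasible)
```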
5. The method according to claim 1, wherein, from the picture common divisors M limited by the block RAM resource in (2) and the maximum parallelizable number L in (1), the greatest common divisor satisfying the DSP resource limit is calculated as the picture segmentation fixed value n:
n = max(M), subject to M < L
wherein M is a picture common divisor limited by the block RAM resource, L is the maximum parallelizable number, and max() takes the maximum value.
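Claim 5 writes the bound as a strict inequality, n = max(M) < L; a one-line sketch of that selection (strictness is kept as written, though max(M) ≤ L may equally be intended):

```python
def segmentation_value(M, L):
    """Picture segmentation fixed value n: the largest common divisor
    strictly below the maximum parallelizable number L (per claim 5)."""
    return max(d for d in M if d < L)
```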
6. The method of claim 1, wherein step 4a) calculates the picture storage block RAM resource S_pic from the picture segmentation fixed value n determined in (2) and the number of channels T in (1):
S_pic = max(M)×max(T)×32
wherein M is a picture common divisor limited by the block RAM resource, T is the number of input channels, and max() takes the maximum value.
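Claim 6 is a direct product of three quantities; a trivial sketch, using the claim's 32-bit pixel width:

```python
def picture_block_ram(M, T):
    """S_pic = max(M) * max(T) * 32 bits (claim 6): the largest picture
    segment times the largest channel count, in 32-bit words."""
    return max(M) * max(T) * 32
```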
7. The method according to claim 1, wherein step 4b) calculates, from the picture storage block RAM resource S_pic in 4a) and the total block RAM resource S_sum in (1), the remaining block RAM storage resource S_last and the parameter storage block RAM resource S_ne:
S_last = S_sum − 2×S_pic
S_ne = u×32
wherein
u = max(X×T)
u is an intermediate variable, S_sum is the block RAM size of the FPGA, X is the set of per-layer picture size values, T is the number of input channels, and max() takes the maximum value.
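The S_last computation of claim 7 and the S_par decision of claim 1 step 4b) combine into a few lines. In this sketch S_ne (the parameter storage requirement) is simply an input, and the 0.5 Mbit reserve follows the claim text:

```python
def parameter_block_ram(S_sum, S_pic, S_ne):
    """S_last = S_sum - 2*S_pic (claim 7), then the step-4b) choice:
    take S_ne if it fits in the remainder, otherwise reserve 0.5 Mbit
    out of S_last and use the rest for parameters."""
    S_last = S_sum - 2 * S_pic
    S_par = S_ne if S_last >= S_ne else S_last - 512 * 1024
    return S_last, S_par
```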
8. The method according to claim 1, wherein in step 5a), according to the picture segmentation fixed value n in 2c), the number of convolution layers J, the number of channels T, the picture storage block RAM resource S_pic in 4a), and the number B of DDR3 in (3), the pixel points of the input picture are taken out of DDR3 and stored, as follows:
5a1) dividing the B DDR3 into two parts: B−1 DDR3 store picture pixel points, and the remaining 1 DDR3 stores parameters;
5a2) each of the B−1 DDR3 takes out picture pixel points in a matrix of length n and width n/(B−1), T times in total, wherein the start address of the picture pixel points begins at 0, is increased by n−1 after the picture has been taken T times, and returns to 0 after the whole picture has been taken;
5a3) storing the picture pixel points taken out of DDR3 in a block RAM resource of size S_pic, with the storage address increasing by 1 from 0;
5a4) repeating steps 5a2) -5a3) J times.
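The 5a2) addressing text is garbled on this page, so the schedule below is one plausible reading, offered only as an illustration: the start address is held for the T channel fetches of a group, then advanced by n−1 for the next group.

```python
def fetch_start_addresses(n, T, groups):
    """Assumed reading of claim 8's 5a2): the start address begins at 0,
    repeats for each of the T channel fetches, and advances by n-1 after
    every group of T fetches (wrapping to 0 is omitted here)."""
    addrs, addr = [], 0
    for _ in range(groups):
        addrs.extend([addr] * T)    # same start address for all T channels
        addr += n - 1
    return addrs
```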
9. The method of claim 1, wherein in step 5b), according to the convolution kernel size m in 1a), the picture storage block RAM resource S_pic in 4a), and the picture segmentation fixed value n in 2c), the picture data stored in 5a) are distributed as follows:
5b1) constructing an m×(n+1) register group, wherein the first m×n registers serve as a calculation group and the last m×1 registers serve as a cache group;
5b2) taking picture data of length m and width n from the picture storage block RAM and storing them in the matrix constructed in 5b1), wherein the start address of the picture data begins at 0 and is increased by m after each fetch;
5b3) each time, the calculation group outputs picture data of length and width n to the convolution module, while picture data of length m and width 1 are taken from the picture storage block RAM and stored into the cache group, the address beginning at 0 and increasing by 1 each time; after the calculation group has output m−1 times, the register data of its first row are discarded, the second-row data are assigned to the first-row registers, the third-row data to the second-row registers, and so on, each remaining row being assigned to the row above it.
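The row shift at the end of 5b3) is a classic line-buffer rotation; in list form (one sublist per register row, a hypothetical software stand-in for the hardware registers):

```python
def shift_window(window, cached_row):
    """Claim 9's 5b3) shift after m-1 outputs: discard the first row, move
    every remaining row up one slot, and append the cached row at the end."""
    return window[1:] + [cached_row]
```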
10. The method according to claim 1, wherein in step 5c), according to the picture segmentation fixed value n in 2c), convolution calculation is performed on the picture data distributed in 5b): the matrix picture data of length and width n input from 5b) are fed into n² DSPs and multiplied pairwise; a pipeline structure is adopted, and the products are added pairwise to complete the convolution calculation;
the pipeline structure means that, while processing data, the system accepts the next data-processing instruction on every clock pulse.
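In software terms, claim 10's multiply-then-add-pairwise scheme is a product step followed by a balanced addition tree, which is what the pipeline's add stages compute. A minimal sketch (flattened tile and kernel, one product per DSP):

```python
def convolve_tile(pixels, weights):
    """Claim 10 as a tree reduction: pairwise products (one per DSP),
    then repeated pairwise sums until a single accumulated value remains."""
    sums = [p * w for p, w in zip(pixels, weights)]
    while len(sums) > 1:
        if len(sums) % 2:
            sums.append(0)                 # pad an odd count with a zero term
        sums = [sums[i] + sums[i + 1] for i in range(0, len(sums), 2)]
    return sums[0]
```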
11. The method according to claim 1, wherein step 5d) pools the picture data after the convolution calculation of 5c) according to the pooling function in 1b), as follows:
5d1) acquiring the picture data input from 5c), and subtracting every two of each group of 4 picture data to obtain 6 results;
5d2) judging whether the highest bit of each of the 6 results from 5d1) is 1:
if it is 1, the minuend is discarded; if it is 0, the subtrahend is discarded; the 6 results are processed in sequence, and the last remaining picture datum is the maximum of the 4 picture data.
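The trick here is that the sign (highest) bit of a two's-complement difference identifies the smaller operand, so max pooling needs no comparator. A software sketch of the tournament over 4 values (bit width is a parameter; the claim's datapath is 32-bit):

```python
def pool_max4(a, b, c, d, bits=32):
    """Claim 11's pooling: take all 6 pairwise differences, test the sign
    bit of each, and discard the smaller operand; the survivor is the max.
    Assumes |x - y| < 2**(bits-1) so the sign bit is meaningful."""
    alive = [a, b, c, d]
    for x, y in ((a, b), (a, c), (a, d), (b, c), (b, d), (c, d)):
        diff = (x - y) & ((1 << bits) - 1)      # two's-complement difference
        loser = x if diff >> (bits - 1) else y  # sign bit 1 means x < y
        if len(alive) > 1 and loser in alive:
            alive.remove(loser)
    return alive[0]
```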
12. The method as claimed in claim 1, wherein in step 5e), according to the number B of DDR3 in (3) and the picture segmentation fixed value n in 2c), the picture data pooled in 5d) are stored back into DDR3: the pooled picture data of step 5d) are stored in the block RAM resource S_pic, and picture data of length n and width n/(B−1) are taken from the block RAM resource S_pic and stored back into DDR3, wherein the read address of the picture data begins at 0 and increases by 1 each time, and the DDR3 store address begins at 0 and increases by 8 each time.
13. The method according to claim 1, wherein in step 5f), according to the picture size N in 1a), the convolution kernel size m, the number of convolution layers J, the number of pooling layers D, the number of activation function layers E, the number of softmax layers G, the softmax layer input number I_in and output number I_out in 1b), the full-connection layer output value Q, and the picture segmentation size n in 2c), control instructions are constructed and distributed to the modules set in 5a), 5b), 5c), 5d) and 5e), as follows:
5f1) constructing a register group of length 128 and width J + C + G + Q + 1 to store the instructions;
5f2) the instruction fields are composed, from top to bottom, as follows: the input picture size N, 10 bits; the picture segmentation size n, 8 bits; the convolution kernel size m, 4 bits; the number of convolution layers J, 6 bits; the number of pooling layers D, 6 bits; the number of activation function layers E, 4 bits; the number of softmax layers G, 4 bits; the softmax layer input number I_in, 16 bits; the softmax layer output number I_out, 16 bits; the full-connection layer output value Q, 54 bits;
5f3) the instructions are transmitted simultaneously to the modules set in 5a), 5b), 5c), 5d) and 5e) through handshake signals;
the handshake signals mean that, before two modules communicate, they must acknowledge each other's enable signals; only then can they transmit data to each other.
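The field widths listed in 5f2) sum to exactly 128 bits (10+8+4+6+6+4+4+16+16+54), so the instruction packs into one 128-bit word. A sketch of that packing, top field in the most significant bits:

```python
def pack_instruction(N, n, m, J, D, E, G, I_in, I_out, Q):
    """Claim 13's 128-bit instruction word, fields packed top to bottom
    with the widths of 5f2); each field is range-checked first."""
    fields = ((N, 10), (n, 8), (m, 4), (J, 6), (D, 6),
              (E, 4), (G, 4), (I_in, 16), (I_out, 16), (Q, 54))
    word = 0
    for value, width in fields:
        assert 0 <= value < (1 << width), "field overflows its width"
        word = (word << width) | value
    return word
```

Shifting the word right by 118 bits (the total width of the nine lower fields) recovers N; masking the low 54 bits recovers Q.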
CN201810022870.7A 2018-01-10 2018-01-10 Image processing method based on FPGA (field programmable Gate array) accelerated convolutional neural network framework Active CN108154229B (en)


Publications (2)

Publication Number Publication Date
CN108154229A CN108154229A (en) 2018-06-12
CN108154229B true CN108154229B (en) 2022-04-08

Family

ID=62461260


Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109086867B (en) * 2018-07-02 2021-06-08 武汉魅瞳科技有限公司 Convolutional neural network acceleration system based on FPGA
CN109214506B (en) * 2018-09-13 2022-04-15 深思考人工智能机器人科技(北京)有限公司 Convolutional neural network establishing device and method based on pixels
CN111667046A (en) * 2019-03-08 2020-09-15 富泰华工业(深圳)有限公司 Deep learning acceleration method and user terminal
CN109978161B (en) * 2019-03-08 2022-03-04 吉林大学 Universal convolution-pooling synchronous processing convolution kernel system
CN110175670B (en) * 2019-04-09 2020-12-08 华中科技大学 Method and system for realizing YOLOv2 detection network based on FPGA
CN110413539B (en) * 2019-06-19 2021-09-14 深圳云天励飞技术有限公司 Data processing method and device
CN110399883A (en) * 2019-06-28 2019-11-01 苏州浪潮智能科技有限公司 Image characteristic extracting method, device, equipment and computer readable storage medium
CN110516800B (en) * 2019-07-08 2022-03-04 山东师范大学 Deep learning network application distributed self-assembly instruction processor core, processor, circuit and processing method
CN110390392B (en) * 2019-08-01 2021-02-19 上海安路信息科技有限公司 Convolution parameter accelerating device based on FPGA and data reading and writing method
CN114365148A (en) * 2019-10-22 2022-04-15 深圳鲲云信息科技有限公司 Neural network operation system and method

Citations (6)

Publication number Priority date Publication date Assignee Title
CN102118289A (en) * 2010-12-02 2011-07-06 西北工业大学 Real-time image segmentation processing system and high-speed intelligent unified bus interface method based on Institute of Electrical and Electronic Engineers (IEEE) 1394 interface
CN102420931A (en) * 2011-07-26 2012-04-18 西安费斯达自动化工程有限公司 Full-frame-rate image processing method based on FPGA (Field Programmable Gate Array)
CN106228240A (en) * 2016-07-30 2016-12-14 复旦大学 Degree of depth convolutional neural networks implementation method based on FPGA
CN106355244A (en) * 2016-08-30 2017-01-25 深圳市诺比邻科技有限公司 CNN (convolutional neural network) construction method and system
CN106611216A (en) * 2016-12-29 2017-05-03 北京旷视科技有限公司 Computing method and device based on neural network
CN107103113A (en) * 2017-03-23 2017-08-29 中国科学院计算技术研究所 Towards the Automation Design method, device and the optimization method of neural network processor

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US10572824B2 (en) * 2003-05-23 2020-02-25 Ip Reservoir, Llc System and method for low latency multi-functional pipeline with correlation logic and selectively activated/deactivated pipelined data processing engines
EP4235646A3 (en) * 2016-03-23 2023-09-06 Google LLC Adaptive audio enhancement for multichannel speech recognition

Non-Patent Citations (3)

Title
Musical Notes Classification with Neuromorphic Auditory System Using FPGA and a Convolutional Spiking Network; E. Cerezuela-Escudero et al.; 2015 International Joint Conference on Neural Networks (IJCNN); 2015-10-01; 1-7 *
Research on Parallel Architecture of Convolutional Neural Networks Based on FPGA; Lu Zhijian; China Doctoral Dissertations Full-text Database, Information Science and Technology; 2014-04-15 (No. 4); I140-12 *
Research on Deep Learning and Its Application in Medical Image Analysis; Wang Yuanyuan et al.; Video Engineering; 2016-10-17; Vol. 40, No. 10; 118-126 *


Similar Documents

Publication Publication Date Title
CN108154229B (en) Image processing method based on FPGA (field programmable Gate array) accelerated convolutional neural network framework
CN108805266B (en) Reconfigurable CNN high-concurrency convolution accelerator
CN112214726B (en) Operation accelerator
CN109409511B (en) Convolution operation data flow scheduling method for dynamic reconfigurable array
CN110210610B (en) Convolution calculation accelerator, convolution calculation method and convolution calculation device
CN111667051A (en) Neural network accelerator suitable for edge equipment and neural network acceleration calculation method
CN108229671B (en) System and method for reducing storage bandwidth requirement of external data of accelerator
CN109063825A (en) Convolutional neural networks accelerator
CN110543939B (en) Hardware acceleration realization device for convolutional neural network backward training based on FPGA
WO2022037257A1 (en) Convolution calculation engine, artificial intelligence chip, and data processing method
CN108170640B (en) Neural network operation device and operation method using same
US20220083857A1 (en) Convolutional neural network operation method and device
CN112668708B (en) Convolution operation device for improving data utilization rate
CN109146065B (en) Convolution operation method and device for two-dimensional data
CN110555516A (en) FPGA-based YOLOv2-tiny neural network low-delay hardware accelerator implementation method
CN113792621B (en) FPGA-based target detection accelerator design method
CN112836813A (en) Reconfigurable pulsation array system for mixed precision neural network calculation
CN111768458A (en) Sparse image processing method based on convolutional neural network
CN113033794A (en) Lightweight neural network hardware accelerator based on deep separable convolution
CN111340198A (en) Neural network accelerator with highly-multiplexed data based on FPGA (field programmable Gate array)
CN116720549A (en) FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache
US20200293863A1 (en) System and method for efficient utilization of multipliers in neural-network computations
CN107783935B (en) Approximate calculation reconfigurable array based on dynamic precision configurable operation
CN116888591A (en) Matrix multiplier, matrix calculation method and related equipment
CN116167425B (en) Neural network acceleration method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant