CN114154621A - Convolutional neural network image processing method and device based on FPGA - Google Patents


Info

Publication number
CN114154621A
CN114154621A (application CN202111449434.6A)
Authority
CN
China
Prior art keywords
convolution
fpga
convolutional
neural network
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111449434.6A
Other languages
Chinese (zh)
Inventor
李世星
安向京
渠军
孟德远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha Xingshen Intelligent Technology Co Ltd
Original Assignee
Changsha Xingshen Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha Xingshen Intelligent Technology Co Ltd filed Critical Changsha Xingshen Intelligent Technology Co Ltd
Priority to CN202111449434.6A priority Critical patent/CN114154621A/en
Publication of CN114154621A publication Critical patent/CN114154621A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/57Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • G06F7/575Basic arithmetic logic units, i.e. devices selectable to perform either addition, subtraction or one of several logical operations, using, at least partially, the same circuitry
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/0007Image acquisition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/60Memory management

Abstract

The invention discloses an FPGA-based convolutional neural network image processing method and device. The method comprises the following steps: acquiring a convolutional layer configuration file and initializing a convolutional neural network; the FPGA acquires a target image to be detected and inputs it to the initialized convolutional neural network, and a pre-constructed convolution logic kernel performs the convolution operation on the input feature map to obtain an output feature map, the convolution logic kernel being obtained by decomposing the CNN convolutional layer into a number of logical operations and addition operations; during the convolution calculation, the input feature map and the output feature map are stored respectively in two groups of designated on-chip storage spaces. The invention has the advantages of a simple implementation method, high processing efficiency and precision, and the ability to achieve a high frame rate and low latency.

Description

Convolutional neural network image processing method and device based on FPGA
Technical Field
The invention relates to the technical field of image processing, in particular to a convolutional neural network image processing method and device based on an FPGA (field programmable gate array).
Background
A CNN (Convolutional Neural Network) is a kind of feedforward neural network containing convolution calculations and having a deep structure, and is one of the representative algorithms of deep learning. At the hardware level, current CNN deployment is mainly realized on CPUs, GPUs, ASICs and FPGAs, each with its own advantages and disadvantages. Compared with a GPU, although the FPGA has poorer flexibility and portability and a higher development difficulty, it has the advantages of low power consumption and high speed, together with a certain cost advantage, so it is suitable for embedded terminal deployment once a certain volume is reached. Compared with an ASIC, the FPGA offers a larger design space thanks to its increased gate resources and memory bandwidth; it also avoids the tape-out process required by an ASIC scheme, so the development cycle is short and the development cost is low.
When deploying a CNN on an FPGA, the CNN network needs to be decomposed into a structure suitable for FPGA implementation. The computation units of an FPGA are DSPs, multiplier-adders and LUTs (logic lookup tables), and each operation of the CNN needs to be mapped 1:1 onto the corresponding operation logic of these computation units; on-chip resources at the FPGA end are reused, and the units required for data transfer and data calculation are integrated to form a hardware operation layer.
An FPGA (field programmable gate array) is characterized by high parallelism and low power consumption. When image processing with a CNN is realized on an FPGA, the convolution operation is currently usually performed with 32-bit floating-point numbers or 8-bit fixed-point numbers, and the input and output feature maps are both stored in an external memory (e.g. DDR), which leads to the following problems:
1. With essentially the same system performance, floating-point or fixed-point convolution relies on a large number of multiplications and additions, so realizing the quantized CNN network in the FPGA involves a huge amount of computation; the powerful logical operation capability of the FPGA cannot be fully utilized, resulting in a large calculation load, low CNN implementation efficiency and poor actual processing efficiency.
2. Current convolutional neural networks have many layers; for example, the classic network structure VGG16 has 16 convolutional and fully connected layers. When the input and output feature maps are stored in memory external to the FPGA, calculating one frame of image for VGG16 requires reading the feature map from the external memory 16 times; each read of the input feature map takes a lot of time, which also reduces the processing efficiency.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the technical problems in the prior art, the invention provides an FPGA-based convolutional neural network image processing method and device which have a simple implementation method, high processing efficiency and precision, and can achieve a high frame rate and low latency.
In order to solve the technical problems, the technical scheme provided by the invention is as follows:
a convolution neural network image processing method based on FPGA includes the following steps:
s1, acquiring a convolutional layer configuration file and weight parameters, and initializing a convolutional neural network;
and S2, the FPGA acquires a target image to be detected and inputs it to the initialized convolutional neural network; the convolution operation is performed on the input feature map by a pre-constructed convolution logic kernel to obtain an output feature map, the convolution logic kernel being obtained by decomposing the CNN convolutional layer into a number of logical operations and addition operations; during the convolution calculation, the input feature map and the output feature map are stored respectively in two groups of designated on-chip storage spaces.
Further, in step S2, during the convolution calculation, when one convolution layer is calculated, the input feature map is read from the first group of on-chip storage spaces and the convolution result is written into the second group; when the next convolution layer is calculated, data is read from the second group of on-chip storage spaces as the current input feature map, and the convolution result is written into the first group.
Further, in step S2, the convolution calculation further includes a step of dividing the input feature map into a plurality of groups according to the size of the storage space in the slice.
Further, in step S2, after the input feature map is input to the initialized convolutional neural network, convolution calculation, activation function and pooling calculation are performed in sequence to obtain a prediction result; the prediction result is post-processed to obtain a target recognition result, which is transmitted through a soft-core processor; and the steps of inputting the feature map, computing the convolutional neural network and post-processing are executed in parallel.
Further, in step S2, the logical operation and the addition operation are respectively implemented in the FPGA by using a plurality of LUTs, so as to implement the N × N convolution logic kernel, where the N × N convolution logic kernel is obtained by splitting the quantization activation values in the CNN convolution layer according to a preset quantization bit number, and splitting the split values into a combination of a plurality of exclusive nor operations and a plurality of addition operations.
Further, in step S1, the training server is used to train the convolutional layer configuration file and the weight parameter in the convolutional neural network, and transmit the convolutional layer configuration file and the weight parameter to the soft core processor, and store the convolutional layer configuration file and the weight parameter in the DDR, the FPGA acquires the convolutional layer configuration file and the weight parameter required for initialization from the DDR, and the identification result output by the FPGA is transmitted to the soft core processor through the DDR.
The convolution neural network image processing device based on the FPGA comprises the FPGA, wherein the FPGA is provided with:
the input module is used for acquiring a convolutional layer configuration file and weight parameters to initialize the convolutional neural network and acquiring a target image to be detected;
the CNN calculation module is used for inputting the target image to be detected to the initialized convolutional neural network and realizing the N × N convolution operation on the input feature map with a pre-constructed N × N convolution logic kernel to obtain an output feature map, wherein the N × N convolution logic kernel is obtained by decomposing the CNN convolutional layer into a combination of logical operations and addition operations;
and the two groups of on-chip storage spaces are used for respectively storing the input characteristic diagram and the output characteristic diagram in the convolution calculation process.
The device further comprises a training server and a DDR connected with the FPGA. The training server is used for training the convolutional layer configuration file and weight parameters of the convolutional neural network and transmitting them to the soft-core processor, which stores them in the DDR; the output results of the CNN calculation module are transmitted to the soft-core processor through the DDR.
Further, the CNN computation module includes a quantization convolution kernel unit for implementing the N × N convolution logic kernel, the quantization convolution kernel unit including:
a first LUT unit including a plurality of LUTs for calculating the logical operations in converting the CNN convolutional layer into the N × N convolution logic kernel, the N × N convolution logic kernel being obtained by decomposing the CNN convolutional layer into a combination of logical operations and addition operations;
a second LUT unit comprising a plurality of LUTs for computing an addition operation in converting the CNN convolutional layers to N x N convolutional logic kernels;
and the adder unit is used for summing the data output by the first LUT unit and the second LUT unit to obtain a final result.
Further, the first LUT unit includes nine LUT62 units, each LUT62 being composed of two LUT6 primitives (the LUT6 is the minimum programmable unit of the FPGA); the second LUT unit includes one LUT64, composed of four LUT6 primitives. The output of the first LUT unit is further provided with a bit-splicing circuit, and the output of the second LUT unit is further provided with a trailing-zero circuit; the bit-splicing circuit combines the bit outputs of the LUT62 units into one word, and the trailing-zero circuit appends a 0 bit to the end of the data.
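The two helper circuits can be illustrated with a small software model (an illustrative Python sketch, not FPGA code; the function names are hypothetical): bit splicing concatenates the 2-bit fields into one word, and appending a trailing 0 is equivalent to a left shift, i.e. multiplication by two.

```python
# Illustrative sketch (not from the patent) of the bit-splicing circuit
# and the trailing-zero circuit described above.

def bit_splice(two_bit_values):
    """Concatenate 2-bit fields into one word, first value in the high bits."""
    word = 0
    for v in two_bit_values:
        word = (word << 2) | (v & 0b11)  # append each 2-bit field
    return word

def append_zero(value):
    """Append a 0 bit at the end of the data (equivalent to value * 2)."""
    return value << 1

assert bit_splice([0b10, 0b01, 0b11]) == 0b100111
assert append_zero(0b101) == 0b1010
```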
Compared with the prior art, the invention has the advantages that:
1. The invention decomposes the CNN convolutional layer into logical operations and addition operations and converts it into a convolution logic kernel comprising only these operations; using this kernel for the convolution of the input feature map inside the FPGA allows the FPGA's powerful logical operation capability to be exploited fully. At the same time, during the convolution calculation the input and output feature maps are stored respectively in two groups of on-chip storage spaces, which greatly reduces the number of external reads of the input feature map, effectively improves the processing efficiency, and realizes image processing with a high frame rate and low latency.
2. By partitioning the input feature map, the invention further enables the on-chip storage space to hold the input and output feature maps, making it suitable for input feature maps of various sizes and channel counts, and facilitating the deployment of complex convolutional neural networks on the FPGA while maintaining the recognition accuracy of the deep-learning neural network.
3. The invention further adopts INT3 fixed-point quantization for the input and output feature maps and INT1 fixed-point quantization for the weights, which effectively reduces FPGA resource usage, completes more convolution calculations with fewer resources, and allows more convolution kernels to be deployed under limited resources, thereby effectively improving the frame rate and enhancing the smoothness and real-time performance of processing.
Drawings
Fig. 1 is a schematic flow chart of an implementation of the convolutional neural network image processing method based on the FPGA of the present embodiment.
Fig. 2 is a schematic diagram of the structural principle of the present embodiment for implementing convolutional neural network image processing based on an FPGA.
Fig. 3 is a schematic flow chart of the target detection based on CNN in this embodiment.
Fig. 4 is a schematic structural diagram of an FPGA implementing convolution calculation in an embodiment (3 × 3 convolution) of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and specific preferred embodiments of the description, without thereby limiting the scope of protection of the invention.
Network model quantization of a CNN mainly comprises two parts, weight quantization and activation value quantization; when both the weights and the activation values are quantized to 8 bits, the performance is equivalent to that of 32 bits. The basic operation in the neural network is the convolution multiply-add of weights and activation values. If one of the two operands is quantized to {-1, 1}, the multiply-add can be simplified into addition and subtraction; if both are quantized to {-1, 1}, the multiply-add reduces to bitwise operations. By reducing CNN quantization to addition, subtraction and bitwise operations, the quantized network can be conveniently realized in hardware. Meanwhile, if the input and output feature maps are stored directly in on-chip memory during convolution, the input feature map can be read directly without frequent external read operations, saving a large amount of read time.
Based on these considerations, the invention decomposes the CNN convolutional layer into logical operations and addition operations and converts it into a convolution logic kernel comprising only these operations, which is used inside the FPGA to perform the convolution of the input feature map. Because only logical and addition operations are needed, the powerful logical operation capability of the FPGA can be exploited fully; at the same time, the input and output feature maps are stored respectively in two groups of on-chip storage spaces during the convolution calculation, which greatly reduces external reads of the input feature map, effectively improves processing efficiency, and realizes image processing with a high frame rate and low latency.
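The simplification behind this approach can be shown with a minimal Python sketch (illustrative only, not part of the patent): when the weights are constrained to ±1 the multiply-accumulate degenerates to signed addition, and when both operands are ±1 the dot product reduces to an XNOR match count.

```python
# Illustrative sketch: how quantization collapses multiply-accumulate
# into cheaper operations.

def mac_full(ws, xs):
    """General multiply-accumulate."""
    return sum(w * x for w, x in zip(ws, xs))

def mac_binary_weight(ws, xs):
    """Weights quantized to {-1, +1}: multiplication becomes add/subtract."""
    return sum(x if w > 0 else -x for w, x in zip(ws, xs))

def mac_binary_both(ws, xs):
    """Both operands in {-1, +1}: the dot product reduces to a bitwise
    XNOR plus a population count: dot = 2 * matches - n."""
    n = len(ws)
    matches = sum(1 for w, x in zip(ws, xs) if w == x)  # XNOR popcount
    return 2 * matches - n

ws = [1, -1, 1, 1]
xs = [3, 5, -2, 7]
assert mac_binary_weight(ws, xs) == mac_full(ws, xs)

bw = [1, -1, -1, 1]
bx = [-1, -1, 1, 1]
assert mac_binary_both(bw, bx) == mac_full(bw, bx)
```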
As shown in fig. 1, the method for processing the convolutional neural network image based on the FPGA of the present embodiment includes the steps of:
s1, the FPGA acquires a convolutional layer configuration file and weight parameters, and initializes a convolutional neural network;
and S2, the FPGA acquires a target image to be detected and inputs it to the initialized convolutional neural network; the convolution operation is performed on the input feature map by a pre-constructed convolution logic kernel to obtain an output feature map, the convolution logic kernel being obtained by decomposing the CNN convolutional layer into a number of logical operations and addition operations; during the convolution calculation, the input feature map and the output feature map are stored respectively in two groups of designated on-chip storage spaces.
As shown in fig. 2, in step S1 of this embodiment, a training server is used to train the convolutional layer configuration file and weight parameters of the convolutional neural network and transmit them to the soft-core processor, which stores them in the DDR; the FPGA then acquires from the DDR the convolutional layer configuration file and weight parameters required for initialization. The soft-core processor updates the convolutional layer configuration file and weight parameters online: when a newly trained configuration file and weight parameters are available, the FPGA acquires them from the DDR for initialization; otherwise the historical convolutional layer configuration file and weight parameters stored in the DDR are used. The recognition result output by the FPGA is transmitted through the DDR to the soft-core processor, which sends it to peripheral equipment for further decision processing. The soft-core processor may specifically be a CPU or an FPGA soft core. Through the soft-core processor, the CNN model and training parameters support remote updating.
In step S2, during the convolution calculation, when one convolution layer is calculated, the input feature map is read from the first group of on-chip storage spaces and the convolution result is written into the second group; when the next convolution layer is calculated, data is read from the second group of on-chip storage spaces as the current input feature map, and the convolution result is written into the first group.
The on-chip storage space may specifically be URAM or BLOCK RAM; that is, two groups of URAM/BLOCK RAM are divided on-chip in advance, and the input and output feature maps are stored in these two groups. When the first convolution layer is calculated, the first group of URAM/BLOCK RAM is read to obtain the input feature map and the convolution result is written into the second group; when the second convolution layer is calculated, the second group is read to obtain the input feature map and the convolution result is written into the first group; the two groups are polled in turn in this way. It can be understood that the division and type of the on-chip storage space can also be chosen according to actual requirements.
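This ping-pong scheme can be modeled with a short Python sketch (illustrative software only, not FPGA code): two buffer banks alternate roles each layer, so the feature map never leaves the chip between convolution layers.

```python
# Illustrative model of the ping-pong buffering between two on-chip
# URAM/BLOCK RAM groups: each layer reads one bank and writes the other.

def run_layers(input_fmap, layers):
    """layers: list of callables, each mapping a feature map to the next."""
    bank = [input_fmap, None]   # stands in for the two URAM/BRAM groups
    src = 0
    for layer in layers:
        dst = 1 - src                 # poll between the two groups
        bank[dst] = layer(bank[src])  # read group src, write group dst
        src = dst                     # next layer reads what was just written
    return bank[src]

# Toy "layers" standing in for convolution stages
result = run_layers(2, [lambda x: x + 1, lambda x: x * 3, lambda x: x - 4])
assert result == 5   # ((2 + 1) * 3) - 4
```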
The resources of the on-chip storage space (such as URAM/BLOCK RAM) inside the FPGA are limited, while the input feature map may be large or have many channels, so it may be difficult to store the input feature map directly in the FPGA's memory space (URAM/BLOCK RAM). In step S2, the convolution calculation therefore further includes a step of dividing the input feature map into several groups according to the size of the on-chip storage space; that is, the input feature map is divided into N equal parts according to the size of the URAM/BLOCK RAM. For example, when the size of the input feature map is 32 × 512 × 1024 and N is 4, each part is 32 × 128 × 1024. By partitioning the input feature map, the input and output feature maps can be stored in the on-chip storage space, and input feature maps of various sizes and channel counts can be processed.
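The tiling step can be sketched as follows (illustrative Python, mirroring the 32 × 512 × 1024, N = 4 example from the text; the function name is hypothetical):

```python
# Illustrative sketch: splitting a C x H x W input feature map into N
# equal tiles along the height so each tile fits in on-chip memory.

def tile_fmap(shape, n):
    c, h, w = shape
    assert h % n == 0, "height must divide evenly into N tiles"
    return [(c, h // n, w) for _ in range(n)]

tiles = tile_fmap((32, 512, 1024), 4)
assert tiles == [(32, 128, 1024)] * 4   # each tile is 32 x 128 x 1024
```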
In step S2, after the input feature map is input to the initialized convolutional neural network, convolution calculation, activation function and pooling calculation are performed in sequence to obtain a prediction result; the prediction result is post-processed to obtain a target recognition result, which is transmitted through the soft-core processor. As shown in fig. 2 and 3, when this embodiment is applied to image target recognition, after the image to be detected is obtained and preprocessing such as image scaling is performed, the convolutional neural network completes feature extraction, bounding box regression and object class prediction: features are extracted by the trunk CNN network to obtain the position and class prediction data of the bounding boxes; calculation proceeds in sequence through the convolutional, activation and pooling layers of the network; the fully connected layer combines the results of all channels to obtain the final feature extraction result, on which the prediction is based; and the input and output feature maps in the calculation process are stored respectively in the two groups of on-chip URAM/BLOCK RAM. The prediction information obtained by the convolutional neural network is then post-processed to obtain the target recognition result, which is transmitted through the DDR to the CPU/FPGA soft-core processor and further sent to peripheral equipment for decision making; the peripheral equipment can also be configured to display the detection result on the original image.
The convolutional neural network has a feedforward hierarchical structure in which data is transmitted layer by layer from front to back, so data dependence exists between different network layers and little parallelism can be exploited there; the standard convolution operation within one convolutional layer, however, is a multidimensional operation whose parallelism can be summarized at three levels: between input feature maps, within an input feature map, and within a convolution window. In step S2 of this embodiment, the input of the input feature map, the calculation of the convolutional neural network and the post-processing are executed in parallel, whereby the calculation speed can be further increased and the processing efficiency ensured.
As shown in fig. 2 and 3, to implement the method, the embodiment of the invention builds a set of devices from a training server, a CPU/FPGA soft core, a DDR, an FPGA, a camera and peripheral equipment. The training server serves as the deep-learning training server and is mainly used for parameter adjustment, pruning, floating-point to fixed-point conversion and recognition accuracy evaluation. The CPU/FPGA soft core loads the convolution parameters and convolutional layer configuration file; the FPGA receives and processes the video stream from the camera and writes the final CNN result into the DDR; the CPU/FPGA soft core provides the final recognition result to the peripheral equipment. The CPU/FPGA soft core may be an ARM processor, a MicroBlaze processor, or the like, and is mainly used for updating the newly trained CNN convolutional layer configuration file and weight parameters from the training server and sending the CNN recognition result to the peripheral equipment. The peripheral equipment is decision-making equipment; for example, in license plate access control recognition, whether the vehicle belongs to the park is judged from the recognition result in order to decide the action of the access gate. The DDR realizes the interaction between the CPU/FPGA soft core and the FPGA and stores the convolutional layer configuration file, the weight parameters and the recognition result. The FPGA's main functions are receiving the camera video stream, scaling it, and performing convolution calculation, activation function, pooling and post-processing. The camera collects images in real time and serves as the input of the whole system.
In a specific application embodiment, the detailed steps of performing image processing by using the above device and the above method of the present invention are as follows:
step 1: system initialization
After the system is powered on, the CPU/FPGA soft core communicates with the training server. If the training server has newly trained weight parameters or a convolutional layer configuration file, the new weights and configuration file are transmitted to the CPU/FPGA soft core through the network port and written into the DDR; if not, the stored historical weight parameters and convolutional layer configuration file are written into the DDR.
Step 2: CNN calculation
After writing the convolutional layer configuration file into the BRAM/URAM, the FPGA receives the video stream from the camera, performs image scaling in pipeline mode and writes the result into the BRAM/URAM; it then executes the convolution calculation, activation function and pooling calculation, writing the obtained results into the BRAM/URAM; finally it performs the post-processing calculation and writes the result into the DDR. These steps are processed in parallel to improve the computing speed.
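The parallel, pipelined flow of these steps can be modeled in software with queues between stages (an illustrative Python sketch; the stage functions are placeholders, not the actual scaling/CNN/post-processing logic): while one frame is being post-processed, the next is in the CNN and a third is being scaled.

```python
# Illustrative model of the three-stage pipeline: scaling -> CNN ->
# post-processing, running concurrently on successive frames.
import queue
import threading

def stage(fn, q_in, q_out):
    while True:
        item = q_in.get()
        if item is None:          # poison pill terminates the stage
            q_out.put(None)
            return
        q_out.put(fn(item))

q0, q1, q2, q3 = (queue.Queue() for _ in range(4))
scale = lambda f: f * 2           # stands in for image scaling
cnn = lambda f: f + 1             # stands in for conv/activation/pooling
post = lambda f: -f               # stands in for post-processing

threads = [threading.Thread(target=stage, args=a)
           for a in ((scale, q0, q1), (cnn, q1, q2), (post, q2, q3))]
for t in threads:
    t.start()
for frame in (1, 2, 3):
    q0.put(frame)
q0.put(None)

results = []
while (r := q3.get()) is not None:
    results.append(r)
for t in threads:
    t.join()
assert results == [-3, -5, -7]    # -(2f + 1) for frames 1, 2, 3
```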
And step 3: result transmission
And reading the post-processing result from the DDR by the CPU/FPGA soft core, and transmitting the final classification result to peripheral equipment through a network cable.
In step S2 of this embodiment, the FPGA uses multiple LUTs to implement the logical operations and the addition operations respectively, so as to realize an N × N convolution logic kernel; the N × N convolution logic kernel is obtained by splitting the quantized activation values in the CNN convolutional layer according to a preset quantization bit width and converting the split values into a combination of several exclusive-NOR (XNOR) operations and several addition operations.
Taking N = 3 and a quantization bit width of 3 as an example, the quantized convolution formula is converted into a 3×3 convolution logic kernel as follows:

    f₃ₓ₃(wᵢ, xᵢ) = Σ_c Σᵢ₌₁⁹ wᵢ·xᵢ

In the above equation, w is the quantized convolution parameter, taking values in {-1, 1}, and c is the number of input channels; w′ is the transformed quantized convolution parameter, taking values in {0, 1} and mapped one-to-one from w; x is the 3-bit quantized activation value, taking values in {0, 1, …, 6, 7}; x₂ is bit 2 of the quantized activation value, x₁ is bit 1, and x₀ is bit 0, each taking values in {0, 1}.

First, each quantized activation value xᵢ is split bit by bit:

    xᵢ = 4·xᵢ,₂ + 2·xᵢ,₁ + xᵢ,₀

Since wᵢ takes values in {-1, 1} and each bit xᵢ,ⱼ takes values in {0, 1}, the product wᵢ·xᵢ,ⱼ can take three values {-1, 0, 1} and cannot be represented with 1 bit. The product is therefore transformed into

    wᵢ·xᵢ,ⱼ = (wᵢ·(2xᵢ,ⱼ − 1) + wᵢ) / 2

The transformed multiplication result wᵢ·(2xᵢ,ⱼ − 1) takes the two values {-1, 1}; because {-1, 1} still cannot be expressed with 1 bit, wᵢ·(2xᵢ,ⱼ − 1) is replaced by the equivalent expression

    wᵢ·(2xᵢ,ⱼ − 1) = 2·(w′ᵢ ⊙ xᵢ,ⱼ) − 1

where ⊙ is the exclusive-NOR (XNOR) operator and w′ᵢ = (wᵢ + 1)/2. Each product then satisfies wᵢ·xᵢ,ⱼ = (w′ᵢ ⊙ xᵢ,ⱼ) + (wᵢ − 1)/2, and the 3×3 convolution logic kernel obtained by the conversion is

    f₃ₓ₃(w′ᵢ, xᵢ) = Σ_c [ 4·Σᵢ₌₁⁹ (w′ᵢ ⊙ xᵢ,₂) + 2·Σᵢ₌₁⁹ (w′ᵢ ⊙ xᵢ,₁) + Σᵢ₌₁⁹ (w′ᵢ ⊙ xᵢ,₀) ] + β_c

Since wᵢ and c are both known,

    β_c = 7·Σ_c Σᵢ₌₁⁹ (wᵢ − 1)/2

i.e. β_c is a constant term. The 3×3 convolution logic kernel above is obtained by decomposition based on the 3-bit quantization width; each part needs only one bit to represent, so the 1-bit-weight/3-bit-activation (1bW3bA) 3×3 convolution kernel logic can be realized with the LUTs (lookup tables) and adders in the FPGA.
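The decomposition derived above can be checked numerically. The following sketch (an illustration under our own naming, not the patent's implementation) compares the direct quantized 3×3 convolution against its XNOR-plus-adder form for 1-bit weights and 3-bit activations:

```python
import random

def xnor(a, b):
    # XNOR of two 1-bit values: 1 when they are equal, 0 otherwise
    return 1 - (a ^ b)

def direct_conv(w, x):
    # Reference: w_i in {-1, 1}, x_i in 0..7 (3-bit activations)
    return sum(wi * xi for wi, xi in zip(w, x))

def lut_conv(w, x):
    # Transformed weights w' in {0, 1}: w' = (w + 1) / 2
    wp = [(wi + 1) // 2 for wi in w]
    # Split activations into bit planes 0..2
    bits = [[(xi >> j) & 1 for xi in x] for j in range(3)]
    # One XNOR-popcount branch per bit plane
    branch = [sum(xnor(wpi, b) for wpi, b in zip(wp, bits[j])) for j in range(3)]
    # Constant term beta: weights are known at compile time
    beta = 7 * sum((wi - 1) // 2 for wi in w)
    return 4 * branch[2] + 2 * branch[1] + branch[0] + beta

random.seed(0)
for _ in range(1000):
    w = [random.choice([-1, 1]) for _ in range(9)]
    x = [random.randrange(8) for _ in range(9)]
    assert direct_conv(w, x) == lut_conv(w, x)
print("XNOR decomposition matches direct convolution")
```

The per-product identity wᵢ·xᵢ,ⱼ = (w′ᵢ ⊙ xᵢ,ⱼ) + (wᵢ − 1)/2 is exactly what `lut_conv` exploits: only 1-bit XNORs, small popcounts and one compile-time constant remain.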
To implement the above 3×3 convolution logic kernel f₃ₓ₃(w′ᵢ, xᵢ), multiple LUTs may be used to calculate the three branch sums

    S₂ = Σᵢ₌₁⁹ (w′ᵢ ⊙ xᵢ,₂),  S₁ = Σᵢ₌₁⁹ (w′ᵢ ⊙ xᵢ,₁),  S₀ = Σᵢ₌₁⁹ (w′ᵢ ⊙ xᵢ,₀)

where xᵢ,₂, xᵢ,₁ and xᵢ,₀ are bits 2, 1 and 0 of the quantized activation values and w′ᵢ are the corresponding transformed quantized convolution parameters. One LUT is used for the three two-bit additions, and an adder sums the outputs of the branches to give f₃ₓ₃ = 4·S₂ + 2·S₁ + S₀ + β_c.
The LUT6 is the minimum programmable unit of the FPGA: it can implement any 6-input Boolean expression, producing one bit of output. In a specific application embodiment, the 3×3 convolution of a 3-bit feature map with 1-bit weights using LUT6 primitives is shown in fig. 4. The 3×3 convolution logic kernel f₃ₓ₃(w′ᵢ, xᵢ) is implemented with 9 LUT62s (each composed of 2 LUT6s), 1 LUT64 (composed of 4 LUT6s), several bit-operation circuits and adders. Each LUT62 computes the logical operation between one bit of the input feature map and the corresponding weight parameter; the LUT64 implements the three two-bit additions. Here a1 to a9 are the lowest bits of the nine 3-bit input feature-map values, b1 to b9 the middle bits, c1 to c9 the highest bits, and w1 to w9 the nine 1-bit weights. A LUT62 consists of 2 LUT6s that share the same inputs and differ only in their truth tables; a LUT64 consists of 4 LUT6s. A bit-splicing circuit at the output of each LUT62 combines the bit data together without occupying LUT resources; a trailing zero-padding circuit at the output of the LUT64 appends a 1-bit 0 to the end of the data without occupying FPGA resources. Finally the four numbers X1, X2, X3 and X4 are added to obtain the 3×3 convolution result.
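Since a LUT6 is essentially a stored 64-entry truth table, any 6-input Boolean function, such as one bit of the per-position XNOR logic above, can be dropped into it. The sketch below (our own illustration; `match_parity` is a hypothetical example function, not taken from the patent) models a LUT6 in software:

```python
def make_lut6(fn):
    # Precompute a 64-entry truth table for an arbitrary 6-input Boolean
    # function, mimicking how an FPGA LUT6 stores one output bit per
    # input combination.
    table = [fn(*((i >> k) & 1 for k in range(6))) & 1 for i in range(64)]
    def lut(*inputs):
        idx = sum(b << k for k, b in enumerate(inputs))
        return table[idx]
    return lut

# Example 6-input function: XNOR three activation bits with three weight
# bits and output the parity of the matches (one bit of a wider sum).
def match_parity(a0, a1, a2, w0, w1, w2):
    return (1 - (a0 ^ w0)) ^ (1 - (a1 ^ w1)) ^ (1 - (a2 ^ w2))

lut = make_lut6(match_parity)
# The table-driven LUT agrees with the function on all 64 input patterns.
assert all(
    lut(a0, a1, a2, w0, w1, w2) == match_parity(a0, a1, a2, w0, w1, w2)
    for a0 in (0, 1) for a1 in (0, 1) for a2 in (0, 1)
    for w0 in (0, 1) for w1 in (0, 1) for w2 in (0, 1)
)
```

This is why a LUT62 is simply two LUT6s with identical inputs and different truth tables: each stores one output bit of a 2-bit function of the same 6 inputs.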
In this embodiment, INT3 is used for the convolution calculations in the backbone network, while INT16 is used for the first convolutional layer and for the classification/regression convolutions. INT3 represents the eight values 0, 1, 2, 3, 4, 5, 6 and 7 with no fractional part; that is, a 3×3 convolution with 1-bit weights is performed on a 3-bit feature map. This makes full use of the characteristics of the FPGA: only 37 LUTs are needed and no DSP resources are occupied, so resource usage is effectively reduced while accuracy is maintained. By comparison, if the feature-map values a1 to a9 and the weights W1 to W9 were conventional INT8, realizing the 3×3 convolution a1×W1 + a2×W2 + … + a9×W9 would require 9 DSPs and 65 LUT6s. By fixing the input and output feature maps to INT3 and the weights to INT1, FPGA resource usage is effectively reduced, more convolution calculations can be completed with fewer resources, and more convolution kernels can be deployed under limited resources, which effectively raises the frame rate and improves the smoothness and real-time performance of processing.
Besides 3-bit quantization, other quantization bit widths (2 bits or more) may of course be used. An n-bit activation value is expressed as

    x = Σᵢ₌₀ⁿ⁻¹ 2ⁱ·xᵢ

where xᵢ denotes the i-th bit of x. The same derivation is then carried out from the quantized convolution formula, yielding a different FPGA implementation for each quantization bit width. For example, if the quantization bit width n is 4, the 3×3 convolution logic kernel f₃ₓ₃(w′ᵢ, xᵢ) becomes

    f₃ₓ₃(w′ᵢ, xᵢ) = Σ_c [ 8·Σᵢ₌₁⁹ (w′ᵢ ⊙ xᵢ,₃) + 4·Σᵢ₌₁⁹ (w′ᵢ ⊙ xᵢ,₂) + 2·Σᵢ₌₁⁹ (w′ᵢ ⊙ xᵢ,₁) + Σᵢ₌₁⁹ (w′ᵢ ⊙ xᵢ,₀) ] + β_c

with β_c = 15·Σ_c Σᵢ₌₁⁹ (wᵢ − 1)/2.
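The n-bit generalization can likewise be verified in software. In the sketch below (our own illustration, not the patent's code), the constant term scales as 2ⁿ − 1 and there is one XNOR branch per bit plane:

```python
def nbit_conv(w, x, n):
    # Generalized decomposition: n XNOR-popcount branches (one per bit
    # plane) plus a single constant term known at compile time.
    wp = [(wi + 1) // 2 for wi in w]                   # {-1,1} -> {0,1}
    beta = (2**n - 1) * sum((wi - 1) // 2 for wi in w)  # constant term
    total = beta
    for i in range(n):
        plane = [(xi >> i) & 1 for xi in x]            # bit plane i
        total += (1 << i) * sum(1 - (wpi ^ b) for wpi, b in zip(wp, plane))
    return total

# 4-bit activations (n = 4), 1-bit weights: matches the direct sum.
w = [1, -1, 1, 1, -1, -1, 1, -1, 1]
x = [13, 7, 0, 15, 9, 2, 11, 5, 8]
assert nbit_conv(w, x, 4) == sum(wi * xi for wi, xi in zip(w, x))
```

The same routine with n = 3 reproduces the 3-bit kernel above, so a single parameterized hardware generator could cover all supported bit widths.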
In addition to constructing a 3 × 3 convolution logic core as described above, other N × N convolutions can of course be constructed, the principle being the same as described above.
In this embodiment, the CNN convolutional layer is decomposed according to the quantization bit width n into a number of logical operations and addition operations, so as to construct an N×N convolution logic kernel in which every operand can be represented with 1 bit; the N×N convolution logic kernel can therefore be implemented efficiently with the LUTs and adders in the FPGA. By decomposing the CNN network into a structure suited to FPGA implementation, the computation, memory capacity and read/write bandwidth required by the CNN can be greatly reduced.
The embodiment is implemented in the Verilog language; compared with the traditional approach of using Xilinx HLS or Altera OpenCL, this effectively improves controllability while raising resource utilization.
The convolutional neural network image processing device based on the FPGA comprises the FPGA, wherein the FPGA is provided with:
the input module is used for acquiring a convolutional layer configuration file and weight parameters to initialize the convolutional neural network and acquiring a target image to be detected;
the CNN calculation module is used for inputting a target image to be detected to the initialized convolutional neural network, performing convolutional operation on the input characteristic diagram by using a pre-constructed convolutional logic core to obtain an output characteristic diagram, and the convolutional logic core is obtained by decomposing the CNN convolutional layer into a plurality of logic operations and addition operation;
and the two groups of on-chip storage spaces are used for respectively storing the input characteristic diagram and the output characteristic diagram in the convolution calculation process.
In a specific application embodiment, as shown in fig. 2, the convolutional neural network image processing device based on the FPGA includes two sets of on-chip storage spaces, i.e., two sets of BRAM/URAMs in the FPGA, an input module includes a preprocessing unit such as image scaling, a CNN calculation module includes a convolution, activation function, pooling and full-connection layer calculation unit, and the two sets of BRAM/URAMs store an input feature map and an output feature map.
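The two BRAM/URAM groups behave as a ping-pong (double) buffer: each layer reads its input feature map from one group and writes its output to the other, and the roles swap for the next layer. A minimal software analogue (our own sketch, with toy stand-in layers rather than real convolution):

```python
def run_layers(layers, input_map):
    # Two on-chip buffer groups; each layer reads one and writes the
    # other, then the roles swap -- no off-chip traffic between layers.
    buf = [input_map, None]
    src = 0
    for layer in layers:
        buf[1 - src] = layer(buf[src])
        src = 1 - src          # swap roles for the next layer
    return buf[src]

# Toy "layers": elementwise ops standing in for conv/activation/pooling.
layers = [
    lambda fmap: [v * 2 for v in fmap],      # stand-in for convolution
    lambda fmap: [max(v, 0) for v in fmap],  # stand-in for ReLU
    lambda fmap: [v + 1 for v in fmap],      # stand-in for pooling/bias
]
assert run_layers(layers, [-1, 0, 3]) == [1, 1, 7]
```

The payoff of the scheme is that intermediate feature maps never leave the chip, which is exactly what the two BRAM/URAM groups in fig. 2 provide.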
As shown in fig. 2, the convolutional neural network image processing apparatus of this embodiment further includes a training server, a soft-core processor, and a DDR connected in sequence, where the other end of the DDR is connected to the FPGA, the training server is configured to train convolutional layer configuration files and weight parameters in the convolutional neural network, transmit the convolutional layer configuration files and the weight parameters to the soft-core processor, and store the convolutional layer configuration files and the weight parameters in the DDR, and an output result of the CNN calculation module is transmitted to the soft-core processor through the DDR.
In this embodiment, the CNN calculation module includes a quantization convolution kernel unit for implementing N × N convolution logic kernels, and the quantization convolution kernel unit includes:
a first LUT unit including a plurality of LUTs for calculating the logical operations involved in converting the CNN convolutional layer into an N × N convolution logic kernel, the N × N convolution logic kernel being obtained by decomposing the CNN convolutional layer into a combination of logical operations and addition operations;
a second LUT unit comprising a plurality of LUTs for computing an addition operation in converting the CNN convolutional layers to N x N convolutional logic kernels;
and the adder unit is used for summing the branches to obtain a final result.
In a specific application embodiment, the quantization convolution kernel unit is shown in fig. 4. The first LUT unit includes 9 LUT62s; each LUT62 is composed of 2 LUT6s, the LUT6 being the minimum programmable unit of the FPGA. The second LUT unit includes 1 LUT64, composed of 4 LUT6s. A bit-splicing circuit is further provided at the output of the first LUT unit, and a trailing zero-padding circuit at the output of the second LUT unit; the bit-splicing circuit combines the bit data of each LUT62 together, and the zero-padding circuit appends a 0 to the end of the data. That is, the 3×3 convolution logic kernel f₃ₓ₃(w′ᵢ, xᵢ) is implemented with 9 LUT62s, 1 LUT64, several bit-operation circuits and an adder: each LUT62 calculates the logical operations between the bits of the input feature map (a1 to a9, b1 to b9, c1 to c9) and the weight parameters (w1 to w9); the bit data are combined by the bit-splicing circuit; the trailing zero-padding circuit appends the 1-bit 0 to the end of the data; and finally the four numbers X1, X2, X3 and X4 are added to obtain the 3×3 convolution result.
The convolutional neural network image processing apparatus of the present embodiment has the same principle as the convolutional neural network image processing method of the above embodiment, and is not described in detail herein.
The foregoing describes preferred embodiments of the invention and is not to be construed as limiting it in any way. Although the invention has been described with reference to preferred embodiments, it is not limited to them; any simple modification, equivalent change or adaptation made to the above embodiments within the technical spirit of the invention falls within the protection scope of the technical scheme of the invention.

Claims (10)

1. A convolution neural network image processing method based on FPGA is characterized by comprising the following steps:
s1, acquiring a convolutional layer configuration file and weight parameters, and initializing a convolutional neural network;
and S2, the FPGA acquires a target image to be detected and inputs the target image to the initialized convolutional neural network, convolution operation is performed on the input feature map by using a pre-constructed convolution logic core to obtain an output feature map, the convolution logic core is obtained by decomposing the CNN convolution layer into a plurality of logic operations and addition operation, and the input feature map and the output feature map are stored in two groups of specified storage spaces in the chip respectively in the convolution calculation process.
2. The method according to claim 1, wherein in step S2, in the convolution calculation process, when calculating one layer of convolution, the input feature map is read from a first group of storage spaces in the chip, the convolution result is written into a second group of storage spaces in the chip, when calculating the next layer of convolution, the data is read from the second group of storage spaces in the chip to obtain the current input feature map, and the convolution result is written into the first group of storage spaces in the chip.
3. The method for processing the image of the convolutional neural network based on FPGA of claim 1, wherein in step S2, the convolution calculation further includes a step of dividing the input feature map into a plurality of groups according to the size of the storage space in the chip.
4. The method for processing the convolutional neural network image based on FPGA of claim 1, wherein in step S2, after inputting the input feature map into the initialized convolutional neural network, sequentially performing convolution calculation, activation function and pooling calculation to obtain a prediction result; and post-processing the prediction result to obtain a target identification result, transmitting the identification result through a soft-core processor, and executing the steps of inputting the input characteristic diagram, calculating the convolutional neural network and post-processing in parallel.
5. The method according to any one of claims 1 to 4, wherein in step S2, the logical operations and the addition operations are respectively implemented in the FPGA by using a plurality of LUTs to realize the N × N convolution logic kernel, and the N × N convolution logic kernel is obtained by splitting the quantized activation values in the CNN convolutional layer according to a preset quantization bit number and converting the split values into a combination of a plurality of exclusive-NOR operations and a plurality of addition operations.
6. The FPGA-based convolutional neural network image processing method of any one of claims 1 to 4, wherein in step S1, a training server is used to train convolutional layer configuration files and weight parameters in the convolutional neural network, transmit the convolutional layer configuration files and the weight parameters to a soft-core processor, and store the convolutional layer configuration files and the weight parameters in the DDR, the FPGA acquires the convolutional layer configuration files and the weight parameters required for initialization from the DDR, and the recognition result output by the FPGA is transmitted to the soft-core processor through the DDR.
7. A convolutional neural network image processing device based on an FPGA, comprising the FPGA, characterized in that the FPGA is provided with:
the input module is used for acquiring a convolutional layer configuration file and weight parameters to initialize the convolutional neural network and acquiring a target image to be detected;
the CNN calculation module is used for inputting a target image to be detected to the initialized convolutional neural network, and realizing N x N convolution operation on the input characteristic graph by using a pre-constructed N x N convolution logic kernel to obtain an output characteristic graph, wherein the N x N convolution logic kernel is obtained by decomposing a CNN convolution layer into a plurality of logic operations and converting the logic operations and the addition operations;
and the two groups of on-chip storage spaces are used for respectively storing the input characteristic diagram and the output characteristic diagram in the convolution calculation process.
8. The convolutional neural network image processing device based on the FPGA of claim 7, further comprising a training server, a soft core processor and a DDR connected in sequence, wherein the other end of the DDR is connected to the FPGA, the training server is configured to train convolutional layer configuration files and weight parameters in the convolutional neural network, transmit the convolutional layer configuration files and the weight parameters to the soft core processor, and store the convolutional layer configuration files and the weight parameters in the DDR, and an output result of the CNN calculation module is transmitted to the soft core processor through the DDR.
9. The FPGA-based convolutional neural network image processing device of claim 7 or 8, wherein said CNN calculation module comprises a quantization convolution kernel unit for implementing said N x N convolutional logic kernel, said quantization convolution kernel unit comprising:
a first LUT unit including a plurality of LUTs for calculating the logical operations involved in converting the CNN convolutional layer into an N × N convolution logic kernel, the N × N convolution logic kernel being obtained by decomposing the CNN convolutional layer into a combination of logical operations and addition operations;
a second LUT unit comprising a plurality of LUTs for computing an addition operation in converting the CNN convolutional layers to N x N convolutional logic kernels;
and the adder unit is used for summing the data output by the first LUT unit and the second LUT unit to obtain a final result.
10. The FPGA-based convolutional neural network image processing device of claim 9, wherein said first LUT unit comprises 9 LUTs 62, said LUT62 is composed of 2 LUTs 6, said LUT6 is a minimum programmable unit of FPGA, said second LUT unit comprises 1 LUT64, said LUT64 is composed of 4 LUTs 6, an output terminal of said first LUT unit is further provided with a bit concatenation circuit, an output terminal of said second LUT unit is further provided with a last bit complement 0 circuit, bit data of each LUT62 are combined together by said bit concatenation circuit, and 0 data are spliced to an end of data by said last bit complement 0 circuit.
CN202111449434.6A 2021-11-30 2021-11-30 Convolutional neural network image processing method and device based on FPGA Pending CN114154621A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111449434.6A CN114154621A (en) 2021-11-30 2021-11-30 Convolutional neural network image processing method and device based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111449434.6A CN114154621A (en) 2021-11-30 2021-11-30 Convolutional neural network image processing method and device based on FPGA

Publications (1)

Publication Number Publication Date
CN114154621A true CN114154621A (en) 2022-03-08

Family

ID=80455468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111449434.6A Pending CN114154621A (en) 2021-11-30 2021-11-30 Convolutional neural network image processing method and device based on FPGA

Country Status (1)

Country Link
CN (1) CN114154621A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115049907A (en) * 2022-08-17 2022-09-13 四川迪晟新达类脑智能技术有限公司 FPGA-based YOLOV4 target detection network implementation method
CN115049907B (en) * 2022-08-17 2022-10-28 四川迪晟新达类脑智能技术有限公司 FPGA-based YOLOV4 target detection network implementation method

Similar Documents

Publication Publication Date Title
US10096134B2 (en) Data compaction and memory bandwidth reduction for sparse neural networks
US20230325348A1 (en) Performing concurrent operations in a processing element
CN107340993B (en) Arithmetic device and method
WO2020073211A1 (en) Operation accelerator, processing method, and related device
WO2019041251A1 (en) Chip device and related product
EP3627397A1 (en) Processing method and apparatus
EP3355247A1 (en) A method of operating neural networks, corresponding network, apparatus and computer program product
WO2019157812A1 (en) Computing device and method
Nakahara et al. High-throughput convolutional neural network on an FPGA by customized JPEG compression
CN111898733A (en) Deep separable convolutional neural network accelerator architecture
CN112508125A (en) Efficient full-integer quantization method of image detection model
CN113033794B (en) Light weight neural network hardware accelerator based on deep separable convolution
CN113595993B (en) Vehicle-mounted sensing equipment joint learning method for model structure optimization under edge calculation
CN110751265A (en) Lightweight neural network construction method and system and electronic equipment
WO2022088063A1 (en) Method and apparatus for quantizing neural network model, and method and apparatus for processing data
CN111240746A (en) Floating point data inverse quantization and quantization method and equipment
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
Niu et al. SPEC2: Spectral sparse CNN accelerator on FPGAs
CN114154621A (en) Convolutional neural network image processing method and device based on FPGA
TW202013261A (en) Arithmetic framework system and method for operating floating-to-fixed arithmetic framework
CN114978189A (en) Data coding method and related equipment
WO2021081854A1 (en) Convolution operation circuit and convolution operation method
Zhan et al. Field programmable gate array‐based all‐layer accelerator with quantization neural networks for sustainable cyber‐physical systems
US20230143985A1 (en) Data feature extraction method and related apparatus
WO2023051335A1 (en) Data encoding method, data decoding method, and data processing apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination