CN117853310A - Convolutional neural network FPGA acceleration-based image processing method and system

Info

Publication number
CN117853310A
Authority
CN
China
Prior art keywords
data
neural network
convolutional neural
fpga
calculation
Prior art date
Legal status
Pending
Application number
CN202410022558.3A
Other languages
Chinese (zh)
Inventor
殷聪
姚慧
蔡晓军
蔡文浩
毕文
庄佳添
Current Assignee
Shandong University
Original Assignee
Shandong University
Priority date
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202410022558.3A
Publication of CN117853310A
Legal status: Pending


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present disclosure provides an image processing method and system based on convolutional neural network FPGA acceleration, relating to the technical field of image processing. Initial image data are acquired and preprocessed, and the initial data of the convolutional neural network are loaded into the FPGA; accelerated computation of the initial data is then carried out on the FPGA by the convolutional neural network, and on-chip storage management of the images is realized. The parallel computation includes: building ring buffers between different convolution layers, the ring buffers corresponding to the FPGA's on-chip image storage, so that image data flowing between convolution layers do not pass through an external storage device; the convolution layers are computed in parallel along the output-channel direction of the convolutional neural network and the image data flow order of the convolution layers is changed, thereby accelerating image processing and managing the on-chip storage of the images. The present disclosure saves on-chip storage resources.

Description

Convolutional neural network FPGA acceleration-based image processing method and system
Technical Field
The disclosure relates to the technical field of image processing, in particular to an image processing method and system based on convolutional neural network FPGA acceleration.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
The convolutional neural network, as a feedforward neural network involving convolution calculations, is commonly used in fields such as image processing and recognition. In some applications with high real-time requirements, general-purpose processors based on the von Neumann architecture, namely the CPU (Central Processing Unit) and the GPU (Graphics Processing Unit), are limited by instruction decoding and shared memory and cannot process the incoming image data in time.
The FPGA, as a hardware programmable logic gate array, can directly map the required operations to hardware without intermediate instruction decoding, and therefore has low data latency. In addition, the FPGA contains a large number of on-chip memories, each of which may belong to different control logic; when reading data stored on-chip, there is no need for the shared-memory arbitration that a general-purpose processor requires. This makes the FPGA more suitable than general-purpose processors for accelerating convolutional neural networks in scenarios where real-time processing is required.
Convolution layers are the most important part of convolutional neural networks and involve a large number of multiply-add computations and memory operations. When a traditional convolutional neural network is applied to image processing, the adopted FPGA acceleration scheme mainly focuses on improving the calculation speed of the convolution layer while neglecting the influence of memory-access operations on overall performance. In the prior art, although a ping-pong pipeline is mentioned in a design method for an FPGA-based lightweight convolution accelerator, only one convolution layer is instantiated inside, and the calculation results of the convolution layer are stored in the external storage device DDR (Double Data Rate Synchronous Dynamic Random Access Memory); but accessing the DDR increases the delay of data flow between different convolution layers in the overall convolutional neural network computation, so it is difficult to improve overall performance.
Disclosure of Invention
In order to solve the above problems, the present disclosure provides an image processing method and system based on convolutional neural network FPGA acceleration, which stores the image data to be exchanged between multiple convolution layers in ring buffers, modifies the data flow order so that the pipelining technique can be applied across multiple convolution layers, and parallelizes the convolutional neural network along the data output-channel direction to achieve acceleration.
According to some embodiments, the present disclosure employs the following technical solutions:
the image processing method based on convolutional neural network FPGA acceleration comprises the following steps:
acquiring initial image data and preprocessing the initial image data;
parallel computation and the pipelining technique are applied to the convolutional neural network to realize image data processing;
wherein the parallel computation includes: performing parallel computation on the convolution layers and pooling layers along the output-channel direction of the convolutional neural network, and selecting the computation parallelism of the convolutional neural network and the management mode of the weight storage area according to the size of the convolutional neural network, so as to accelerate the computation of the convolutional neural network while making full use of the FPGA resources;
the pipelining technique is applied as follows: ring buffers are built between different layers of the convolutional neural network to manage the on-chip storage of the FPGA, so that the image data flowing between layers do not pass through an external storage device; at the same time, the ring buffers allow the data-flow technique to be applied between layers, reducing the start-up delay of each layer and accelerating the convolutional neural network.
According to some embodiments, the present disclosure further adopts the following technical solution: an image processing system based on convolutional neural network FPGA acceleration, comprising:
the initial image loading module, used for acquiring initial image data and preprocessing the initial image data;
the acceleration calculation module, used for applying parallel computation and the pipelining technique to the convolutional neural network to realize image data processing;
wherein the parallel computation includes: performing parallel computation on the convolution layers and pooling layers along the output-channel direction of the convolutional neural network, and selecting the computation parallelism of the convolutional neural network and the management mode of the weight storage area according to the size of the convolutional neural network, so as to accelerate the computation of the convolutional neural network while making full use of the FPGA resources;
the pipelining technique is applied as follows: ring buffers are built between different layers of the convolutional neural network to manage the on-chip storage of the FPGA, so that the image data flowing between layers do not pass through an external storage device; at the same time, the ring buffers allow the data-flow technique to be applied between layers, reducing the start-up delay of each layer and accelerating the convolutional neural network.
Compared with the prior art, the beneficial effects of the present disclosure are:
in order to fully exploit the FPGA's advantages of low latency and large on-chip storage and to accelerate the convolution layers, ring buffers are established to store the image data that need to be exchanged among multiple convolution layers, and the data flow order is modified so that the pipelining technique can be applied across the convolution layers; the data flowing between convolution layers do not pass through the external storage device DDR, which reduces the delay of accessing the DDR and saves on-chip storage resources. Meanwhile, parallel computation is accelerated along the data output-channel direction: for a small convolutional neural network, when resources are sufficient, full parallelism along the output channels is selected and the weight data are stored on-chip, and the pipelining technique is applied across multiple convolution layers by changing the data flow order of the convolution layers in combination with the ring buffers, which reduces the idle waiting time of the hardware;
the computation parallelism of the convolutional neural network and the management mode of the weight storage area are selected according to the size of the convolutional neural network: when facing a large convolutional neural network, the output-channel parallelism is reduced, an on-chip storage of appropriate size is added between the convolution calculation unit of the convolutional neural network and the external storage device DDR, and this on-chip storage is managed as a ring buffer, thereby achieving acceleration.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate and explain the exemplary embodiments of the disclosure and together with the description serve to explain the disclosure, and do not constitute an undue limitation on the disclosure.
FIG. 1 is a diagram of an initial data loading architecture of an embodiment of the present disclosure;
FIG. 2 is a diagram of a parallel acceleration computing architecture of an embodiment of the present disclosure, taking 4-channel parallel computing acceleration as an example;
FIG. 3 is a schematic diagram of a ring buffer management module according to an embodiment of the disclosure;
fig. 4 is a flowchart of an image processing method according to an embodiment of the present disclosure.
Detailed Description
The disclosure is further described below with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments in accordance with the present disclosure. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Example 1
An embodiment of the present disclosure provides an image processing method based on convolutional neural network FPGA acceleration, including:
acquiring initial image data and preprocessing the initial image data;
parallel computation and the pipelining technique are applied to the convolutional neural network to realize image data processing;
wherein the parallel computation includes: performing parallel computation on the convolution layers and pooling layers along the output-channel direction of the convolutional neural network, and selecting the computation parallelism of the convolutional neural network and the management mode of the weight storage area according to the size of the convolutional neural network, so as to accelerate the computation of the convolutional neural network while making full use of the FPGA resources;
the pipelining technique is applied as follows: ring buffers are built between different layers of the convolutional neural network to manage the on-chip storage of the FPGA, so that the image data flowing between layers do not pass through an external storage device; at the same time, the ring buffers allow the data-flow technique to be applied between layers, reducing the start-up delay of each layer and accelerating the convolutional neural network.
As an embodiment, the convolutional neural network is accelerated from both the computation side and the memory-access side using parallel computation, ring buffers and the pipelining technique, which provides a general design scheme for a convolution IP core used in image processing. The overall design method involves the following:
1. Data loading: a flow for loading the initial image data into the FPGA acceleration unit is given, and the initial image data are loaded according to this flow to ensure the correctness of the convolutional neural network calculation results.
specifically, initial image data is obtained, initial data FPGA of a convolutional neural network is loaded after the initial image data is preprocessed, the initial image data is arranged into data conforming to the acceleration calculation of the convolutional neural network, the initial image data is divided into two types of data which need to be subjected to parallel calculation and data which do not need to be subjected to parallel calculation, and data classification and data splicing processing are added to the initial image which does not need to be subjected to parallel calculation.
That is, given image 1 and image 2, image 1 represents the initial data loading mode when parallel computation is not needed, while image 2 adds data classification and data splicing on the basis of image 1 and describes the initial data loading mode when parallel computation is needed.
For convenience of description, the following convention is made: construct an initial grid coordinate system in which the X, Y and Z axes have integer scales, and a unit cube of length, width and height 1, anchored at its upper-left corner, represents one data point. The X axis represents the number of images, i.e., the number of channels of the convolution data; the Y axis represents the width of each image, i.e., the width of the convolution data; the Z axis represents the height of each image, i.e., the height of the convolution data. Based on this definition, the initial data loading process sequentially outputs the corresponding data points of each image in the order of width first and height second.
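For reference only, the following C++ sketch models this loading order in software; the ImageVolume structure, its flat memory layout, and the function names are illustrative assumptions and are not part of the disclosed hardware design.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Behavioral model (not part of the original disclosure) of the loading order
// defined above: within each image (one channel on the X axis), data points
// are emitted width-first (Y axis) and then height (Z axis).
struct ImageVolume {
    int channels, width, height;   // X, Y, Z extents of the coordinate grid
    std::vector<uint8_t> pixels;   // assumed layout: [channel][height][width]
    uint8_t at(int c, int h, int w) const {
        return pixels[(static_cast<std::size_t>(c) * height + h) * width + w];
    }
};

// "Image 1" case (no parallel computation): stream each channel in raster order.
void load_serial(const ImageVolume& img, const std::function<void(uint8_t)>& emit) {
    for (int c = 0; c < img.channels; ++c)
        for (int h = 0; h < img.height; ++h)      // height second
            for (int w = 0; w < img.width; ++w)   // width first (inner loop)
                emit(img.at(c, h, w));
}

// "Image 2" case (parallel computation): the corresponding points of all
// channels are classified and spliced into one wide word per (w, h) position.
void load_spliced(const ImageVolume& img,
                  const std::function<void(const std::vector<uint8_t>&)>& emit) {
    for (int h = 0; h < img.height; ++h)
        for (int w = 0; w < img.width; ++w) {
            std::vector<uint8_t> word(static_cast<std::size_t>(img.channels));
            for (int c = 0; c < img.channels; ++c)
                word[static_cast<std::size_t>(c)] = img.at(c, h, w);
            emit(word);   // one spliced data word containing all input channels
        }
}
```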
2. Convolution layer acceleration: the data delivered by the ring buffer are multiplied and accumulated with the corresponding points of the convolution kernels, so that the convolution output result is calculated and passed to the ring buffer of the next stage.
Specifically, the convolution layers are redesigned for pipelining and parallel computation. The number of convolution kernels equals the number of output channels, and the number of channels of each convolution kernel equals the number of input channels. The following convention is adopted: each convolution layer has M convolution kernels (M being a multiple of a), each convolution kernel has N channels, and K is both the width and the height of a kernel. Let m denote the index of the current convolution kernel, n the index of the current kernel channel, kh the point index along the kernel height, and kw the point index along the kernel width.
The parallel computing process is as follows: when the convolution layer calculation acceleration module reads one input datum, the weight data in all M convolution kernels must be read and multiplied with it. Under a-channel parallelism, the M weight data are read in M/a passes of a weights each; the a weights are read from on-chip storage simultaneously and multiplied with the input datum, and the output results are spliced and transmitted in a single pass, thereby realizing a-channel parallel computation.
The computation parallelism of the convolutional neural network and the management mode of the weight storage area are selected according to the size of the convolutional neural network.
Taking 4-channel parallelism as an example, the parallel computing module works as follows: after the convolution layer calculation acceleration module reads one input datum, the weight data in the M convolution kernels must be read and multiplied with it. Under 4-channel parallelism, the M weight data are read in M/4 passes of 4 weights each. The 4 weights are read out of on-chip storage simultaneously and multiplied with the input datum, and the output results are spliced and transmitted to the output result control module in a single pass, realizing 4-channel parallel computation.
Multiplying one input datum with the corresponding data of all convolution kernels requires M multiplication operations; when there is only one multiplication unit, M clock cycles are required to complete the M multiplications.
Through 4-channel parallelism, 4 multiplication units are generated on the FPGA and 4 weight data are fetched at a time to be multiplied with the input datum. M multiplications are still needed, but 4 of them are executed in one clock cycle, so performing the M multiplications requires only M/4 clock cycles. To match this computation speed, fetching the weight data of 4 convolution kernels at a time also takes one clock cycle, so fetching the corresponding weight data of all M convolution kernels takes M/4 clock cycles. Similarly, M/4 clock cycles are required to output the M calculation results by first splicing the results and then outputting them.
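A minimal C++ sketch of this output-channel-parallel multiply follows; it is a software model under assumed data widths and weight layout, not the actual hardware description.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Behavioral sketch of the a-channel-parallel multiply: for one input value,
// the weights of the M convolution kernels at the same (n, kh, kw) position
// are consumed in M/PAR groups of PAR weights, and each group yields PAR
// products in one modeled "clock cycle". PAR and the layout are assumptions.
constexpr int PAR = 4;   // output-channel parallelism 'a' (4 in the example above)

std::vector<int32_t> parallel_multiply(int16_t input,
                                       const std::vector<int16_t>& weights) {
    const std::size_t M = weights.size();   // number of convolution kernels (multiple of PAR)
    std::vector<int32_t> products(M);
    for (std::size_t g = 0; g < M / PAR; ++g) {   // M/PAR read-and-multiply cycles
        for (int p = 0; p < PAR; ++p) {           // PAR multipliers active in the same cycle
            products[g * PAR + p] =
                static_cast<int32_t>(input) * weights[g * PAR + p];
        }
        // in hardware, the PAR products of this group are spliced into one wide
        // word and passed to the output result control module in a single transfer
    }
    return products;
}
```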
The above only covers the multiplication of the input data with the convolution kernels. In the actual convolution process, all data points in the m-th convolution kernel are multiplied with the corresponding input data and then accumulated to obtain one point of output channel m. Therefore, an output result control module is also needed. This module contains an array of depth M/4 and width 4 for output data, together with M/4 counters that correspond one-to-one to the entries of the array. When a counter has incremented to the convolution kernel size, the data in the corresponding array entry are output to the ring buffer of the next stage.
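The following C++ sketch models this output result control module; the class name, the callback used to reach the next-stage ring buffer, and the generic parallelism parameter are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Behavioral sketch of the output result control module: an array of depth
// M/par whose entries each hold par partial sums, plus M/par counters. When a
// counter reaches the convolution-kernel size (N * K * K multiply-adds), the
// par accumulated results of that entry are emitted to the next-stage ring
// buffer and the entry is cleared.
class OutputResultControl {
public:
    OutputResultControl(int M, int par, int kernel_volume,
                        std::function<void(const std::vector<int32_t>&)> emit)
        : par_(par), kernel_volume_(kernel_volume), emit_(std::move(emit)),
          acc_(static_cast<std::size_t>(M / par),
               std::vector<int32_t>(static_cast<std::size_t>(par), 0)),
          count_(static_cast<std::size_t>(M / par), 0) {}

    // Accept one group of par products (group index g) from the parallel multipliers.
    void accumulate(std::size_t g, const std::vector<int32_t>& products) {
        for (int p = 0; p < par_; ++p)
            acc_[g][static_cast<std::size_t>(p)] += products[static_cast<std::size_t>(p)];
        if (++count_[g] == kernel_volume_) {       // one full kernel accumulated
            emit_(acc_[g]);                        // push par output points downstream
            std::fill(acc_[g].begin(), acc_[g].end(), 0);
            count_[g] = 0;
        }
    }

private:
    int par_, kernel_volume_;
    std::function<void(const std::vector<int32_t>&)> emit_;
    std::vector<std::vector<int32_t>> acc_;   // depth M/par, width par
    std::vector<int> count_;                  // one counter per array entry
};
```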
As an embodiment, the above ring buffer belongs to the ring buffer management module. The ring buffer serves as the storage medium between convolution layers: it receives output data from the upper convolution layer, and data are read from it according to the actual calculation requirements of the lower convolution layer, without direct communication between the convolution layers.
As shown in FIG. 3, the ring buffer management module is internally divided into write control logic and read control logic. The write control logic writes data into the ring buffer in the output order of the convolution layer. The read control logic, according to the data width and data height required by the convolution acceleration module, sequentially outputs the corresponding points of each channel; this logic organizes the data from the ring buffer and passes them to the next convolution layer.
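A minimal software model of such a ring buffer is sketched below; the capacity handling, data width, and names are assumptions for illustration rather than the disclosed FPGA implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Minimal behavioral sketch of the ring buffer management module: write
// control stores data in the order the upper convolution layer produces it,
// and read control releases data to the lower convolution layer, with the
// two layers never talking to each other directly.
class RingBuffer {
public:
    explicit RingBuffer(std::size_t capacity)
        : buf_(capacity), head_(0), tail_(0), count_(0) {}

    // Write control logic: store one value in upper-layer output order.
    bool write(int32_t v) {
        if (count_ == buf_.size()) return false;   // buffer full: upper layer must stall
        buf_[tail_] = v;
        tail_ = (tail_ + 1) % buf_.size();
        ++count_;
        return true;
    }

    // Read control logic: hand one value to the lower layer when it needs it.
    std::optional<int32_t> read() {
        if (count_ == 0) return std::nullopt;      // nothing available yet
        int32_t v = buf_[head_];
        head_ = (head_ + 1) % buf_.size();
        --count_;
        return v;
    }

    std::size_t available() const { return count_; }

private:
    std::vector<int32_t> buf_;
    std::size_t head_, tail_, count_;
};
```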
As an embodiment, the application flow of the method of the present disclosure is shown in fig. 4, and the specific steps thereof are as follows:
step 1): the initial data loading module sends configuration information through a control interface connected with the image sensor, configures a data coding mode of the image sensor into line scanning, and receives image data shot by the image sensor.
Step 2): and receiving output data of the upper layer and storing the output data into the annular buffer according to basic parameters of the upper layer and the lower layer connected by the annular buffer, and taking out the data from the annular buffer to the lower layer for calculation when the receiving amount of the output data of the upper layer is enough to support the calculation of the lower layer, so that the pipeline technology is applied between all the layers. The starting delay of each layer in the convolutional neural network is reduced, and acceleration is realized.
Step 3): and after receiving the data transmitted by the ring buffer management module, the parallel acceleration module calculates. The parallel acceleration module respectively selects the calculation parallelism of the convolutional neural network and the management mode of the weight storage area according to the different sizes of the convolutional neural network, and the convolutional neural network is parallel from the direction of the output channel.
Step 4): and (3) repeating the second step and the third step according to the structure of the currently selected convolutional neural network.
Step 5): and finally, carrying out final processing on the calculation result of the steps through the full connection layer, and outputting an image processing result.
Example 2
In one embodiment of the present disclosure, an image processing system based on convolutional neural network FPGA acceleration is provided, including:
the initial image loading module, used for acquiring initial image data, preprocessing it, and then loading it into the FPGA as the initial data of the convolutional neural network;
the acceleration calculation module comprises a parallel calculation module, an output result control module and a ring buffer management module, and is used for realizing parallel calculation of initial data on the FPGA through a convolutional neural network and realizing on-chip storage management of images;
wherein the parallel computation includes: building ring buffers between different convolution layers, the ring buffers corresponding to the on-chip storage of the FPGA, so that the image data flowing between convolution layers do not pass through an external storage device; the convolution layers are computed in parallel along the output-channel direction of the convolutional neural network, the image data flow order of the convolution layers is changed, and the computation parallelism of the convolutional neural network and the management mode of the weight storage area are selected according to the size of the convolutional neural network, thereby realizing the acceleration of image processing.
The pipelining technique is applied as follows: ring buffers are built between different layers of the convolutional neural network to manage the on-chip storage of the FPGA, so that the image data flowing between layers do not pass through an external storage device; at the same time, the ring buffers allow the data-flow technique to be applied between layers, reducing the start-up delay of each layer and accelerating the convolutional neural network.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the specific embodiments of the present disclosure have been described above with reference to the drawings, it should be understood that the present disclosure is not limited to the embodiments, and that various modifications and changes can be made by one skilled in the art without inventive effort on the basis of the technical solutions of the present disclosure while remaining within the scope of the present disclosure.

Claims (10)

1. The image processing method based on convolutional neural network FPGA acceleration is characterized by comprising the following steps:
acquiring initial image data and preprocessing the initial image data;
the convolutional neural network is subjected to parallel computation and pipeline technology application, so that image data processing is realized;
wherein the parallel computing includes: the method comprises the steps of performing parallel calculation on a convolutional layer and a pooling layer from the direction of an output channel of the convolutional neural network, respectively selecting the calculation parallelism of the convolutional neural network and the management mode of a weight storage area according to the different sizes of the convolutional neural network, and realizing the calculation acceleration of the convolutional neural network while fully utilizing FPGA resources;
the pipeline technology is applied as follows: and building annular buffer areas between different layers of the convolutional neural network for managing on-chip storage of the FPGA, so that image data between the layers flows without passing through external storage equipment, and meanwhile, the annular buffer areas can enable a data flow technology to be applied between the layers, so that the starting delay of each layer is reduced, and acceleration of the convolutional neural network is realized.
2. The method for processing the image based on the FPGA acceleration of the convolutional neural network according to claim 1, wherein initial image data is acquired, the initial image data is organized into data conforming to the parallel computation of the convolutional neural network, the initial image data is divided into two types of data requiring the parallel computation and data requiring no parallel computation, and the initial image requiring the parallel computation is added with data classification and data splicing processing on the basis of the initial image requiring no parallel computation.
3. The convolutional neural network FPGA acceleration-based image processing method of claim 1, wherein the initial image data FPGA loading comprises: constructing a grid initial coordinate system, wherein scales of X, Y and Z axes are integers, cubes with length, width and height of 1 represent one data point, and an X axis represents the number of pieces of image data, namely the number of channels of convolution data; the Y-axis represents the width of each image, i.e., the width of the convolution data; the Z-axis represents the height of each image, i.e., the height of the convolution data, and the initial data loading process sequentially outputs corresponding data points for each image in order of width first and height second.
4. The method for processing the image based on the FPGA acceleration of the convolutional neural network according to claim 1, wherein the annular buffer is built between different convolutional layers, the annular buffer is stored on a chip physically corresponding to the FPGA, and the data transmitted from the annular buffer and the corresponding points of the convolutional kernel are multiplied and added, so that the output result of the convolution is calculated and transmitted to the annular buffer of the next stage.
5. The convolutional neural network FPGA-based accelerated image processing method of claim 1, wherein the parallel computing process is as follows: when the convolution layer calculation acceleration module reads one input data, the weight data in M convolution kernels are required to be read and multiplied by the input data, under the parallel condition of a channels, the M weight data are read in M/a times, and each time the a weight data are read, the a weight data are simultaneously read from the on-chip storage and are multiplied by the input data, and the output result is spliced and transmitted once, so that parallel calculation of the a channels is realized.
6. The method for processing the image accelerated by the FPGA based on the convolutional neural network according to claim 5, wherein multiplying one input datum with the corresponding data of all convolution kernels requires M multiplication operations; when there is only one multiplication calculation unit, M clock cycles are needed to complete the M multiplication operations; through a-channel parallelism, a multiplication calculation units are generated on the FPGA and a weight data are taken out at a time to be multiplied with the input datum; M multiplication operations are still needed, but a multiplication operations are executed in one clock cycle, so executing the M multiplication operations requires only M/a clock cycles; taking out the weight data of a convolution kernels each time also takes one clock cycle, and M/a clock cycles are needed to take out the corresponding weight data of the M convolution kernels; M/a clock cycles are likewise required to output the M calculation results by first splicing the calculation results and then outputting them.
7. The method for processing the image based on the FPGA acceleration of the convolutional neural network according to claim 5, wherein the parallel computing module only provides the multiplication of the input data with the corresponding weights of the convolution kernels; in the actual convolution process, all data points in the m-th convolution kernel are multiplied with the corresponding input data and then accumulated to obtain one point of output data channel m; an output result control module is therefore constructed, which comprises an array of depth M/a and width a that can accommodate a pieces of output data, together with M/a counters corresponding one-to-one to the data in the array; when a counter has incremented to the size of one convolution kernel, the data in the corresponding array entry are output to the next-stage ring buffer.
8. The method for processing the image based on the FPGA acceleration of the convolutional neural network according to claim 1, wherein the annular buffer area is a storage medium between the convolutional layers, receives input data from an upper convolutional layer under the condition that the convolutional layers are not communicated, and reads the data from the annular buffer area according to the actual calculation requirement of a lower convolutional layer.
9. The image processing method based on the FPGA acceleration of the convolutional neural network as set forth in claim 8, wherein the ring buffer internally comprises write control logic and read control logic; the write control logic writes into the ring buffer sequentially according to the output order of the convolution layer; the read control logic sequentially outputs the corresponding points of each channel according to the data width and data height, and this logic organizes the data from the ring buffer and transmits them to the next-stage convolution layer.
10. The image processing system based on convolutional neural network FPGA acceleration is characterized by comprising:
the initial image loading module is used for acquiring initial image data and preprocessing the initial image data;
the acceleration calculation module is used for carrying out parallel calculation and pipeline technology application on the convolutional neural network so as to realize image data processing;
wherein the parallel computing includes: the method comprises the steps of performing parallel calculation on a convolutional layer and a pooling layer from the direction of an output channel of the convolutional neural network, respectively selecting the calculation parallelism of the convolutional neural network and the management mode of a weight storage area according to the different sizes of the convolutional neural network, and realizing the calculation acceleration of the convolutional neural network while fully utilizing FPGA resources;
the pipeline technology is applied as follows: and building annular buffer areas between different layers of the convolutional neural network for managing on-chip storage of the FPGA, so that image data between the layers flows without passing through external storage equipment, and meanwhile, the annular buffer areas can enable a data flow technology to be applied between the layers, so that the starting delay of each layer is reduced, and acceleration of the convolutional neural network is realized.
CN202410022558.3A 2024-01-05 2024-01-05 Convolutional neural network FPGA acceleration-based image processing method and system Pending CN117853310A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410022558.3A CN117853310A (en) 2024-01-05 2024-01-05 Convolutional neural network FPGA acceleration-based image processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410022558.3A CN117853310A (en) 2024-01-05 2024-01-05 Convolutional neural network FPGA acceleration-based image processing method and system

Publications (1)

Publication Number Publication Date
CN117853310A true CN117853310A (en) 2024-04-09

Family

ID=90534146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410022558.3A Pending CN117853310A (en) 2024-01-05 2024-01-05 Convolutional neural network FPGA acceleration-based image processing method and system

Country Status (1)

Country Link
CN (1) CN117853310A (en)

Similar Documents

Publication Publication Date Title
CN109543832B (en) Computing device and board card
CN109886400B (en) Convolution neural network hardware accelerator system based on convolution kernel splitting and calculation method thereof
CN109522052B (en) Computing device and board card
CN111897579B (en) Image data processing method, device, computer equipment and storage medium
CN104915322B (en) A kind of hardware-accelerated method of convolutional neural networks
US7447720B2 (en) Method for finding global extrema of a set of bytes distributed across an array of parallel processing elements
CN107085562B (en) Neural network processor based on efficient multiplexing data stream and design method
US11586601B2 (en) Apparatus and method for representation of a sparse matrix in a neural network
US20230026006A1 (en) Convolution computation engine, artificial intelligence chip, and data processing method
CN109284824B (en) Reconfigurable technology-based device for accelerating convolution and pooling operation
CN112799726B (en) Data processing device, method and related product
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
CN112686379B (en) Integrated circuit device, electronic apparatus, board and computing method
CN110059797B (en) Computing device and related product
CN112232517B (en) Artificial intelligence accelerates engine and artificial intelligence treater
CN110414672B (en) Convolution operation method, device and system
CN109753319B (en) Device for releasing dynamic link library and related product
CN109711540B (en) Computing device and board card
Cho et al. FARNN: FPGA-GPU hybrid acceleration platform for recurrent neural networks
CN116090518A (en) Feature map processing method and device based on systolic operation array and storage medium
CN112862079B (en) Design method of running water type convolution computing architecture and residual error network acceleration system
CN117853310A (en) Convolutional neural network FPGA acceleration-based image processing method and system
CN115293978A (en) Convolution operation circuit and method, image processing apparatus
Bai et al. An OpenCL-based FPGA accelerator with the Winograd’s minimal filtering algorithm for convolution neuron networks
CN115437602A (en) Arbitrary-precision calculation accelerator, integrated circuit device, board card and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination