CN117853310A - Convolutional neural network FPGA acceleration-based image processing method and system

Info

Publication number
CN117853310A
Authority
CN
China
Prior art keywords
data
neural network
convolutional neural
fpga
calculation
Prior art date
Legal status
Pending
Application number
CN202410022558.3A
Other languages
Chinese (zh)
Inventor
殷聪
姚慧
蔡晓军
蔡文浩
毕文
庄佳添
Current Assignee
Shandong University
Original Assignee
Shandong University
Priority date
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202410022558.3A
Publication of CN117853310A
Legal status: Pending


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present disclosure provides an image processing method and system based on convolutional neural network FPGA acceleration, relating to the technical field of image processing. Initial image data are acquired and preprocessed, and the initial data of the convolutional neural network are loaded into the FPGA; accelerated computation of the initial data is then carried out on the FPGA by the convolutional neural network, and on-chip storage management of the images is realized. The parallel computation includes: building ring buffers between different convolution layers, the ring buffers corresponding to the FPGA's on-chip image storage, so that image data flowing between convolution layers do not pass through an external storage device; the convolution layers are computed in parallel along the output-channel direction of the convolutional neural network and the image data flow order of the convolution layers is changed, thereby accelerating image processing and managing the on-chip storage of the images. The present disclosure saves on-chip storage resources.

Description

Convolutional neural network FPGA acceleration-based image processing method and system
Technical Field
The disclosure relates to the technical field of image processing, in particular to an image processing method and system based on convolutional neural network FPGA acceleration.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
The convolutional neural network, as a feedforward neural network involving convolution calculations, is commonly used in fields such as image processing and recognition. In some applications with high real-time requirements, general-purpose processors based on the von Neumann architecture, namely the CPU (Central Processing Unit) and the GPU (Graphics Processing Unit), are limited by instruction decoding and shared memory and cannot process the incoming image data in time.
The FPGA, as a hardware programmable logic gate array, can directly map the required operations to hardware without intermediate instruction decoding, and therefore has low data latency. In addition, the FPGA contains a large number of on-chip memories, each of which may belong to different control logic; when reading data stored on-chip, there is no need for the shared-memory arbitration that a general-purpose processor requires. This makes the FPGA more suitable than general-purpose processors for accelerating convolutional neural networks in scenarios where real-time processing is required.
Convolution layers are the most important part of convolutional neural networks and involve a large number of multiply-add computations and memory operations. When a traditional convolutional neural network is applied to image processing, the adopted FPGA acceleration scheme mainly focuses on improving the calculation speed of the convolution layer while neglecting the influence of memory-access operations on overall performance. In the prior art, although a ping-pong pipeline is mentioned in a design method for an FPGA-based lightweight convolution accelerator, only one convolution layer is instantiated inside, and the calculation results of the convolution layer are stored in the external storage device DDR (Double Data Rate Synchronous Dynamic Random Access Memory); but accessing the DDR increases the delay of data flow between different convolution layers in the overall convolutional neural network computation, so it is difficult to improve overall performance.
Disclosure of Invention
In order to solve the above problems, the present disclosure provides an image processing method and system based on convolutional neural network FPGA acceleration, which stores the image data to be exchanged between multiple convolution layers in ring buffers, modifies the data flow order so that the pipelining technique can be applied across multiple convolution layers, and parallelizes the convolutional neural network along the data output-channel direction to achieve acceleration.
According to some embodiments, the present disclosure employs the following technical solutions:
the image processing method based on convolutional neural network FPGA acceleration comprises the following steps:
acquiring initial image data and preprocessing the initial image data;
parallel computation and the pipelining technique are applied to the convolutional neural network to realize image data processing;
wherein the parallel computation includes: performing parallel computation on the convolution layers and pooling layers along the output-channel direction of the convolutional neural network, and selecting the computation parallelism of the convolutional neural network and the management mode of the weight storage area according to the size of the convolutional neural network, so as to accelerate the computation of the convolutional neural network while making full use of the FPGA resources;
the pipelining technique is applied as follows: ring buffers are built between different layers of the convolutional neural network to manage the on-chip storage of the FPGA, so that the image data flowing between layers do not pass through an external storage device; at the same time, the ring buffers allow the data-flow technique to be applied between layers, reducing the start-up delay of each layer and accelerating the convolutional neural network.
According to some embodiments, the present disclosure further adopts the following technical solution: an image processing system based on convolutional neural network FPGA acceleration, comprising:
the initial image loading module, used for acquiring initial image data and preprocessing the initial image data;
the acceleration calculation module, used for applying parallel computation and the pipelining technique to the convolutional neural network to realize image data processing;
wherein the parallel computation includes: performing parallel computation on the convolution layers and pooling layers along the output-channel direction of the convolutional neural network, and selecting the computation parallelism of the convolutional neural network and the management mode of the weight storage area according to the size of the convolutional neural network, so as to accelerate the computation of the convolutional neural network while making full use of the FPGA resources;
the pipelining technique is applied as follows: ring buffers are built between different layers of the convolutional neural network to manage the on-chip storage of the FPGA, so that the image data flowing between layers do not pass through an external storage device; at the same time, the ring buffers allow the data-flow technique to be applied between layers, reducing the start-up delay of each layer and accelerating the convolutional neural network.
Compared with the prior art, the beneficial effects of the present disclosure are:
in order to fully exploit the FPGA's advantages of low latency and large on-chip storage and to accelerate the convolution layers, ring buffers are established to store the image data that need to be exchanged among multiple convolution layers, and the data flow order is modified so that the pipelining technique can be applied across the convolution layers; the data flowing between convolution layers do not pass through the external storage device DDR, which reduces the delay of accessing the DDR and saves on-chip storage resources. Meanwhile, parallel computation is accelerated along the data output-channel direction: for a small convolutional neural network, when resources are sufficient, full parallelism along the output channels is selected and the weight data are stored on-chip, and the pipelining technique is applied across multiple convolution layers by changing the data flow order of the convolution layers in combination with the ring buffers, which reduces the idle waiting time of the hardware;
the computation parallelism of the convolutional neural network and the management mode of the weight storage area are selected according to the size of the convolutional neural network: when facing a large convolutional neural network, the output-channel parallelism is reduced, an on-chip storage of appropriate size is added between the convolution calculation unit of the convolutional neural network and the external storage device DDR, and this on-chip storage is managed as a ring buffer, thereby achieving acceleration.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate and explain the exemplary embodiments of the disclosure and together with the description serve to explain the disclosure, and do not constitute an undue limitation on the disclosure.
FIG. 1 is a diagram of an initial data loading architecture of an embodiment of the present disclosure;
FIG. 2 is a diagram of a parallel acceleration computing architecture of an embodiment of the present disclosure, taking 4-channel parallel computing acceleration as an example;
FIG. 3 is a schematic diagram of a ring buffer management module according to an embodiment of the disclosure;
fig. 4 is a flowchart of an image processing method according to an embodiment of the present disclosure.
Detailed Description
The disclosure is further described below with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments in accordance with the present disclosure. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Example 1
An embodiment of the present disclosure provides an image processing method based on convolutional neural network FPGA acceleration, including:
acquiring initial image data and preprocessing the initial image data;
parallel computation and the pipelining technique are applied to the convolutional neural network to realize image data processing;
wherein the parallel computation includes: performing parallel computation on the convolution layers and pooling layers along the output-channel direction of the convolutional neural network, and selecting the computation parallelism of the convolutional neural network and the management mode of the weight storage area according to the size of the convolutional neural network, so as to accelerate the computation of the convolutional neural network while making full use of the FPGA resources;
the pipelining technique is applied as follows: ring buffers are built between different layers of the convolutional neural network to manage the on-chip storage of the FPGA, so that the image data flowing between layers do not pass through an external storage device; at the same time, the ring buffers allow the data-flow technique to be applied between layers, reducing the start-up delay of each layer and accelerating the convolutional neural network.
As an embodiment, the convolutional neural network is accelerated from both the computation side and the memory-access side using parallel computation, ring buffers and the pipelining technique, which provides a general design scheme for a convolution IP core used in image processing. The overall design method involves the following:
1. Data loading: a flow for loading the initial image data into the FPGA acceleration unit is given, and the initial image data are loaded according to this flow to ensure the correctness of the convolutional neural network calculation results.
specifically, initial image data is obtained, initial data FPGA of a convolutional neural network is loaded after the initial image data is preprocessed, the initial image data is arranged into data conforming to the acceleration calculation of the convolutional neural network, the initial image data is divided into two types of data which need to be subjected to parallel calculation and data which do not need to be subjected to parallel calculation, and data classification and data splicing processing are added to the initial image which does not need to be subjected to parallel calculation.
That is, given image 1 and image 2, image 1 represents the initial data loading mode when parallel computation is not needed, while image 2 adds data classification and data splicing on the basis of image 1 and describes the initial data loading mode when parallel computation is needed.
For convenience of description, the following convention is made: construct an initial grid coordinate system in which the X, Y and Z axes have integer scales, and a unit cube of length, width and height 1, anchored at its upper-left corner, represents one data point. The X axis represents the number of images, i.e., the number of channels of the convolution data; the Y axis represents the width of each image, i.e., the width of the convolution data; the Z axis represents the height of each image, i.e., the height of the convolution data. Based on this definition, the initial data loading process sequentially outputs the corresponding data points of each image in the order of width first and height second.
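For reference only, the following C++ sketch models this loading order in software; the ImageVolume structure, its flat memory layout, and the function names are illustrative assumptions and are not part of the disclosed hardware design.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Behavioral model (not part of the original disclosure) of the loading order
// defined above: within each image (one channel on the X axis), data points
// are emitted width-first (Y axis) and then height (Z axis).
struct ImageVolume {
    int channels, width, height;   // X, Y, Z extents of the coordinate grid
    std::vector<uint8_t> pixels;   // assumed layout: [channel][height][width]
    uint8_t at(int c, int h, int w) const {
        return pixels[(static_cast<std::size_t>(c) * height + h) * width + w];
    }
};

// "Image 1" case (no parallel computation): stream each channel in raster order.
void load_serial(const ImageVolume& img, const std::function<void(uint8_t)>& emit) {
    for (int c = 0; c < img.channels; ++c)
        for (int h = 0; h < img.height; ++h)      // height second
            for (int w = 0; w < img.width; ++w)   // width first (inner loop)
                emit(img.at(c, h, w));
}

// "Image 2" case (parallel computation): the corresponding points of all
// channels are classified and spliced into one wide word per (w, h) position.
void load_spliced(const ImageVolume& img,
                  const std::function<void(const std::vector<uint8_t>&)>& emit) {
    for (int h = 0; h < img.height; ++h)
        for (int w = 0; w < img.width; ++w) {
            std::vector<uint8_t> word(static_cast<std::size_t>(img.channels));
            for (int c = 0; c < img.channels; ++c)
                word[static_cast<std::size_t>(c)] = img.at(c, h, w);
            emit(word);   // one spliced data word containing all input channels
        }
}
```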
2. Convolution layer acceleration: the data delivered by the ring buffer are multiplied and accumulated with the corresponding points of the convolution kernels, so that the convolution output result is calculated and passed to the ring buffer of the next stage.
Specifically, the convolution layers are redesigned for pipelining and parallel computation. The number of convolution kernels equals the number of output channels, and the number of channels of each convolution kernel equals the number of input channels. The following convention is adopted: each convolution layer has M convolution kernels (M being a multiple of a), each convolution kernel has N channels, and K is both the width and the height of a kernel. Let m denote the index of the current convolution kernel, n the index of the current kernel channel, kh the point index along the kernel height, and kw the point index along the kernel width.
The parallel computing process is as follows: when the convolution layer calculation acceleration module reads one input datum, the weight data in all M convolution kernels must be read and multiplied with it. Under a-channel parallelism, the M weight data are read in M/a passes of a weights each; the a weights are read from on-chip storage simultaneously and multiplied with the input datum, and the output results are spliced and transmitted in a single pass, thereby realizing a-channel parallel computation.
The computation parallelism of the convolutional neural network and the management mode of the weight storage area are selected according to the size of the convolutional neural network.
Taking 4-channel parallelism as an example, the parallel computing module works as follows: after the convolution layer calculation acceleration module reads one input datum, the weight data in the M convolution kernels must be read and multiplied with it. Under 4-channel parallelism, the M weight data are read in M/4 passes of 4 weights each. The 4 weights are read out of on-chip storage simultaneously and multiplied with the input datum, and the output results are spliced and transmitted to the output result control module in a single pass, realizing 4-channel parallel computation.
Multiplying one input datum with the corresponding data of all convolution kernels requires M multiplication operations; when there is only one multiplication unit, M clock cycles are required to complete the M multiplications.
Through 4-channel parallelism, 4 multiplication units are generated on the FPGA and 4 weight data are fetched at a time to be multiplied with the input datum. M multiplications are still needed, but 4 of them are executed in one clock cycle, so performing the M multiplications requires only M/4 clock cycles. To match this computation speed, fetching the weight data of 4 convolution kernels at a time also takes one clock cycle, so fetching the corresponding weight data of all M convolution kernels takes M/4 clock cycles. Similarly, M/4 clock cycles are required to output the M calculation results by first splicing the results and then outputting them.
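A minimal C++ sketch of this output-channel-parallel multiply follows; it is a software model under assumed data widths and weight layout, not the actual hardware description.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Behavioral sketch of the a-channel-parallel multiply: for one input value,
// the weights of the M convolution kernels at the same (n, kh, kw) position
// are consumed in M/PAR groups of PAR weights, and each group yields PAR
// products in one modeled "clock cycle". PAR and the layout are assumptions.
constexpr int PAR = 4;   // output-channel parallelism 'a' (4 in the example above)

std::vector<int32_t> parallel_multiply(int16_t input,
                                       const std::vector<int16_t>& weights) {
    const std::size_t M = weights.size();   // number of convolution kernels (multiple of PAR)
    std::vector<int32_t> products(M);
    for (std::size_t g = 0; g < M / PAR; ++g) {   // M/PAR read-and-multiply cycles
        for (int p = 0; p < PAR; ++p) {           // PAR multipliers active in the same cycle
            products[g * PAR + p] =
                static_cast<int32_t>(input) * weights[g * PAR + p];
        }
        // in hardware, the PAR products of this group are spliced into one wide
        // word and passed to the output result control module in a single transfer
    }
    return products;
}
```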
The above only covers the multiplication of the input data with the convolution kernels. In the actual convolution process, all data points in the m-th convolution kernel are multiplied with the corresponding input data and then accumulated to obtain one point of output channel m. Therefore, an output result control module is also needed. This module contains an array of depth M/4 and width 4 for output data, together with M/4 counters that correspond one-to-one to the entries of the array. When a counter has incremented to the convolution kernel size, the data in the corresponding array entry are output to the ring buffer of the next stage.
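The following C++ sketch models this output result control module; the class name, the callback used to reach the next-stage ring buffer, and the generic parallelism parameter are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Behavioral sketch of the output result control module: an array of depth
// M/par whose entries each hold par partial sums, plus M/par counters. When a
// counter reaches the convolution-kernel size (N * K * K multiply-adds), the
// par accumulated results of that entry are emitted to the next-stage ring
// buffer and the entry is cleared.
class OutputResultControl {
public:
    OutputResultControl(int M, int par, int kernel_volume,
                        std::function<void(const std::vector<int32_t>&)> emit)
        : par_(par), kernel_volume_(kernel_volume), emit_(std::move(emit)),
          acc_(static_cast<std::size_t>(M / par),
               std::vector<int32_t>(static_cast<std::size_t>(par), 0)),
          count_(static_cast<std::size_t>(M / par), 0) {}

    // Accept one group of par products (group index g) from the parallel multipliers.
    void accumulate(std::size_t g, const std::vector<int32_t>& products) {
        for (int p = 0; p < par_; ++p)
            acc_[g][static_cast<std::size_t>(p)] += products[static_cast<std::size_t>(p)];
        if (++count_[g] == kernel_volume_) {       // one full kernel accumulated
            emit_(acc_[g]);                        // push par output points downstream
            std::fill(acc_[g].begin(), acc_[g].end(), 0);
            count_[g] = 0;
        }
    }

private:
    int par_, kernel_volume_;
    std::function<void(const std::vector<int32_t>&)> emit_;
    std::vector<std::vector<int32_t>> acc_;   // depth M/par, width par
    std::vector<int> count_;                  // one counter per array entry
};
```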
As an embodiment, the above ring buffer belongs to the ring buffer management module. The ring buffer serves as the storage medium between convolution layers: it receives output data from the upper convolution layer, and data are read from it according to the actual calculation requirements of the lower convolution layer, without direct communication between the convolution layers.
As shown in FIG. 3, the ring buffer management module is internally divided into write control logic and read control logic. The write control logic writes data into the ring buffer in the output order of the convolution layer. The read control logic, according to the data width and data height required by the convolution acceleration module, sequentially outputs the corresponding points of each channel; this logic organizes the data from the ring buffer and passes them to the next convolution layer.
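A minimal software model of such a ring buffer is sketched below; the capacity handling, data width, and names are assumptions for illustration rather than the disclosed FPGA implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Minimal behavioral sketch of the ring buffer management module: write
// control stores data in the order the upper convolution layer produces it,
// and read control releases data to the lower convolution layer, with the
// two layers never talking to each other directly.
class RingBuffer {
public:
    explicit RingBuffer(std::size_t capacity)
        : buf_(capacity), head_(0), tail_(0), count_(0) {}

    // Write control logic: store one value in upper-layer output order.
    bool write(int32_t v) {
        if (count_ == buf_.size()) return false;   // buffer full: upper layer must stall
        buf_[tail_] = v;
        tail_ = (tail_ + 1) % buf_.size();
        ++count_;
        return true;
    }

    // Read control logic: hand one value to the lower layer when it needs it.
    std::optional<int32_t> read() {
        if (count_ == 0) return std::nullopt;      // nothing available yet
        int32_t v = buf_[head_];
        head_ = (head_ + 1) % buf_.size();
        --count_;
        return v;
    }

    std::size_t available() const { return count_; }

private:
    std::vector<int32_t> buf_;
    std::size_t head_, tail_, count_;
};
```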
As an embodiment, the application flow of the method of the present disclosure is shown in fig. 4, and the specific steps thereof are as follows:
step 1): the initial data loading module sends configuration information through a control interface connected with the image sensor, configures a data coding mode of the image sensor into line scanning, and receives image data shot by the image sensor.
Step 2): and receiving output data of the upper layer and storing the output data into the annular buffer according to basic parameters of the upper layer and the lower layer connected by the annular buffer, and taking out the data from the annular buffer to the lower layer for calculation when the receiving amount of the output data of the upper layer is enough to support the calculation of the lower layer, so that the pipeline technology is applied between all the layers. The starting delay of each layer in the convolutional neural network is reduced, and acceleration is realized.
Step 3): and after receiving the data transmitted by the ring buffer management module, the parallel acceleration module calculates. The parallel acceleration module respectively selects the calculation parallelism of the convolutional neural network and the management mode of the weight storage area according to the different sizes of the convolutional neural network, and the convolutional neural network is parallel from the direction of the output channel.
Step 4): and (3) repeating the second step and the third step according to the structure of the currently selected convolutional neural network.
Step 5): and finally, carrying out final processing on the calculation result of the steps through the full connection layer, and outputting an image processing result.
Example 2
In one embodiment of the present disclosure, an image processing system based on convolutional neural network FPGA acceleration is provided, including:
the initial image loading module, used for acquiring initial image data, preprocessing it, and then loading it into the FPGA as the initial data of the convolutional neural network;
the acceleration calculation module comprises a parallel calculation module, an output result control module and a ring buffer management module, and is used for realizing parallel calculation of initial data on the FPGA through a convolutional neural network and realizing on-chip storage management of images;
wherein the parallel computation includes: building ring buffers between different convolution layers, the ring buffers corresponding to the on-chip storage of the FPGA, so that the image data flowing between convolution layers do not pass through an external storage device; the convolution layers are computed in parallel along the output-channel direction of the convolutional neural network, the image data flow order of the convolution layers is changed, and the computation parallelism of the convolutional neural network and the management mode of the weight storage area are selected according to the size of the convolutional neural network, thereby realizing the acceleration of image processing.
The pipelining technique is applied as follows: ring buffers are built between different layers of the convolutional neural network to manage the on-chip storage of the FPGA, so that the image data flowing between layers do not pass through an external storage device; at the same time, the ring buffers allow the data-flow technique to be applied between layers, reducing the start-up delay of each layer and accelerating the convolutional neural network.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the specific embodiments of the present disclosure have been described above with reference to the drawings, it should be understood that the present disclosure is not limited to the embodiments, and that various modifications and changes can be made by one skilled in the art without inventive effort on the basis of the technical solutions of the present disclosure while remaining within the scope of the present disclosure.

Claims (10)

1. The image processing method based on convolutional neural network FPGA acceleration is characterized by comprising the following steps:
acquiring initial image data and preprocessing the initial image data;
the convolutional neural network is subjected to parallel computation and pipeline technology application, so that image data processing is realized;
wherein the parallel computing includes: the method comprises the steps of performing parallel calculation on a convolutional layer and a pooling layer from the direction of an output channel of the convolutional neural network, respectively selecting the calculation parallelism of the convolutional neural network and the management mode of a weight storage area according to the different sizes of the convolutional neural network, and realizing the calculation acceleration of the convolutional neural network while fully utilizing FPGA resources;
the pipeline technology is applied as follows: and building annular buffer areas between different layers of the convolutional neural network for managing on-chip storage of the FPGA, so that image data between the layers flows without passing through external storage equipment, and meanwhile, the annular buffer areas can enable a data flow technology to be applied between the layers, so that the starting delay of each layer is reduced, and acceleration of the convolutional neural network is realized.
2. The method for processing the image based on the FPGA acceleration of the convolutional neural network according to claim 1, wherein initial image data is acquired, the initial image data is organized into data conforming to the parallel computation of the convolutional neural network, the initial image data is divided into two types of data requiring the parallel computation and data requiring no parallel computation, and the initial image requiring the parallel computation is added with data classification and data splicing processing on the basis of the initial image requiring no parallel computation.
3. The convolutional neural network FPGA acceleration-based image processing method of claim 1, wherein the initial image data FPGA loading comprises: constructing a grid initial coordinate system, wherein scales of X, Y and Z axes are integers, cubes with length, width and height of 1 represent one data point, and an X axis represents the number of pieces of image data, namely the number of channels of convolution data; the Y-axis represents the width of each image, i.e., the width of the convolution data; the Z-axis represents the height of each image, i.e., the height of the convolution data, and the initial data loading process sequentially outputs corresponding data points for each image in order of width first and height second.
4. The method for processing the image based on the FPGA acceleration of the convolutional neural network according to claim 1, wherein the annular buffer is built between different convolutional layers, the annular buffer is stored on a chip physically corresponding to the FPGA, and the data transmitted from the annular buffer and the corresponding points of the convolutional kernel are multiplied and added, so that the output result of the convolution is calculated and transmitted to the annular buffer of the next stage.
5. The convolutional neural network FPGA-based accelerated image processing method of claim 1, wherein the parallel computing process is as follows: when the convolution layer calculation acceleration module reads one input data, the weight data in M convolution kernels are required to be read and multiplied by the input data, under the parallel condition of a channels, the M weight data are read in M/a times, and each time the a weight data are read, the a weight data are simultaneously read from the on-chip storage and are multiplied by the input data, and the output result is spliced and transmitted once, so that parallel calculation of the a channels is realized.
6. The method for processing the image accelerated by the FPGA based on the convolutional neural network according to claim 5, wherein multiplying one input datum with the corresponding data of all convolution kernels requires M multiplication operations; when there is only one multiplication calculation unit, M clock cycles are needed to complete the M multiplication operations; through a-channel parallelism, a multiplication calculation units are generated on the FPGA and a weight data are taken out at a time to be multiplied with the input datum; M multiplication operations are still needed, but a multiplication operations are executed in one clock cycle, so executing the M multiplication operations requires only M/a clock cycles; taking out the weight data of a convolution kernels each time also takes one clock cycle, and M/a clock cycles are needed to take out the corresponding weight data of the M convolution kernels; M/a clock cycles are likewise required to output the M calculation results by first splicing the calculation results and then outputting them.
7. The method for processing the image based on the FPGA acceleration of the convolutional neural network according to claim 5, wherein the parallel computing module only provides the multiplication of the input data with the corresponding weights of the convolution kernels; in the actual convolution process, all data points in the m-th convolution kernel are multiplied with the corresponding input data and then accumulated to obtain one point of output data channel m; an output result control module is therefore constructed, which comprises an array of depth M/a and width a that can accommodate a pieces of output data, together with M/a counters corresponding one-to-one to the data in the array; when a counter has incremented to the size of one convolution kernel, the data in the corresponding array entry are output to the next-stage ring buffer.
8. The method for processing the image based on the FPGA acceleration of the convolutional neural network according to claim 1, wherein the annular buffer area is a storage medium between the convolutional layers, receives input data from an upper convolutional layer under the condition that the convolutional layers are not communicated, and reads the data from the annular buffer area according to the actual calculation requirement of a lower convolutional layer.
9. The image processing method based on the FPGA acceleration of the convolutional neural network as set forth in claim 8, wherein the ring buffer internally comprises write control logic and read control logic; the write control logic writes into the ring buffer sequentially according to the output order of the convolution layer; the read control logic sequentially outputs the corresponding points of each channel according to the data width and data height, and this logic organizes the data from the ring buffer and transmits them to the next-stage convolution layer.
10. The image processing system based on convolutional neural network FPGA acceleration is characterized by comprising:
the initial image loading module is used for acquiring initial image data and preprocessing the initial image data;
the acceleration calculation module is used for carrying out parallel calculation and pipeline technology application on the convolutional neural network so as to realize image data processing;
wherein the parallel computing includes: the method comprises the steps of performing parallel calculation on a convolutional layer and a pooling layer from the direction of an output channel of the convolutional neural network, respectively selecting the calculation parallelism of the convolutional neural network and the management mode of a weight storage area according to the different sizes of the convolutional neural network, and realizing the calculation acceleration of the convolutional neural network while fully utilizing FPGA resources;
the pipeline technology is applied as follows: and building annular buffer areas between different layers of the convolutional neural network for managing on-chip storage of the FPGA, so that image data between the layers flows without passing through external storage equipment, and meanwhile, the annular buffer areas can enable a data flow technology to be applied between the layers, so that the starting delay of each layer is reduced, and acceleration of the convolutional neural network is realized.
CN202410022558.3A 2024-01-05 2024-01-05 Convolutional neural network FPGA acceleration-based image processing method and system Pending CN117853310A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410022558.3A CN117853310A (en) 2024-01-05 2024-01-05 Convolutional neural network FPGA acceleration-based image processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410022558.3A CN117853310A (en) 2024-01-05 2024-01-05 Convolutional neural network FPGA acceleration-based image processing method and system

Publications (1)

Publication Number Publication Date
CN117853310A true CN117853310A (en) 2024-04-09

Family

ID=90534146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410022558.3A Pending CN117853310A (en) 2024-01-05 2024-01-05 Convolutional neural network FPGA acceleration-based image processing method and system

Country Status (1)

Country Link
CN (1) CN117853310A (en)

Similar Documents

Publication Publication Date Title
CN109543832B (en) Computing device and board card
CN109886400B (en) Convolution neural network hardware accelerator system based on convolution kernel splitting and calculation method thereof
CN109522052B (en) Computing device and board card
CN111897579B (en) Image data processing method, device, computer equipment and storage medium
CN104915322B (en) A kind of hardware-accelerated method of convolutional neural networks
US7447720B2 (en) Method for finding global extrema of a set of bytes distributed across an array of parallel processing elements
CN107085562B (en) Neural network processor based on efficient multiplexing data stream and design method
US11586601B2 (en) Apparatus and method for representation of a sparse matrix in a neural network
US20230026006A1 (en) Convolution computation engine, artificial intelligence chip, and data processing method
CN109284824B (en) Reconfigurable technology-based device for accelerating convolution and pooling operation
CN112799726B (en) Data processing device, method and related product
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
CN112686379B (en) Integrated circuit device, electronic apparatus, board and computing method
CN110059797B (en) Computing device and related product
CN112232517B (en) Artificial intelligence accelerates engine and artificial intelligence treater
CN110414672B (en) Convolution operation method, device and system
CN109753319B (en) Device for releasing dynamic link library and related product
CN109711540B (en) Computing device and board card
Cho et al. FARNN: FPGA-GPU hybrid acceleration platform for recurrent neural networks
CN116090518A (en) Feature map processing method and device based on systolic operation array and storage medium
CN112862079B (en) Design method of running water type convolution computing architecture and residual error network acceleration system
CN117853310A (en) Convolutional neural network FPGA acceleration-based image processing method and system
CN115293978A (en) Convolution operation circuit and method, image processing apparatus
Bai et al. An OpenCL-based FPGA accelerator with the Winograd’s minimal filtering algorithm for convolution neuron networks
CN115437602A (en) Arbitrary-precision calculation accelerator, integrated circuit device, board card and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination