CN112801285B - FPGA-based high-resource-utilization CNN accelerator and acceleration method thereof

Info

Publication number
CN112801285B
CN112801285B (application CN202110157101.XA; published as CN112801285A)
Authority
CN
China
Prior art keywords
data
layer
calculation
fpga
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110157101.XA
Other languages
Chinese (zh)
Other versions
CN112801285A (en)
Inventor
陈雪 (Chen Xue)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Weihao Technology Co ltd
Original Assignee
Nanjing Weihao Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Weihao Technology Co., Ltd.
Priority to CN202110157101.XA
Publication of CN112801285A
Application granted
Publication of CN112801285B
Legal status: Active


Classifications

    • G: Physics
    • G06: Computing; calculating or counting
    • G06N: Computing arrangements based on specific computational models
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation using electronic means
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods
    • Y02D: Climate change mitigation technologies in information and communication technologies [ICT], i.e. ICT aiming at the reduction of their own energy use
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides an FPGA-based high-resource-utilization CNN accelerator and an acceleration method thereof. The accelerator comprises a plurality of layer processors connected end to end, which complete the calculation of continuous batch tasks in a pipelined manner; each layer processor comprises a convolution calculation unit, an intra-layer data multiplexing unit, an inter-layer data multiplexing unit and an optimizing unit based on resource utilization, all electrically connected to one another. The accelerator uses a convolution calculation unit based on Winograd fast convolution to reduce the number of multiplications and the amount of multiplier resources required by continuous convolution operations, which effectively improves its energy efficiency; it is further optimized against a resource-utilization objective, which reduces the waste of computing resources and raises the performance ceiling of CNN accelerators on FPGAs.

Description

FPGA-based high-resource-utilization CNN accelerator and acceleration method thereof
Technical Field
The invention relates to the field of hardware acceleration of deep learning algorithms, in particular to an FPGA-based high-resource-utilization CNN accelerator and an acceleration method thereof.
Background
CNN is one of the most important deep learning algorithms and, owing to its very high accuracy, is widely used in fields such as object classification and autonomous driving. The excellent recognition accuracy of CNNs comes at the cost of enormous computation and parameter counts: the convolution layers account for more than 90% of a network's computation, and convolution is complex, so on a CPU designed for serial computation it causes frequent data accesses and runs slowly. As CNNs develop, recognition accuracy keeps rising and the computational load grows sharply, and traditional CPU platforms can no longer meet the real-time requirements of new applications, so many researchers have devoted themselves to designing efficient hardware structures to accelerate CNNs. FPGAs offer abundant resources, strong parallel computing capability and low power consumption, and their programmability gives FPGA-based accelerators high flexibility: the hardware structure can be updated in step with the CNN algorithm at low cost. FPGA-based high-resource-utilization CNN accelerators have therefore attracted wide attention. At present, some FPGA-based high-resource-utilization CNN accelerators accelerate networks with a single-layer processor structure; this can accelerate a variety of CNN networks and is highly flexible, but intermediate data are frequently moved on and off chip, so the operation time and power cost are high and continuous batch recognition tasks are handled poorly. Other accelerators target one specific CNN network: all layers are mapped onto the FPGA, different layers are processed by a series of layer processors, and batch tasks are executed between the layers in a pipelined manner, shortening the output interval. In addition, no intermediate data are returned off chip during the whole process, which reduces the power consumed by on-chip/off-chip data accesses. The disadvantage is that whenever the network changes, the code must be rewritten, synthesized and burned again.
At present, fully mapped FPGA-based CNN accelerators generally take the Roofline model as a reference and allocate on-chip resources with throughput as the optimization target so as to improve pipeline efficiency. However, this method usually considers only the number of multipliers used; it lacks an analysis of the multipliers' actual effective operation time and therefore cannot exploit the FPGA's computing resources to the maximum extent.
Disclosure of Invention
The invention aims to: provide an FPGA-based high-resource-utilization CNN accelerator, and further an acceleration method based on this accelerator, so as to solve the problems in the prior art.
The technical scheme is as follows: the FPGA-based high-resource-utilization CNN accelerator comprises a plurality of layer processors connected end to end, which complete the calculation of continuous batch tasks in a pipelined manner. Each layer processor is composed of a convolution calculation unit, an intra-layer data multiplexing unit and an inter-layer data multiplexing unit; the intra-layer and inter-layer data multiplexing units are mapped to specific circuit connections, while the optimizing unit is mapped to the resource usage within the layer processor.
In a further embodiment, there are at least three layer processors. The first layer processor receives the input pixels and the corresponding weight parameters, is responsible for data storage and calculation, and passes its results to the second layer processor. The second layer processor receives the data from the first layer processor and the weight parameters transmitted from off chip, completes data storage and calculation, and passes its results to the third layer processor. The third layer processor receives the data from the second layer processor and the weight parameters transmitted from off chip, completes data storage and calculation, and transfers the results off chip.
In a further embodiment, the convolution input data x1, x2, x3 are updated into the convolution calculation unit sequentially in a pipelined manner, while the weight parameters w1, w2, w3 remain unchanged until the calculation of the current batch is completed; a multiplexer controlled by the sel signal multiplies the input data by the corresponding parameters in different periods. The multiplication factor 1/2 of the Winograd fast convolution algorithm is embedded in the circuit and realized by shifting the signal lines, which reduces the number of multipliers used.
In a further embodiment, a single input feature map stores the data of its even and odd rows in separate memories, providing two-way parallel storage and supply of data. Each memory is followed by a line-buffer structure that periodically provides input data blocks to the group of computing units, which cooperates with the convolution calculation unit to realize efficient convolution.
Based on this CNN accelerator, the embodiment provides a high-resource-utilization CNN acceleration method comprising the following steps:
step 1, cutting the feature map into small blocks that are operated on in batches; intra-layer and inter-layer data blocks are transmitted continuously, and adjacent convolution blocks contain repeated data;
step 2, storing data continuously from the initial address a of the storage unit until data block A is stored;
step 3, storing data block B at address b; once blocks A and B are stored, stopping the reception of data from the previous layer and starting the operation of the current batch;
step 4, after the calculation of the current batch is completed, storing a new data block C from address a so that it overwrites data block A; blocks C and B then form a new feature map, and the data order is adjusted for the operation by changing the fetch addresses;
step 5, computing the subsequent batches, updating only the necessary non-repeated data.
In a further embodiment, the high-resource-utilization CNN acceleration method further comprises the following steps. First, the total storage space required to hold all intermediate results of the accelerated CNN network simultaneously is calculated as Σ_{j=1..n} H_j·W_j·C_j, where H_j, W_j and C_j are the height, width and channel count of the layer-j feature map of the network and n is the number of network layers. When the required storage exceeds the on-chip storage capacity of the FPGA, the input feature map is cut into small blocks that are calculated in batches, as follows: a) first, the height of the last-layer feature map is cut to H_Tn = H_n / 2; if the last-layer height is odd, rows of zeros are appended, and the width and channel count remain unchanged; b) if this does not yet satisfy the storage requirement, the feature-map heights are cut from the penultimate layer back to the input layer (the first layer) with the update formula H_Tj = (H_T(j+1) - 1) · S_j + K_j, where S_j and K_j are the convolution stride and kernel size of the j-th layer of the network, and rows of zero elements are appended to compensate for odd heights; c) if the storage requirement is still not met, the last-layer height is cut to H_Tn = H_n / 4 and step b is repeated, and so on, until the storage requirement is met or the last-layer feature-map height reaches 2; d) if the clipping height of the last-layer feature map has reached 2 and the storage requirement still cannot be met, the same clipping procedure is applied to the feature-map width until the requirement is satisfied.
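As an illustrative aid only (no such code appears in the disclosure), the storage check and the height-clipping loop of steps a) to c) can be sketched in Python as follows; the function name, the power-of-two halving of the last-layer height and the omission of width stage d) are our simplifying assumptions:

```python
import math

def clip_heights(shapes, strides, kernels, on_chip_words):
    """Sketch of the clipping flow, height direction only.
    shapes:  per-layer (H_j, W_j, C_j) of the n feature maps;
    strides: S_j, kernels: K_j of the convolution consuming layer j;
    on_chip_words: available on-chip storage, in data words."""
    n = len(shapes)
    if sum(h * w * c for h, w, c in shapes) <= on_chip_words:
        return [s[0] for s in shapes]          # everything already fits
    divisor = 2
    while True:
        h_t = [0] * n
        # steps a)/c): halve the last layer's height, padding odd values
        h_t[-1] = math.ceil(shapes[-1][0] / divisor)
        # step b): propagate toward the input layer:
        # H_Tj = (H_T(j+1) - 1) * S_j + K_j
        for j in range(n - 2, -1, -1):
            h_t[j] = (h_t[j + 1] - 1) * strides[j] + kernels[j]
        need = sum(h * s[1] * s[2] for h, s in zip(h_t, shapes))
        if need <= on_chip_words or h_t[-1] <= 2:
            return h_t       # step d) (width clipping) is omitted here
        divisor *= 2

# Three feature maps of a toy 2-layer conv net (stride 1, kernel 3).
print(clip_heights([(32, 32, 16), (30, 30, 32), (28, 28, 64)],
                   strides=[1, 1], kernels=[3, 3], on_chip_words=40000))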
The feature map is cut into small blocks that are operated on in batches, intra-layer and inter-layer data blocks are transmitted continuously, and adjacent convolution blocks contain repeated data. During storage, data are first written continuously from the initial address a of the storage unit until data block A is complete. Data block B is then stored at address b; once blocks A and B are stored, reception of data from the previous layer stops and the operation of the current batch begins. After the calculation of the current batch is completed, a new data block C is stored from address a, overwriting data block A; blocks C and B then form a new feature map, and the data order is adjusted for the operation by changing the fetch addresses. In this embodiment, a complete convolution block is provided by the previous layer only in the first calculation batch; each subsequent batch updates only the necessary non-repeated data, which reduces the data repetition of blocks B and C caused by the clipping of the feature map and improves the efficiency of layer-to-layer data transfer and calculation.
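A minimal Python model of this ping-pong storage scheme (illustrative only; the class and method names are ours) shows how block C overwrites block A in place while the logical row order is recovered purely by changing fetch addresses:

```python
class PingPongFeatureBuffer:
    """Two half-buffers at addresses a and b; a new block always
    overwrites the stale half, and the logical row order is restored
    by changing fetch addresses rather than by moving any data."""
    def __init__(self):
        self.half = [None, None]  # contents at address a and address b
        self.stale = 0            # index of the half to overwrite next

    def write_block(self, block):
        self.half[self.stale] = block  # e.g. block C overwrites block A
        self.stale ^= 1

    def logical_map(self):
        # Older half first, newer half second.
        return self.half[self.stale] + self.half[self.stale ^ 1]

buf = PingPongFeatureBuffer()
buf.write_block(["A0", "A1"])          # step 2: block A at address a
buf.write_block(["B0", "B1"])          # step 3: block B at address b
assert buf.logical_map() == ["A0", "A1", "B0", "B1"]
buf.write_block(["C0", "C1"])          # step 4: C overwrites A in place
assert buf.logical_map() == ["B0", "B1", "C0", "C1"]
```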
In a further embodiment, once the FPGA and the CNN network are given, the network is first partitioned according to the accelerator architecture so that the intermediate data fit in storage; the calculation batch of the fully connected layers is then chosen according to the on-chip/off-chip data transmission capacity so as to reduce memory accesses; finally, the amount of resources allocated to each layer is determined from the ratio of effective computation between layers and the actual resource utilization, achieving optimal utilization of the computing resources and completing the configuration of the accelerator parameters, after which the code is synthesized and burned onto the chip.
Compared with the prior art, the invention has the remarkable advantages that:
1) the convolution calculation unit based on fast convolution reduces the number of multiplications and the amount of multiplier resources required by continuous convolution operations, which effectively improves the accelerator's energy efficiency;
2) the accelerator is optimized against a resource-utilization objective, which reduces the waste of computing resources and raises the performance ceiling of the CNN accelerator on the FPGA.
Drawings
Fig. 1 is a system frame diagram of a CNN accelerator of the present invention.
Fig. 2 is a circuit configuration diagram of the convolution calculating unit of the present invention.
Fig. 3 is a schematic diagram of an intra-layer data multiplexing scheme according to the present invention.
Fig. 4 is a schematic diagram of an interlayer data multiplexing scheme according to the present invention.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the invention may be practiced without one or more of these details. In other instances, well-known features have not been described in detail in order to avoid obscuring the invention.
The applicant notes that, at present, some FPGA-based high-resource-utilization CNN accelerators accelerate networks with a single-layer processor structure; in this process, intermediate data are frequently moved on and off chip, so the operation time and power cost are high and continuous batch recognition tasks are handled poorly. Other accelerators target one specific CNN network: all layers are mapped onto the FPGA, different layers are processed by a series of layer processors, and batch tasks are executed between the layers in a pipelined manner, shortening the output interval. In addition, no intermediate data are returned off chip during the whole process, which reduces the power consumed by on-chip/off-chip data accesses. The disadvantage is that whenever the network changes, the code must be rewritten, synthesized and burned again. Fully mapped FPGA-based CNN accelerators generally take the Roofline model as a reference and allocate on-chip resources with throughput as the optimization target to improve pipeline efficiency; however, this method usually considers only the number of multipliers used, lacks an analysis of their actual effective operation time, and cannot exploit the FPGA's computing resources to the maximum extent.
Therefore, the applicant provides an FPGA-based high-resource-utilization CNN accelerator, and further an acceleration method based on it. A convolution calculation unit based on Winograd fast convolution reduces the number of multiplications and the multiplier resources required by continuous convolution operations, effectively improving the accelerator's energy efficiency, and the accelerator is optimized against a resource-utilization objective, reducing the waste of computing resources and raising the performance ceiling of the CNN accelerator on the FPGA.
Embodiment one:
referring to fig. 1, in the FPGA-based high-resource-utilization CNN accelerator of this embodiment, multiple layer processors are connected end to end and complete the calculation of continuous batch tasks in a pipelined manner. Each layer processor is composed of a convolution calculation unit and intra-layer and inter-layer data multiplexing schemes; the intra-layer and inter-layer data multiplexing units are mapped to specific circuit connections, while the optimizing unit is mapped to the resource usage within the layer processor. The first layer processor receives the input pixels and corresponding weight parameters, stores and processes the data, and passes the results to the next layer processor. Each middle layer processor receives the previous layer's data and the weight parameters transmitted from off chip, completes data storage and calculation, and passes the results to the next layer processor. The last layer processor receives the previous layer's data and the weight parameters transmitted from off chip, completes data storage and calculation, and transfers the results off chip.
Embodiment two:
in the embodiment shown in fig. 2, the convolution input data x1, x2, x3, ... are updated into the convolution calculation unit sequentially in a pipelined manner (x2 replaces x1, x3 replaces x2, x4 replaces x3), and the multiply-add operations complete periodically. In the first calculation period, an adder and a multiplexer compute (x1 - x3) and (x3 + x2), while the result of (x3 - x2) is written into a register for temporary storage. The sel signal alternately selects the results of (x3 + x2) and (x3 - x2). The weight parameters w1, w2, w3 remain unchanged until the calculation of the current batch is completed; they can be processed independently ahead of the input data, the sums (w1 + w3 + w2) and (w1 + w3 - w2) being computed on chip with an adder. The MSB and LSB are the most and least significant bits of the binary data; the multiplication factor 1/2 of the Winograd fast convolution algorithm is embedded in the circuit and realized by shifting the signal lines. The sel signal synchronously selects the corresponding multiplication factors and the two multipliers output their products simultaneously, so a one-dimensional convolution with input vector length 4, kernel length 3 and stride 1 is completed every 2 periods; stacking and combining several such structures completes the Winograd fast convolution algorithm with few computing resources.
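The arithmetic realized by this unit is the Winograd F(2,3) minimal filtering algorithm, which produces 2 outputs of a length-3, stride-1 convolution with 4 multiplications instead of 6. The following Python sketch (illustrative only; the function name and the NumPy check are ours, not part of the disclosure) reproduces the transforms described above and verifies them against direct convolution:

```python
import numpy as np

def winograd_f23(x, w):
    """One Winograd F(2,3) tile: two outputs of a 1-D convolution
    (length-4 input, length-3 kernel, stride 1) using 4 multiplications
    instead of the 6 needed by the direct method."""
    x1, x2, x3, x4 = x
    w1, w2, w3 = w
    # Weight-side transform; the 1/2 factor is the one the patent embeds
    # in the circuit by shifting the signal lines (no multiplier needed).
    u_plus = (w1 + w3 + w2) * 0.5
    u_minus = (w1 + w3 - w2) * 0.5
    m1 = (x1 - x3) * w1       # adder result of the first period
    m2 = (x3 + x2) * u_plus   # chosen by the sel-controlled multiplexer
    m3 = (x3 - x2) * u_minus  # held in the register, chosen next period
    m4 = (x2 - x4) * w3
    return m1 + m2 + m3, m2 - m3 - m4

# Verify against the direct sliding-window convolution.
x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, -1.0, 2.0])
y0, y1 = winograd_f23(x, w)
assert np.isclose(y0, x[0] * w[0] + x[1] * w[1] + x[2] * w[2])
assert np.isclose(y1, x[1] * w[0] + x[2] * w[1] + x[3] * w[2])
```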
Embodiment III:
in the embodiment shown in fig. 3, the PEs are built from the unit of the embodiment of fig. 2. A single input feature map stores the data of its even and odd rows in separate memories, providing two-way parallel storage and supply of data. Each memory is followed by a line-buffer structure that periodically provides input data blocks to the group of computing units. In cooperation with the convolution calculation unit of fig. 2, this realizes efficient convolution.
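A small Python sketch of this storage scheme (illustrative only; the names and the stride-1, 3-row window are our assumptions) shows how the parity-split banks feed a line buffer that emits one convolution window per step once filled:

```python
def split_parity_banks(rows):
    """Even-indexed rows to one memory, odd-indexed rows to the other,
    so two rows can be fetched in parallel each cycle."""
    return rows[0::2], rows[1::2]

def sliding_windows(rows, k=3):
    """Line-buffer sketch: refill from the two banks in parallel and
    emit every k consecutive rows (stride 1) to the PE group."""
    even, odd = split_parity_banks(rows)
    window = []
    for i in range(len(rows)):
        window.append(even[i // 2] if i % 2 == 0 else odd[i // 2])
        if len(window) >= k:
            yield window[-k:]

# A 5-row feature map yields three 3-row windows for a 3x3 convolution.
rows = [f"row{i}" for i in range(5)]
assert len(list(sliding_windows(rows))) == 3
```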
Embodiment four:
in the embodiment shown in fig. 4, the feature map is cut into small blocks that are operated on in batches; intra-layer and inter-layer data blocks are transmitted continuously, and adjacent convolution blocks contain repeated data. During storage, data are first written continuously from the initial address a of the storage unit until data block A is complete. Data block B is then stored at address b; once blocks A and B are stored, reception of data from the previous layer stops and the operation of the current batch begins. After the calculation of the current batch is completed, a new data block C is stored from address a, overwriting data block A; blocks C and B then form a new feature map, and the data order is adjusted for the operation by changing the fetch addresses. In this embodiment, a complete convolution block is provided by the previous layer only in the first calculation batch; each subsequent batch updates only the necessary non-repeated data, which reduces the data repetition of blocks B and C caused by the clipping of the feature map and improves the efficiency of layer-to-layer data transfer and calculation.
After the FPGA and the CNN network are given, the network is first partitioned according to the accelerator architecture so that the intermediate data fit in storage; the calculation batch of the fully connected layers is then chosen according to the on-chip/off-chip data transmission capacity to reduce memory accesses; finally, the amount of resources allocated to each layer is determined from the ratio of effective computation between layers and the actual resource utilization, achieving optimal utilization of the computing resources and completing the configuration of the accelerator parameters, after which the code is synthesized and burned onto the chip.
Fifth embodiment:
in order to verify the effectiveness of the invention, the following experiment was carried out on a Xilinx VC709 FPGA platform with the AlexNet network as the example.
The off-chip memory is a 4 GB DDR3 memory; the convolution layers of the AlexNet network need little clipping and are kept at their original size, and the batch size of the fully connected layers is 32. All data are in 16-bit fixed-point format.
After the layer processor architecture of each layer is determined, the theoretical computation time T_j for a single task on each layer processor and the output interval ΔT of continuous batch tasks are obtained from the size and computational characteristics of the AlexNet convolution layers. The effective utilization rate of the computing resources is then U = Σ_{j=1..n}(T_j · d_j) / (ΔT · Σ_{j=1..n} d_j), where n is the number of CNN layers and d_j is the number of multipliers in the layer-j processor.
Substituting the resource constraints and the network information into this formula for optimization, when the numbers of multipliers of the accelerator's layer processors are {386, 1296, 368, 488, 368, 160, 64, 32}, the theoretical computing-resource utilization of the accelerator reaches 98.8%; a CNN accelerator realized with these parameters processes AlexNet with a throughput of 973 GOPs and a resource efficiency (throughput per multiplier) of 0.31 GOPs/DSP.
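A short Python sketch of this metric follows (hypothetical: the closed form is our reconstruction above, since the original formula survives only as an image, and the example numbers are invented):

```python
def resource_utilization(T, d, dT):
    """Effective computing-resource utilization: the d_j multipliers of
    layer j do useful work for T_j seconds out of every output interval
    dT, so utilization = sum(T_j * d_j) / (dT * sum(d_j))."""
    assert len(T) == len(d)
    return sum(Tj * dj for Tj, dj in zip(T, d)) / (dT * sum(d))

# Hypothetical 3-layer pipeline with a 100 microsecond output interval.
print(resource_utilization(T=[95e-6, 98e-6, 90e-6],
                           d=[256, 512, 128], dT=100e-6))  # ~0.96
```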
In conclusion, the FPGA-based high-resource-utilization CNN accelerator achieves both high resource efficiency and high throughput while reducing the waste of resources.
As described above, although the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limiting the invention itself. Various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (3)

1. An acceleration method of an FPGA-based high-resource-utilization CNN accelerator, characterized in that the CNN accelerator comprises: a plurality of layer processors connected end to end, the layer processors completing the calculation of continuous batch tasks in a pipelined manner;
each layer processor comprises a convolution calculation unit, an intra-layer data multiplexing unit, an inter-layer data multiplexing unit and an optimizing unit based on resource utilization, which are electrically connected to one another;
the first layer processor is used for receiving input pixels and corresponding weight parameters, is responsible for data storage and calculation, and transfers calculation results to the second layer processor;
the second layer processor receives the data from the first layer processor and the weight parameters transmitted from the outside of the chip, completes data storage and calculation, and transfers the calculation result to the third layer processor;
the third layer processor receives the data from the second layer processor and the weight parameters transmitted from the outside of the chip, completes data storage and calculation, and transfers the calculation result to the outside of the chip;
the convolved input data x1, x2 and x3 are sequentially updated into the convolution calculation unit in a pipeline mode, weight parameters w1, w2 and w3 are kept unchanged before the calculation of the current batch is completed, and the multiplexer is controlled by the sel signal to realize multiplication of the input data and corresponding parameters in different periods;
the acceleration method comprises the following steps:
step 1, cutting the feature map into small blocks that are operated on in batches; intra-layer and inter-layer data blocks are transmitted continuously, and adjacent convolution blocks contain repeated data;
step 2, storing data continuously from the initial address a of the storage unit until data block A is stored;
step 3, storing data block B at address b; once blocks A and B are stored, stopping the reception of data from the previous layer and starting the operation of the current batch;
step 4, after the calculation of the current batch is completed, storing a new data block C from address a so that it overwrites data block A; blocks C and B then form a new feature map, and the data order is adjusted for the operation by changing the fetch addresses;
step 5, computing the subsequent batches and updating only the necessary non-repeated data;
first, the total storage space required to hold all intermediate results of the accelerated CNN network simultaneously is calculated as Σ_{j=1..n} H_j·W_j·C_j, where H_j, W_j and C_j are the height, width and channel count of the layer-j feature map of the network, and n is the number of network layers;
when the required storage space is higher than the storage capacity on the FPGA chip, cutting the input feature map into small blocks for batch calculation;
the process of clipping the input feature map into small blocks for calculation in batches further comprises:
a) clipping the height of the last-layer feature map to H_Tn = H_n / 2; if the last-layer height is odd, appending rows of zeros, with the width and channel count unchanged;
b) if the storage requirement is still not met, clipping the feature-map heights from the penultimate layer back to the input layer with the update formula H_Tj = (H_T(j+1) - 1) · S_j + K_j,
where S_j and K_j are the convolution stride and kernel size of the j-th layer of the network, and rows of zero elements are appended to compensate for odd heights;
c) if the storage requirement is still not met, clipping the last-layer height to H_Tn = H_n / 4 and repeating step b until the storage requirement is met or the last-layer feature-map height reaches 2;
d) when the clipping height of the last-layer feature map is 2 and the storage requirement still cannot be met, applying the same clipping procedure to the feature-map width until the requirement is satisfied.
2. The acceleration method of the FPGA-based high-resource-utilization CNN accelerator of claim 1, wherein: a single input feature map stores the data of its even and odd rows in separate memories, providing two-way parallel storage and supply of data; each memory is followed by a line-buffer structure that periodically provides input data blocks to the group of computing units.
3. The acceleration method of the FPGA-based high-resource-utilization CNN accelerator of claim 2, wherein: after the FPGA and the CNN network are given, the network is first partitioned to meet the storage requirement of the intermediate data; the calculation batch of the fully connected layers is then determined according to the on-chip/off-chip data transmission capacity to reduce memory accesses; finally, the amount of resources allocated to each layer is determined from the ratio of effective computation between layers and the actual resource utilization efficiency, completing the configuration of the accelerator parameters, after which the code is synthesized and burned onto the chip.
CN202110157101.XA 2021-02-04 2021-02-04 FPGA-based high-resource-utilization CNN accelerator and acceleration method thereof Active CN112801285B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110157101.XA CN112801285B (en) 2021-02-04 2021-02-04 FPGA-based high-resource-utilization CNN accelerator and acceleration method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110157101.XA CN112801285B (en) 2021-02-04 2021-02-04 FPGA-based high-resource-utilization CNN accelerator and acceleration method thereof

Publications (2)

Publication Number Publication Date
CN112801285A CN112801285A (en) 2021-05-14
CN112801285B (en) 2024-01-26

Family

ID=75814210

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110157101.XA Active CN112801285B (en) 2021-02-04 2021-02-04 FPGA-based high-resource-utilization CNN accelerator and acceleration method thereof

Country Status (1)

Country Link
CN (1) CN112801285B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108805272A (en) * 2018-05-03 2018-11-13 东南大学 A kind of general convolutional neural networks accelerator based on FPGA
CN110555516A (en) * 2019-08-27 2019-12-10 上海交通大学 FPGA-based YOLOv2-tiny neural network low-delay hardware accelerator implementation method
CN111488983A (en) * 2020-03-24 2020-08-04 哈尔滨工业大学 Lightweight CNN model calculation accelerator based on FPGA
CN111831254A (en) * 2019-04-15 2020-10-27 阿里巴巴集团控股有限公司 Image processing acceleration method, image processing model storage method and corresponding device
CN112306951A (en) * 2020-11-11 2021-02-02 哈尔滨工业大学 CNN-SVM resource efficient acceleration architecture based on FPGA

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2016203619A1 (en) * 2016-05-31 2017-12-14 Canon Kabushiki Kaisha Layer-based operations scheduling to optimise memory for CNN applications
US10621486B2 (en) * 2016-08-12 2020-04-14 Beijing Deephi Intelligent Technology Co., Ltd. Method for optimizing an artificial neural network (ANN)


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of a Chinese Character Recognition System Based on FPGA and CNN; Pan Siyuan (潘思园) et al.; Information Technology and Network Security (《信息技术与网络安全》); Vol. 38, No. 9; pp. 44-49 *

Also Published As

Publication number Publication date
CN112801285A (en) 2021-05-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant