CN109598338B - Convolutional neural network accelerator based on FPGA (Field Programmable Gate Array) for calculation optimization - Google Patents

Convolutional neural network accelerator based on FPGA (Field Programmable Gate Array) for calculation optimization

Info

Publication number
CN109598338B
CN109598338B
Authority
CN
China
Prior art keywords
weight
data
calculation
area
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811493592.XA
Other languages
Chinese (zh)
Other versions
CN109598338A (en)
Inventor
陆生礼
庞伟
舒程昊
范雪梅
吴成路
邹涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NANJING SAMPLE TECHNOLOGY CO LTD
Southeast University-Wuxi Institute Of Integrated Circuit Technology
Southeast University
Original Assignee
NANJING SAMPLE TECHNOLOGY CO LTD
Southeast University-Wuxi Institute Of Integrated Circuit Technology
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NANJING SAMPLE TECHNOLOGY CO LTD, Southeast University-Wuxi Institute Of Integrated Circuit Technology, Southeast University filed Critical NANJING SAMPLE TECHNOLOGY CO LTD
Priority to CN201811493592.XA
Publication of CN109598338A
Application granted
Publication of CN109598338B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation of neural networks using electronic means
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an FPGA-based, computation-optimized convolutional neural network accelerator comprising an AXI4 bus interface, a data buffer area, a prefetch data area, a result buffer area, a state controller and a PE array. The data buffer area buffers the feature map data, convolution kernel data and index values read from the external memory DDR through the AXI4 bus interface. The prefetch data area prefetches, from the feature map sub-buffers, the feature map data that must be input into the PE array in parallel. The result buffer area buffers the calculation results of each row of PEs. The state controller controls the working state of the accelerator and realizes the transitions between working states. The PE array reads the data in the prefetch data area and the convolution kernel buffers to carry out the convolution operation. The accelerator exploits parameter sparsity, repeated weight values and the properties of the Relu activation function to finish redundant calculations early, reducing the amount of computation and cutting energy consumption by reducing the number of memory accesses.

Description

Convolutional neural network accelerator based on FPGA (Field Programmable Gate Array) for calculation optimization
Technical Field
The invention belongs to the fields of electronic information and deep learning, and particularly relates to the hardware structure of a computation-optimized convolutional neural network accelerator based on an FPGA (Field Programmable Gate Array).
Background
In recent years, the use of deep neural networks has grown rapidly, with significant impact on economic and social activities worldwide. Deep convolutional neural network technology has received a great deal of attention in many machine learning fields, including speech recognition, natural language processing, and intelligent image processing; in the field of image recognition in particular, deep convolutional neural networks have achieved remarkable results, in some of these areas reaching accuracy beyond human levels. The superiority of the deep convolutional neural network arises from its ability to extract high-level features from raw data after statistical learning over large amounts of data.
Deep convolutional neural networks are notoriously computation-intensive, with convolution operations accounting for over 90% of the total number of operations. Reducing these extensive calculations by exploiting the runtime information and algorithmic structure of the convolution, i.e., reducing the work required for inference, has become a new research hotspot.
The high accuracy of deep convolutional neural networks comes at the cost of high computational complexity. In addition to being computationally intensive, convolutional neural networks require the storage of millions or even billions of parameters. The large size of such networks presents throughput and energy-efficiency challenges to the underlying acceleration hardware.
Currently, various accelerators based on FPGAs, GPUs (Graphics Processing Units) and ASICs (Application-Specific Integrated Circuits) have been proposed to improve the performance of deep convolutional neural networks. FPGA-based accelerators offer good performance, high energy efficiency, short development cycles and strong reconfigurability, and have therefore been widely studied. Unlike general-purpose architectures, FPGAs allow users to customize the designed hardware to accommodate various resource and data usage patterns.
Based on the above analysis, the present scheme arises from the problem in the prior art that the amount of redundant computation during convolution is excessive.
Disclosure of Invention
The invention aims to provide a computation-optimized convolutional neural network accelerator based on FPGA that uses parameter sparsity, repeated weight data, and the properties of the Relu activation function to finish redundant calculations early, reducing the amount of computation and cutting energy consumption by reducing the number of memory accesses.
In order to achieve the above object, the solution of the present invention is:
a convolution neural network accelerator based on calculation optimization of FPGA comprises an AXI4 bus interface, a data buffer area, a prefetched data area, a result buffer area, a state controller and a PE array;
the AXI4 bus interface is a general-purpose bus interface through which the accelerator can be attached to any bus device using the AXI4 protocol;
the data buffer area buffers the feature map data, convolution kernel data and index values read from the external memory DDR through the AXI4 bus interface; it comprises M feature map sub-buffers and C convolution kernel buffers, one convolution kernel buffer corresponding to each column of PEs, and the number of feature map sub-buffers actually used is determined by the parameters of the layer actually being calculated; the number M of feature map sub-buffers is determined by the convolution kernel size of the current layer of the convolutional neural network, the size of the output feature map, and the offset of the convolution window;
the prefetch data area prefetches, from the feature map sub-buffers, the feature map data that must be input into the PE array in parallel;
the result buffer area comprises R result sub-buffers, one per row of PEs, each buffering the calculation results of its row;
the state controller controls the working state of the accelerator and realizes the transitions between working states;
the PE array is realized on the FPGA and reads the data in the prefetch data area and the convolution kernel buffers to carry out the convolution operation; PEs in different columns compute different output feature maps, and PEs in different rows compute different rows of the same output feature map. The PE array comprises R×C PE units, and each PE unit supports two calculation optimization modes, a pre-activation mode and a weight-repetition mode.
The PE unit comprises an input buffer, a weight buffer, an input retrieval area, a weight retrieval area, a PE control unit, a pre-activation unit and a configurable multiply-accumulate unit. The input buffer and the weight buffer respectively store the feature map data and weight data required by the convolution calculation; the input retrieval area and the weight retrieval area respectively store the index values used to look up the feature map data and the weight data; the PE control unit controls the working state of the PE unit: it reads the index values from the retrieval areas, reads data from the buffers according to the index values, sends the data into the multiply-accumulate unit for calculation, configures the mode of the multiply-accumulate unit, and decides whether to start the pre-activation unit; the pre-activation unit monitors the partial sum of the convolution calculation and, if the partial sum is smaller than 0, stops the calculation and outputs 0; the multiply-accumulate unit carries out the convolution calculation and can be configured into a normal multiply-accumulate mode or the weight-repetition optimization mode.
The PE control unit determines whether the convolution optimization mode of the multiply-accumulate unit is the pre-activation mode or the weight-repetition mode, and a different optimization mode can be selected for each layer. The mode is determined by a two-bit mode flag: a high bit of 0 selects normal multiply-accumulate calculation, while a high bit of 1 selects the weight-repetition optimization mode; a low bit of 0 disables pre-activation, while a low bit of 1 enables the pre-activation mode.
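As a behavioral illustration only (not the patent's hardware), the flag decoding can be sketched in Python; the function name is hypothetical:

```python
def decode_mode(s1s0: int) -> dict:
    """Decode the two-bit mode flag S1S0 into the two optimization switches."""
    return {
        "weight_repetition": bool(s1s0 & 0b10),  # high bit 1 -> weight-repetition mode
        "pre_activation":    bool(s1s0 & 0b01),  # low bit 1 -> pre-activation enabled
    }

# Example: S1S0 = 01 -> normal multiply-accumulate with pre-activation enabled.
assert decode_mode(0b01) == {"weight_repetition": False, "pre_activation": True}
```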
The weight retrieval area comprises several weight sub-retrieval areas. Weights are written into the weight sub-buffers ordered from positive to negative, with zero weights last, and the corresponding input index values and weight index values are written into the retrieval areas in the same order; the sorting of the weights and index values is done offline. During convolution calculation, the weights are read sequentially from the weight buffers according to the weight index values.
Each weight index value uses a one-bit weight conversion flag to indicate whether the weight used for calculation changes: if the flag is 0, the weight is unchanged and the previous clock cycle's weight is reused; if the flag is 1, the weight changes and the next clock cycle reads the next weight from the weight sub-buffer in sequence.
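A host-side sketch of this offline preparation step, under the stated ordering (function and variable names are hypothetical); it groups equal weights, orders them from positive to negative with zeros last, and emits the one-bit conversion flags together with the input index stream:

```python
def sort_weights_offline(weights):
    """weights: flat list of kernel weights, one per input position."""
    # Zeros sort last; among non-zeros, order from most positive to most negative.
    order = sorted(range(len(weights)),
                   key=lambda i: (weights[i] == 0, -weights[i]))
    weight_buffer, conv_flags, input_index = [], [], []
    for n, pos in enumerate(order):
        w = weights[pos]
        if not weight_buffer or w != weight_buffer[-1]:
            weight_buffer.append(w)    # a new weight enters the weight sub-buffer
        input_index.append(pos)        # which feature-map datum pairs with this step
        # Flag 1 marks the last use of a weight: the next clock reads the next one.
        nxt = order[n + 1] if n + 1 < len(order) else None
        conv_flags.append(1 if nxt is None or weights[nxt] != w else 0)
    return weight_buffer, conv_flags, input_index
```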
The PE unit supports two calculation optimization modes, a pre-activation mode and a weight-repetition mode. The pre-activation mode monitors the sign of the convolution partial sum in real time: if the partial sum is negative, the PE control unit is notified to terminate the calculation and the Relu result zero is output directly; if it is positive, the convolution calculation continues. The weight-repetition mode first adds together the feature map data that share the same weight in the convolution operation and then multiplies the sum by that weight, reducing both the number of multiplications and the number of accesses to the weight data.
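A minimal software model of the pre-activation early exit, assuming non-negative (post-Relu) inputs and the positive-to-negative weight ordering above, under which a negative partial sum can no longer recover:

```python
def conv_window_preactivated(inputs, weights):
    """inputs/weights: aligned sequences for one convolution window,
    weights ordered from positive to negative (zeros last)."""
    partial = 0
    for x, w in zip(inputs, weights):
        partial += x * w
        if partial < 0:        # pre-activation unit fires
            return 0           # early Relu result; rest of the window is skipped
    return max(partial, 0)     # normal Relu at the end of the window
```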
In the weight-repetition mode, the input feature map data are accumulated while the weight conversion flag is 0, and the running sum is stored in a register. When the weight conversion flag is 1, the accumulation finishes; the accumulated partial sum and the weight value are sent to the multiplication unit to be multiplied, and the result is stored in the register.
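A behavioral sketch of this accumulate-then-multiply dataflow, consuming the flag and index streams produced by the offline sketch above (names are illustrative):

```python
def conv_window_weight_repeat(inputs, weight_buffer, conv_flags, input_index):
    result, acc, w_ptr = 0, 0, 0
    for flag, idx in zip(conv_flags, input_index):
        acc += inputs[idx]                 # flag 0 path: keep accumulating inputs
        if flag == 1:                      # last use of the current weight
            w = weight_buffer[w_ptr]
            if w != 0:                     # zero weight: the compute unit is skipped
                result += acc * w          # one multiply for the whole group
            acc, w_ptr = 0, w_ptr + 1      # next clock reads the next weight
    return result
```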
The state controller has 7 states: waiting, writing the feature map, writing the input index, writing the convolution kernel, writing the weight index, convolution calculation, and sending the calculation result. In each state, corresponding control signals are sent to the corresponding sub-modules to complete the corresponding functions.
Because the data bit width of the AXI4 bus interface is larger than that of a single weight or feature map datum, several data are spliced into one multi-bit word before sending, improving the data transmission speed.
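A minimal packing sketch; the little-endian lane order is an assumption, as the text does not specify the lane layout:

```python
def pack_lanes(values, lane_bits=16):
    """Splice several narrow values (e.g. 16-bit weights or results)
    into one wide bus word, lowest lane first (assumed order)."""
    word = 0
    for lane, v in enumerate(values):
        word |= (v & ((1 << lane_bits) - 1)) << (lane * lane_bits)
    return word

# Four 16-bit values -> one 64-bit beat, as used when sending results to DDR.
assert pack_lanes([0x1111, 0x2222, 0x3333, 0x4444]) == 0x4444333322221111
```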
By adopting this scheme, the invention uses the runtime information and algorithmic structure of the convolution calculation to cut redundant, useless computation and parameter reads, and accelerates the convolutional neural network on an FPGA hardware platform; it can improve the real-time performance of the DCNN, achieve higher computational performance, and reduce energy consumption.
Drawings
FIG. 1 is a schematic diagram of the structure of the present invention;
FIG. 2 is a schematic diagram of the PE structure of the present invention;
FIG. 3 is a schematic diagram of the input index and weight index operations;
FIG. 4 is a schematic diagram of the operation of the pre-activation unit.
Detailed Description
The technical scheme and beneficial effects of the present invention will be described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the hardware structure of the convolutional neural network accelerator designed in the present invention works as follows, taking a 16×16 PE array, a 3×3 convolution kernel, and a convolution stride of 1 as an example:
the PC caches the data partition in the external memory DDR through the PCI-E interface, the data cache area reads the characteristic diagram data through the AXI4 bus interface and caches the characteristic diagram data in 3 characteristic diagram sub-cache areas according to lines, and the input index value is cached in the characteristic diagram sub-cache areas in the same mode. Weight data read through an AXI4 bus interface are sequentially cached in 16 convolution kernel cache areas, and weight index values are cached in the convolution kernel cache areas in the same mode. The pre-fetching buffer zone sequentially reads 3 feature map sub-buffer zone data according to a row sequence, reads 3 x 18 16-bit feature map data in total, outputs 16-bit feature map data in parallel in each clock cycle, and inputs 3 feature map data in parallel. The output data of the prefetching buffer is sent to the first PE of each line of the PE array, and is sequentially transmitted to each adjacent PE of each line. The input index value is fed into the PE array in the same manner. The input feature map data is cached in the input sub-cache area of each PE, and the input index value is cached in the input retrieval area. The weight data and the weight index value are input into the first PE of each column of the PE array in parallel through 16 convolution kernel buffers, and each column of adjacent PEs are sequentially transmitted. And finally, the weight sub-buffer memory area and the weight retrieval area in the PE are cached. And the PE unit reads data from the input sub-buffer and the weight sub-buffer according to the configured calculation optimization mode and index values, performs convolution calculation, and sends accumulated results into 16 result sub-buffers in parallel, wherein the calculation results of each line of PE are stored in the same result sub-buffer.
Referring to fig. 2, the PE unit can be configured into the two calculation optimization modes, pre-activation and weight repetition, through the two-bit mode flag S1S0. When S1S0 is 01, the PE is configured in pre-activation mode: the pre-activation unit is started to monitor the partial-sum result of the multiply-accumulate operation, and if the partial sum is negative, the Relu result 0 is output in advance and the calculation of the current convolution window is stopped. When S1S0 is 10, the PE is configured in weight-repetition mode: the input accumulation unit is started, and for multiplications that share the same weight the additions are performed first, the input data being accumulated into a register until the weight changes, at which point the accumulated result is sent to the multiply-accumulate unit for the multiply-accumulate operation. When the weight is 0, the PE unit turns off the computing unit and outputs the partial sum directly.
Referring to fig. 3, the PE control unit uses the input index values to fetch the feature map data from the input sub-buffer in order and send them to the calculation unit. The weight index value is a one-bit weight conversion flag: if it is 0 the weight is unchanged, and if it is 1 the next weight is read in sequence. The weights and index values are arranged with the weights ordered from positive to negative and zero weights placed at the end, and this sorting is done offline. As shown in fig. 3, the first four input data correspond to the same weight x, the middle two to the same weight y, and the last three to the same weight z.
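The grouping of fig. 3 can be checked numerically with the sketches defined above (illustrative values; x, y and z stand for three distinct weights):

```python
x, y, z = 3, 2, -1
weights = [x, x, x, x, y, y, z, z, z]   # 4 inputs share x, 2 share y, 3 share z
inputs  = [1, 2, 3, 4, 5, 6, 7, 8, 9]

wb, flags, idx = sort_weights_offline(weights)   # flags = [0,0,0,1, 0,1, 0,0,1]
assert conv_window_weight_repeat(inputs, wb, flags, idx) \
       == sum(i * w for i, w in zip(inputs, weights))    # 3 multiplies instead of 9
```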
Referring to fig. 4, once the pre-activation unit is started it compares the partial sum with zero. If the partial sum is greater than zero, the calculation continues and the final result is output; if the partial sum is smaller than zero, a termination signal is sent to the PE control unit, which shuts the calculation down, and the post-Relu result zero is output directly.
Unfolding the convolution into vector multiply-accumulate operations makes the network structure and the hardware architecture match more closely, and simplifying the calculation according to the runtime information and algorithmic structure improves calculation efficiency and reduces energy consumption. The specific state transition process in this embodiment is as follows:
after initialization, the accelerator enters a waiting state, a state controller waits for a state signal state sent by an AXI4 bus interface, and when the state is 00001, the state controller enters a convolution writing core state; when the state is 00010, entering a writing weight index state; when the state is 00100, entering a state of writing a feature map; when the state is 01000, entering a write input index state; after the data is received, the waiting state is 10000, and the state of convolution calculation is entered. And after the calculation is finished, automatically jumping into a state of sending the calculation result, and jumping back to a waiting state after the completion of the sending.
Writing a feature map: on entering this state, the accelerator waits for the AXI4 bus interface data-valid signal to be pulled high and enables the 3 feature map sub-buffers in sequence: the first sub-buffer stores the first line of the feature map, the second sub-buffer the second line, and the third sub-buffer the third line; the fourth line of the feature map wraps back to the first sub-buffer, and so on (see the sketch after this paragraph). After the feature map data have been stored in this order, in the first clock cycle the first data of lines one, two and three of the feature map, held in the three sub-buffers, are taken and sent to the prefetch buffer; in the second clock cycle the first data of lines four, five and six are taken and sent to the prefetch buffer, and after the lines have been traversed, the second, third, … data are taken and sent in the same order. Once 3×18 feature map data have been stored in the prefetch buffer, 16 feature map data are output in parallel each clock cycle, sent to the first PE of each row of the PE array, passed in turn to the adjacent PEs of each row, and finally stored in the input sub-buffers of the PEs.
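A sketch of the round-robin line distribution just described (a behavioral model, not the hardware):

```python
def distribute_rows(feature_map, num_subbuffers=3):
    """Write feature-map lines round-robin into the sub-buffers:
    line 1 -> buffer 0, line 2 -> buffer 1, line 3 -> buffer 2, line 4 wraps."""
    subbuffers = [[] for _ in range(num_subbuffers)]
    for row_idx, row in enumerate(feature_map):
        subbuffers[row_idx % num_subbuffers].append(row)
    return subbuffers
```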
Writing an input index: on entering this state, the data are stored in the same manner as the feature map data, finally reaching the input retrieval areas of the PEs.
Writing a convolution kernel: on entering this state, the accelerator waits for the AXI4 bus interface data-valid signal to be pulled high and enables the 16 convolution kernel buffers in sequence: the first convolution kernel buffer stores the convolution kernel values for the first output channel, the second those for the second output channel, and so on. Each sub-buffer then outputs one datum per clock; the 16 weight data are input in parallel into the first PE of each column of the PE array, passed in turn between adjacent PEs in the same column, and finally buffered in the weight buffer inside each PE unit.
Writing a weight index: on entering this state, the data are stored in the same manner as the convolution kernel data, finally reaching the weight retrieval areas of the PEs.
Convolution calculation: on entering this state, the PE control unit configures the calculation optimization mode of the PE unit according to the mode flag S1S0, reads data from the weight sub-buffer and the input sub-buffer according to the weight index values and input index values, and sends them into the multiply-accumulate unit for calculation. After 3×3 × (number of input channels) multiply-accumulate operations, all data have been calculated, and on the next clock the controller jumps to the send-calculation-result state.
Sending the calculation result: on entering this state, the calculation results are read out in turn from the 16 result sub-buffers. The first output-channel data in each result sub-buffer are taken out, every four output data are spliced into one 64-bit word, and the words are sent to the external memory DDR through the AXI4 bus interface. Once the data of all 16 output channels have been sent to the external DDR in turn, the accelerator jumps back to the waiting state.
Parameters can be modified through the state controller: the image size, convolution kernel size, stride, output feature map size, and number of output channels can all be changed at run time. By exploiting the runtime information and the algorithmic structure, redundant calculations are skipped, reducing unnecessary computation and memory accesses, improving the efficiency of the convolutional neural network accelerator, and lowering energy consumption.
The above embodiments only illustrate the technical idea of the present invention and do not limit its protection scope; any modification made to the technical scheme on the basis of this technical idea falls within the protection scope of the present invention.

Claims (6)

1. A computation-optimized convolutional neural network accelerator based on FPGA, characterized in that it comprises an AXI4 bus interface, a data buffer area, a prefetch data area, a result buffer area, a state controller and a PE array;
the data buffer area buffers the feature map data, convolution kernel data and index values read from the external memory DDR through the AXI4 bus interface, and comprises M feature map sub-buffers and C convolution kernel sub-buffers;
the prefetch data area prefetches, from the feature map sub-buffers, the feature map data that must be input into the PE array in parallel;
the PE array is realized on the FPGA and comprises R×C PE units; each column of PE units is provided with a corresponding convolution kernel sub-buffer, and the number of feature map sub-buffers actually used is determined by the parameters of the layer actually being calculated; the PE array reads the data in the prefetch data area and the convolution kernel sub-buffers to carry out the convolution operation, PE units in different columns calculating different output feature maps and PE units in different rows calculating different rows of the same output feature map;
the PE unit comprises an input buffer, a weight buffer, an input retrieval area, a weight retrieval area, a PE control unit, a pre-activation unit and a multiply-accumulate unit; the input buffer and the weight buffer respectively store the feature map data and weight data required by the convolution calculation, and the input retrieval area and the weight retrieval area respectively store the index values used to look up the feature map data and the weight data; the PE control unit controls the working state of the PE unit, reads the index values from the retrieval areas, reads data from the buffers according to the index values, sends the data into the multiply-accumulate unit for calculation, configures the mode of the multiply-accumulate unit, and decides whether to start the pre-activation unit; the pre-activation unit monitors the partial sum of the convolution calculation and, if the partial sum is smaller than 0, stops the calculation and outputs 0; the multiply-accumulate unit carries out the convolution calculation and can be configured into a normal multiply-accumulate mode or the weight-repetition optimization mode;
the PE control unit determines whether the convolution optimization mode of the multiply-accumulate unit is the pre-activation mode or the weight-repetition mode, and a different optimization mode can be selected for each layer; the mode is determined by a two-bit mode flag: a high bit of 0 selects normal multiply-accumulate calculation, while a high bit of 1 selects the weight-repetition optimization mode; a low bit of 0 disables pre-activation, while a low bit of 1 enables the pre-activation mode;
the PE unit supports two calculation optimization modes, a pre-activation mode and a weight-repetition mode; the pre-activation mode monitors the sign of the convolution partial sum in real time, terminating the calculation and directly outputting the Relu result zero if the partial sum is negative, and continuing the convolution calculation if it is positive; the weight-repetition mode first adds together the feature map data that share the same weight in the convolution operation and then multiplies the sum by that weight, reducing both the number of multiplications and the number of accesses to the weight data;
the result buffer area comprises R result sub-buffers, one per row of PE units, each buffering the calculation results of its row;
the state controller controls the working state of the accelerator and realizes the transitions between working states.
2. An FPGA-based computation-optimized convolutional neural network accelerator as defined in claim 1, wherein: the weight retrieval area comprises several weight sub-retrieval areas; weights are written into the weight sub-buffers ordered from positive to negative, with zero weights last, and the corresponding input index values and weight index values are written into the retrieval areas in the same order; the sorting of the weights and index values is done offline; during convolution calculation, the weights are read sequentially from the weight buffers according to the weight index values.
3. An FPGA-based computation-optimized convolutional neural network accelerator as defined in claim 2, wherein: each weight index value uses a one-bit weight conversion flag to indicate whether the weight used for calculation changes; if the flag is 0, the weight is unchanged and the previous clock cycle's weight is reused; if the flag is 1, the weight changes and the next clock cycle reads the next weight from the weight sub-buffer in sequence.
4. An FPGA-based computation-optimized convolutional neural network accelerator as defined in claim 3, wherein: in the weight-repetition mode, the input feature map data are accumulated while the weight conversion flag is 0, and the accumulated result is stored in a register; when the weight conversion flag is 1, the accumulation finishes, the accumulated partial sum and the weight value are sent to the multiplication unit to be multiplied, and the result is stored in the register.
5. An FPGA-based computation-optimized convolutional neural network accelerator as defined in claim 1, wherein: the state controller has 7 states: waiting, writing the feature map, writing the input index, writing the convolution kernel, writing the weight index, convolution calculation, and sending the calculation result; in each state, corresponding control signals are sent to the corresponding sub-modules to complete the corresponding functions.
6. An FPGA-based computation-optimized convolutional neural network accelerator as defined in claim 1, wherein: the AXI4 bus interface splices multiple data into one multi-bit word for transmission.
CN201811493592.XA 2018-12-07 2018-12-07 Convolutional neural network accelerator based on FPGA (Field Programmable Gate Array) for calculation optimization Active CN109598338B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811493592.XA CN109598338B (en) 2018-12-07 2018-12-07 Convolutional neural network accelerator based on FPGA (Field Programmable Gate Array) for calculation optimization

Publications (2)

Publication Number Publication Date
CN109598338A (en) 2019-04-09
CN109598338B (en) 2023-05-19

Family

ID=65961420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811493592.XA Active CN109598338B (en) 2018-12-07 2018-12-07 Convolutional neural network accelerator based on FPGA (Field Programmable Gate Array) for calculation optimization

Country Status (1)

Country Link
CN (1) CN109598338B (en)

Families Citing this family (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097174B (en) * 2019-04-22 2021-04-20 西安交通大学 Method, system and device for realizing convolutional neural network based on FPGA and row output priority
CN110222835A (en) * 2019-05-13 2019-09-10 西安交通大学 A kind of convolutional neural networks hardware system and operation method based on zero value detection
CN110163295A (en) * 2019-05-29 2019-08-23 四川智盈科技有限公司 It is a kind of based on the image recognition reasoning accelerated method terminated in advance
CN110059808B (en) * 2019-06-24 2019-10-18 深兰人工智能芯片研究院(江苏)有限公司 A kind of method for reading data and reading data device of convolutional neural networks
CN110390383B (en) * 2019-06-25 2021-04-06 东南大学 Deep neural network hardware accelerator based on power exponent quantization
CN110390384B (en) * 2019-06-25 2021-07-06 东南大学 Configurable general convolutional neural network accelerator
CN110399883A (en) * 2019-06-28 2019-11-01 苏州浪潮智能科技有限公司 Image characteristic extracting method, device, equipment and computer readable storage medium
CN110378468B (en) * 2019-07-08 2020-11-20 浙江大学 Neural network accelerator based on structured pruning and low bit quantization
CN110414677B (en) * 2019-07-11 2021-09-03 东南大学 Memory computing circuit suitable for full-connection binarization neural network
WO2021031154A1 (en) * 2019-08-21 2021-02-25 深圳市大疆创新科技有限公司 Method and device for loading feature map of neural network
CN110673786B (en) * 2019-09-03 2020-11-10 浪潮电子信息产业股份有限公司 Data caching method and device
CN110705687B (en) * 2019-09-05 2020-11-03 北京三快在线科技有限公司 Convolution neural network hardware computing device and method
CN110738312A (en) * 2019-10-15 2020-01-31 百度在线网络技术(北京)有限公司 Method, system, device and computer readable storage medium for data processing
US11249651B2 (en) * 2019-10-29 2022-02-15 Samsung Electronics Co., Ltd. System and method for hierarchical sort acceleration near storage
CN110910434B (en) * 2019-11-05 2023-05-12 东南大学 Method for realizing deep learning parallax estimation algorithm based on FPGA (field programmable Gate array) high energy efficiency
CN111062472B (en) * 2019-12-11 2023-05-12 浙江大学 Sparse neural network accelerator based on structured pruning and acceleration method thereof
CN111178519B (en) * 2019-12-27 2022-08-02 华中科技大学 Convolutional neural network acceleration engine, convolutional neural network acceleration system and method
CN113111995A (en) * 2020-01-09 2021-07-13 北京君正集成电路股份有限公司 Method for shortening model reasoning and model post-processing operation time
CN113095471B (en) * 2020-01-09 2024-05-07 北京君正集成电路股份有限公司 Method for improving efficiency of detection model
CN111414994B (en) * 2020-03-03 2022-07-12 哈尔滨工业大学 FPGA-based Yolov3 network computing acceleration system and acceleration method thereof
CN111416743B (en) * 2020-03-19 2021-09-03 华中科技大学 Convolutional network accelerator, configuration method and computer readable storage medium
CN111340198B (en) * 2020-03-26 2023-05-05 上海大学 Neural network accelerator for data high multiplexing based on FPGA
CN111898733B (en) * 2020-07-02 2022-10-25 西安交通大学 Deep separable convolutional neural network accelerator architecture
CN111984548B (en) * 2020-07-22 2024-04-02 深圳云天励飞技术股份有限公司 Neural network computing device
CN112149814A (en) * 2020-09-23 2020-12-29 哈尔滨理工大学 Convolutional neural network acceleration system based on FPGA
CN112187954A (en) * 2020-10-15 2021-01-05 中国电子科技集团公司第五十四研究所 Flow control method of offline file in measurement and control data link transmission
CN112580793B (en) * 2020-12-24 2022-08-12 清华大学 Neural network accelerator based on time domain memory computing and acceleration method
CN114692847B (en) * 2020-12-25 2024-01-09 中科寒武纪科技股份有限公司 Data processing circuit, data processing method and related products
CN112668708B (en) * 2020-12-28 2022-10-14 中国电子科技集团公司第五十二研究所 Convolution operation device for improving data utilization rate
CN113094118B (en) * 2021-04-26 2023-05-30 深圳思谋信息科技有限公司 Data processing system, method, apparatus, computer device, and storage medium
CN113780529B (en) * 2021-09-08 2023-09-12 北京航空航天大学杭州创新研究院 FPGA-oriented sparse convolutional neural network multi-stage storage computing system
CN113869494A (en) * 2021-09-28 2021-12-31 天津大学 Neural network convolution FPGA embedded hardware accelerator based on high-level synthesis
CN114780910B (en) * 2022-06-16 2022-09-06 千芯半导体科技(北京)有限公司 Hardware system and calculation method for sparse convolution calculation
CN115311536B (en) * 2022-10-11 2023-01-24 绍兴埃瓦科技有限公司 Sparse convolution processing method and device in image processing
CN116187408B (en) * 2023-04-23 2023-07-21 成都甄识科技有限公司 Sparse acceleration unit, calculation method and sparse neural network hardware acceleration system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100076915A1 (en) * 2008-09-25 2010-03-25 Microsoft Corporation Field-Programmable Gate Array Based Accelerator System
US20180032859A1 (en) * 2016-07-27 2018-02-01 Samsung Electronics Co., Ltd. Accelerator in convolutional neural network and method for operating the same
CN108241890A (en) * 2018-01-29 2018-07-03 清华大学 A kind of restructural neural network accelerated method and framework
CN108537334A (en) * 2018-04-26 2018-09-14 济南浪潮高新科技投资发展有限公司 A kind of acceleration array design methodology for CNN convolutional layer operations
CN108805272A (en) * 2018-05-03 2018-11-13 东南大学 A kind of general convolutional neural networks accelerator based on FPGA
CN108665059A (en) * 2018-05-22 2018-10-16 中国科学技术大学苏州研究院 Convolutional neural networks acceleration system based on field programmable gate array

Also Published As

Publication number Publication date
CN109598338A (en) 2019-04-09

Similar Documents

Publication Publication Date Title
CN109598338B (en) Convolutional neural network accelerator based on FPGA (Field Programmable Gate Array) for calculation optimization
CN110390385B (en) BNRP-based configurable parallel general convolutional neural network accelerator
US11574659B2 (en) Parallel access to volatile memory by a processing device for machine learning
CN104915322B (en) A kind of hardware-accelerated method of convolutional neural networks
WO2020258528A1 (en) Configurable universal convolutional neural network accelerator
CN108805272A (en) A kind of general convolutional neural networks accelerator based on FPGA
Ma et al. Performance modeling for CNN inference accelerators on FPGA
CN108537331A (en) A kind of restructural convolutional neural networks accelerating circuit based on asynchronous logic
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
CN108427990A (en) Neural computing system and method
CN113743599B (en) Computing device and server of convolutional neural network
CN112487750A (en) Convolution acceleration computing system and method based on memory computing
CN112950656A (en) Block convolution method for pre-reading data according to channel based on FPGA platform
CN103927270A (en) Shared data caching device for a plurality of coarse-grained dynamic reconfigurable arrays and control method
Liu et al. FPGA-NHAP: A general FPGA-based neuromorphic hardware acceleration platform with high speed and low power
CN111488051A (en) Cloud deep neural network optimization method based on CPU and FPGA cooperative computing
Song et al. BRAHMS: Beyond conventional RRAM-based neural network accelerators using hybrid analog memory system
Kang et al. A framework for accelerating transformer-based language model on ReRAM-based architecture
US11436486B2 (en) Neural network internal data fast access memory buffer
CN117234720A (en) Dynamically configurable memory computing fusion data caching structure, processor and electronic equipment
CN112183744A (en) Neural network pruning method and device
CN103577160A (en) Characteristic extraction parallel-processing method for big data
EP3859535A1 (en) Streaming access memory device, system and method
US11526305B2 (en) Memory for an artificial neural network accelerator
US20220164127A1 (en) Memory for an Artificial Neural Network Accelerator

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant