CN109598338A - FPGA-based computation-optimized convolutional neural network accelerator - Google Patents

FPGA-based computation-optimized convolutional neural network accelerator Download PDF

Info

Publication number
CN109598338A
Authority
CN
China
Prior art keywords
weight
data
buffer area
area
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811493592.XA
Other languages
Chinese (zh)
Other versions
CN109598338B (en)
Inventor
陆生礼
庞伟
舒程昊
范雪梅
吴成路
邹涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sanbao Sci & Tech Co Ltd Nanjing
Southeast University - Wuxi Institute Of Technology Integrated Circuits
Southeast University
Original Assignee
Sanbao Sci & Tech Co Ltd Nanjing
Southeast University - Wuxi Institute Of Technology Integrated Circuits
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sanbao Sci & Tech Co Ltd Nanjing, Southeast University - Wuxi Institute Of Technology Integrated Circuits, Southeast University
Priority to CN201811493592.XA
Publication of CN109598338A
Application granted
Publication of CN109598338B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention discloses an FPGA-based computation-optimized convolutional neural network accelerator comprising an AXI4 bus interface, a data buffer, a prefetch buffer, a result buffer, a state controller, and a PE array. The data buffer caches the feature-map data, convolution-kernel data, and index values read from the external memory DDR through the AXI4 bus interface. The prefetch buffer prefetches, from the feature-map sub-buffers, the feature-map data to be fed into the PE array in parallel. The result buffer caches the computation results of each PE row. The state controller governs the accelerator's working states and implements the transitions between them. The PE array reads the data in the prefetch buffer and the kernel sub-buffers and performs the convolution operations. By exploiting parameter sparsity, repeated weight values, and the properties of the ReLU activation function, the accelerator terminates redundant computation early, reducing the amount of computation, and lowers energy consumption by reducing the number of memory accesses.

Description

FPGA-based computation-optimized convolutional neural network accelerator
Technical field
The invention belongs to the fields of electronic information and deep learning, and in particular relates to a hardware architecture for a computation-optimized convolutional neural network accelerator based on an FPGA (Field-Programmable Gate Array).
Background technique
In recent years, deep neural networks have seen rapidly growing use and have had a significant impact on the world economy and on social activity. Deep convolutional neural network (CNN) technology has attracted wide attention in many machine-learning fields, including speech recognition, natural language processing, and intelligent image processing; in image recognition in particular, deep CNNs have achieved remarkable results, in some cases surpassing human accuracy. The strength of deep CNNs lies in their ability to extract high-level features from raw data after statistical learning over massive datasets.
Deep CNNs are notoriously computation-intensive: convolution operations account for more than 90% of the total operation count. Exploiting runtime information and algorithm structure during convolution to cut this large volume of computation, that is, to reduce the work required for inference, has become a new research hotspot.
The high accuracy of deep CNNs comes at the cost of high computational complexity. Besides being computation-intensive, a CNN must also store millions, or even close to a hundred million, parameters. Networks of this scale pose throughput and energy-efficiency challenges to the underlying acceleration hardware.
At present, various accelerators based on FPGAs, GPUs (Graphics Processing Units), and ASICs (Application-Specific Integrated Circuits) have been proposed to improve the performance of deep CNNs. FPGA-based accelerators are widely studied thanks to their good performance, high energy efficiency, short development cycles, and strong reconfigurability. Unlike general-purpose architectures, an FPGA allows the user to customize the hardware's function to suit diverse resource and data-usage patterns.
Based on the above analysis, convolution computation in the prior art suffers from an excessive amount of redundant computation, which motivates the present invention.
Summary of the invention
The purpose of the present invention is to provide an FPGA-based computation-optimized convolutional neural network accelerator that exploits parameter sparsity, repeated weight values, and the properties of the ReLU activation function to terminate redundant computation early, reducing the amount of computation, and that lowers energy consumption by reducing the number of memory accesses.
To achieve the above objectives, the solution of the invention is as follows:
An FPGA-based computation-optimized convolutional neural network accelerator comprises an AXI4 bus interface, a data buffer, a prefetch buffer, a result buffer, a state controller, and a PE array.
The AXI4 bus interface is a generic bus interface; the accelerator can be mounted on any bus device that uses the AXI4 protocol.
The data buffer caches the feature-map data, convolution-kernel data, and index values read from the external memory DDR through the AXI4 bus interface. It comprises M feature-map sub-buffers and C kernel sub-buffers; each PE column is assigned one kernel sub-buffer, and the number of feature-map sub-buffers actually used is determined by the parameters of the layer being computed. The feature-map sub-buffer count M is determined by the current layer's convolution-kernel size, output feature-map size, and convolution-window stride.
The prefetch buffer prefetches, from the feature-map sub-buffers, the feature-map data to be fed into the PE array in parallel.
The result buffer comprises R result sub-buffers, one per PE row, which cache the computation results of each PE row.
The state controller governs the accelerator's working states and implements the transitions between them.
The PE array, implemented on the FPGA, reads the data in the prefetch buffer and the kernel sub-buffers and performs the convolution operations; PEs in different columns compute different output feature maps, and PEs in different rows compute different rows of the same output feature map. The PE array comprises R*C PE units, and each PE unit supports two computation-optimization modes: a pre-activation mode and a weight-repeat mode.
Each PE unit comprises an input buffer, a weight buffer, an input index area, a weight index area, a PE control unit, a pre-activation unit, and a configurable multiply-accumulate (MAC) unit. The input buffer and the weight buffer store the feature-map data and weight data required by the convolution; the input index area and the weight index area store the index values used to look up that data. The PE control unit governs the PE unit's working state: it reads the index values from the index areas, fetches data from the buffers according to the index values, feeds the data into the MAC unit, configures the MAC unit's mode, and decides whether to enable the pre-activation unit. The pre-activation unit monitors the convolution's partial sum; if the partial sum drops below 0, computation stops and 0 is output. The MAC unit performs the convolution and can be configured either for normal multiply-accumulate computation or for the optimization mode that exploits repeated weights.
The PE control unit determines whether the MAC unit's optimization mode is the pre-activation mode or the weight-repeat mode, selecting a different optimization mode for each layer. The mode is encoded in a two-bit flag: a high bit of 0 selects normal multiply-accumulate computation, while 1 selects the optimization mode that exploits repeated weights; a low bit of 0 disables pre-activation, while 1 enables the pre-activation mode.
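By way of illustration, the mode decode and the pre-activation early exit can be modeled behaviorally as follows. This is a minimal sketch, not the disclosed RTL: all names are invented, the inputs are assumed to be post-ReLU feature values (hence non-negative), and the monotonicity comment relies on the positive-first weight ordering described immediately below.

```python
# Behavioral sketch of one PE's mode decode and pre-activation early exit.
# Illustrative only; the high bit (weight-repeat) is modeled in a later sketch.

def pe_window(inputs, weights, mode_flag):
    """Compute one convolution window; mode_flag is the two-bit mode flag."""
    pre_activate = mode_flag & 0b01          # low bit enables pre-activation
    acc = 0
    for x, w in zip(inputs, weights):
        acc += x * w
        # With weights ordered positive-first (see below) and inputs
        # non-negative, a negative partial sum can only fall further,
        # so the window's ReLU output is already known to be 0.
        if pre_activate and acc < 0:
            return 0                         # terminate early, output ReLU 0
    return max(acc, 0) if pre_activate else acc   # activation else downstream
```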
The weight index area comprises multiple weight sub-index areas. Weights are written into the weight sub-buffer ordered from positive to negative, with zero weights last, and the corresponding input index values and weight index values are written into the index areas in the same order; this sorting of weights and index values is performed offline. During convolution, the weights are read from the weight buffer in sequence according to the weight index values.
Each weight index value is a one-bit weight-change flag indicating whether to switch to a new weight: if the flag is 0, the weight is unchanged and is held for another clock cycle; if the flag is 1, the weight changes, and the next clock cycle reads the next weight in the weight sub-buffer in order.
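A sketch of this offline sorting and index generation might look as follows. The convention that the flag is raised on the first datum of each new weight run is an assumption consistent with the description above, and all names are illustrative.

```python
# Offline preprocessing sketch: order one kernel's weights from positive to
# negative with zeros last, and emit (input index, weight-change flag) pairs.

def sort_weights_offline(weights):
    """weights: flat list of one kernel's weights, indexed by input position.
    Returns (packed_weights, entries)."""
    order = sorted(range(len(weights)),
                   key=lambda i: (weights[i] == 0, -weights[i]))
    packed, entries, prev = [], [], object()   # sentinel flags the first weight
    for i in order:
        w = weights[i]
        flag = 1 if w != prev else 0           # 1 -> read next weight next clock
        if flag:
            packed.append(w)                   # zeros land at the buffer's tail
        entries.append((i, flag))
        prev = w
    return packed, entries

# Example: kernel [0.5, -0.3, 0.5, 0.0, 0.5] gives
#   packed  = [0.5, -0.3, 0.0]
#   entries = [(0, 1), (2, 0), (4, 0), (1, 1), (3, 1)]
```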
The PE unit's two computation-optimization modes are the pre-activation mode and the weight-repeat mode. The pre-activation mode monitors the sign of the convolution's partial sum in real time; if it becomes negative, the PE control unit is notified to terminate the computation and directly output a ReLU result of zero, otherwise the convolution continues. The weight-repeat mode handles convolution operations that share the same weight: the feature-map data corresponding to that weight are added together first, and the sum is then multiplied by the weight once, reducing the number of multiplications and the number of accesses to weight data.
In the weight-repeat mode, while the weight-change flag is 0, the input feature data are accumulated and the running sum is kept in a register; when the weight-change flag is 1, the accumulation is completed, the accumulated partial sum is sent to the multiplication unit to be multiplied by the weight, and the result is stored in a register.
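The accumulate-then-multiply dataflow can be sketched as follows, reusing the packed weights and (input index, flag) entries from the offline sort above. Names are illustrative; a real PE would additionally gate off the trailing zero-weight group entirely, as described in the embodiment.

```python
# Sketch of the weight-repeat dataflow: inputs sharing a weight are summed
# first, so one multiply is spent per distinct weight value.

def weight_repeat_window(inputs, packed_weights, entries):
    """entries/packed_weights come from sort_weights_offline() above."""
    acc, partial, w_idx = 0, 0, -1
    for idx, flag in entries:
        if flag:                                   # weight changes:
            if w_idx >= 0:                         # flush the finished group
                acc += partial * packed_weights[w_idx]
            w_idx += 1
            partial = 0
        partial += inputs[idx]                     # adder-only while flag is 0
    if w_idx >= 0:
        acc += partial * packed_weights[w_idx]     # flush the last group
    return acc
```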
The state controller comprises 7 states: wait, write feature map, write input index, write convolution kernel, write weight index, convolution computation, and send results. Each state sends the corresponding control signals to the relevant submodules to complete the corresponding function.
The data width of the AXI4 bus interface is greater than the bit width of a single weight or feature-map datum, so multiple data are spliced into one wide data word before transmission, improving the data-transfer rate.
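As an illustration of this splicing, the sketch below packs 16-bit words into 64-bit bus beats, matching the embodiment below in which four 16-bit results are spliced into one 64-bit output word; the little-endian word order within a beat is an assumption.

```python
# Sketch of splicing narrow fixed-point words into wide bus beats.

def pack_beats(values, word_bits=16, bus_bits=64):
    per_beat = bus_bits // word_bits               # e.g. 4 words per beat
    mask = (1 << word_bits) - 1
    beats = []
    for i in range(0, len(values), per_beat):
        beat = 0
        for j, v in enumerate(values[i:i + per_beat]):
            beat |= (v & mask) << (j * word_bits)  # word j -> bits 16j..16j+15
        beats.append(beat)
    return beats
```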
With the above scheme, the present invention uses runtime information and algorithm structure during convolution to reduce redundant computation and parameter reads, and accelerates the convolutional neural network on an FPGA hardware platform. This improves the real-time performance of deep CNNs, achieves higher computational performance, and reduces energy consumption.
Brief description of the drawings
Fig. 1 is a structural schematic diagram of the invention;
Fig. 2 is a schematic diagram of the PE structure of the invention;
Fig. 3 is a schematic diagram of the input-index and weight-index operation;
Fig. 4 is a schematic diagram of the operation of the pre-activation unit.
Specific embodiment
The technical solution and beneficial effects of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 shows the hardware architecture of the convolutional neural network accelerator designed by the invention. Taking a PE array size of 16*16, a convolution-kernel size of 3*3, and a kernel stride of 1 as an example, the working process is as follows:
The PC partitions and caches the data into the external memory DDR through a PCI-E interface. The data buffer reads the feature-map data through the AXI4 bus interface and caches it row by row in the 3 feature-map sub-buffers; the input index values are cached in the feature-map sub-buffers in the same manner. The weight data read through the AXI4 bus interface are cached in turn in the 16 kernel sub-buffers, and the weight index values are cached in the kernel sub-buffers in the same manner. The prefetch buffer reads the data of the 3 feature-map sub-buffers in row order, 3*18 16-bit feature-map values in all; in each clock cycle it outputs 16-bit feature-map values in parallel, feeding in 3 feature-map values at a time. The output of the prefetch buffer is fed into the first PE of each row of the PE array and passed on in turn to the adjacent PEs of that row; the input index values are fed into the PE array in the same manner. The input feature-map data are cached in each PE's input sub-buffer, and the input index values in its input index area. The weight data and weight index values pass through the 16 kernel sub-buffers, are input in parallel to the first PE of each column of the PE array, and are passed on in turn to the adjacent PEs of each column, finally being cached in each PE's weight buffer and weight index area. According to its configured optimization mode and the index values, each PE unit reads data from its input sub-buffer and weight sub-buffer and performs the convolution computation; the accumulated results are fed in parallel into the 16 result sub-buffers, with the results of each PE row stored in the same result sub-buffer.
As shown in Fig. 2, a PE unit can be configured into the two optimization modes through the two-bit mode flag S1S0: the pre-activation mode and the weight-repeat mode. When S1S0 is set to 01, the pre-activation mode is selected: the pre-activation unit is started and monitors the partial sums of the multiply-accumulate operation; if a partial sum is negative, the ReLU result 0 is output early and the computation of the current convolution window stops. When S1S0 is set to 10, the weight-repeat mode is selected: the input-summation unit is started, and multiplications that share a weight are handled by performing the additions first; the input data are accumulated into a register until the weight changes, at which point the accumulated sum is sent to the MAC unit for the multiply-accumulate operation. When a weight is 0, the PE unit shuts down the computing unit and outputs the partial sum directly.
Referring to Fig. 3, the PE control unit fetches feature-map data from the input sub-buffer according to the input index values and feeds them to the computing unit in turn. Each weight index value is a one-bit weight-change flag: if it is 0, the weight is unchanged; if it is 1, the next weight is read in order. The weights and index values are arranged from positive to negative with zero weights placed last, and this sorting is performed offline. In the example of Fig. 3, the first four input data share the same weight x, the middle two input data share the same weight y, and the last three input data share the same weight z.
Referring to Fig. 4, once the pre-activation unit is enabled, it compares the partial sum with zero. If the partial sum is greater than zero, the computation continues and the final result is output; if it is less than zero, a terminate signal is sent to the PE control unit, which shuts down the computation and directly outputs the post-ReLU result of zero.
Expanding the convolution into vector multiply-accumulate operations makes the network structure and the hardware architecture match better; computation is simplified according to runtime information and algorithm structure, improving computational efficiency and reducing energy consumption. The state-transition process of this embodiment is as follows:
After initialization, the accelerator enters the wait state, in which the state controller waits for the status signal sent by the AXI4 bus interface. When the status is 00001, it enters the write-convolution-kernel state; when the status is 00010, the write-weight-index state; when the status is 00100, the write-feature-map state; when the status is 01000, the write-input-index state. Once all data have been received and the status is 10000, it enters the convolution-computation state. When the computation finishes, it jumps automatically to the send-results state, and jumps back to the wait state once transmission completes.
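This controller can be sketched as follows, using the one-hot status codes quoted above; the state names, the completion signals, and the assumption that each write state falls back to the wait state when its transfer ends are illustrative.

```python
# Sketch of the 7-state controller driven by the AXI4 status word.

WAIT, WR_KERNEL, WR_WIDX, WR_FMAP, WR_IIDX, CONV, SEND = range(7)

DISPATCH = {                 # AXI4 status word -> target state
    0b00001: WR_KERNEL,      # write convolution kernels
    0b00010: WR_WIDX,        # write weight indices
    0b00100: WR_FMAP,        # write feature maps
    0b01000: WR_IIDX,        # write input indices
    0b10000: CONV,           # all data received: start convolution
}

def next_state(state, status, write_done=False, conv_done=False, sent=False):
    if state == WAIT:
        return DISPATCH.get(status, WAIT)
    if state in (WR_KERNEL, WR_WIDX, WR_FMAP, WR_IIDX):
        return WAIT if write_done else state      # assumed return path
    if state == CONV:
        return SEND if conv_done else CONV        # jumps to SEND automatically
    return WAIT if sent else SEND                 # back to WAIT after sending
```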
Write feature map: on entering this state, the controller waits for the AXI4 bus interface's data-valid signal to be raised and enables the 3 feature-map sub-buffers in turn. The first sub-buffer stores row 1 of the feature map, the second sub-buffer row 2, and the third sub-buffer row 3; row 4 wraps around and is stored in the first sub-buffer, and so on. After the feature-map data have been stored in this order, the first clock cycle sends the first datum of feature-map rows 1, 2, and 3, held in the three sub-buffers, into the prefetch buffer; the second clock cycle sends the first datum of rows 4, 5, and 6; and so on. After the rows have been traversed, the second, third, ... data are fetched into the prefetch buffer in the same order. The prefetch buffer holds 3*18 feature-map data; once storage completes, each clock cycle outputs 16 feature-map data in parallel into the first PE of each row of the PE array, which are passed on in turn to the adjacent PEs of each row and finally stored in the PEs' input sub-buffers.
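A sketch of this row-interleaved storage and windowed prefetch follows, assuming 3 sub-buffers for a 3*3 kernel and an 18-wide stripe (16 window positions plus 2 halo columns at stride 1); all names are illustrative.

```python
# Sketch of round-robin row storage: row r of the input map lands in
# sub-buffer r % 3, so the three rows under a 3*3 window are readable
# in the same cycle.

def store_rows(fmap, n_banks=3):
    banks = [[] for _ in range(n_banks)]
    for r, row in enumerate(fmap):
        banks[r % n_banks].append(row)        # round-robin by row
    return banks

def fetch_window_rows(banks, top_row, col, width=18):
    """One 3 x 18 stripe: 16 window positions plus 2 halo columns, stride 1."""
    n = len(banks)
    return [banks[(top_row + k) % n][(top_row + k) // n][col:col + width]
            for k in range(3)]
```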
Write input index: on entering this state, the input index values are stored following the feature-map data pattern, finally ending up in the PEs' input index areas.
Write convolution kernel: on entering this state, the controller waits for the AXI4 bus interface's data-valid signal to be raised and enables the 16 kernel sub-buffers in turn. The first kernel sub-buffer stores the kernel values of the first output channel, the second stores those of the second output channel, and so on. After the 16 kernel sub-buffers have been filled, each sub-buffer outputs one datum per clock; the 16 weight data are input in parallel to the first PE of each column of the PE array and passed on in turn down each column, finally being cached in the weight buffer inside each PE unit.
Write weight index: on entering this state, the weight index values are stored following the kernel-data pattern, finally ending up in the PEs' weight index areas.
Convolution computation: on entering this state, the PE control unit configures the PE unit's optimization mode according to the mode flag S1S0 and, following the weight index values and input index values, reads data from the weight sub-buffer and the input sub-buffer and feeds them into the MAC unit. After 3*3*(number of input channels) multiply-accumulate operations have been performed, all data have been computed, and the next clock jumps to the send-results state.
Send results: on entering this state, the results are read out in turn from the 16 result sub-buffers. The data of the first output channel are taken from each result sub-buffer, and every four values are spliced into a 64-bit output word and sent to the external memory DDR through the AXI4 bus interface. Once the data of all 16 output channels have been sent to DDR, the accelerator jumps back to the wait state.
Parameters can be modified through the state controller, allowing the image size, convolution-kernel size, stride, output feature-map size, and number of output channels to be changed at run time. By using runtime status and algorithm structure to skip redundant computation, unnecessary calculation and memory accesses are reduced, the efficiency of the convolutional neural network accelerator is improved, and energy consumption is lowered.
The above embodiment merely illustrates the technical idea of the present invention and does not limit its scope of protection; any change made to the technical scheme on the basis of the technical idea proposed by the invention falls within the scope of protection of the invention.

Claims (9)

1. An FPGA-based computation-optimized convolutional neural network accelerator, characterized in that it comprises an AXI4 bus interface, a data buffer, a prefetch buffer, a result buffer, a state controller, and a PE array;
the data buffer caches the feature-map data, convolution-kernel data, and index values read from the external memory DDR through the AXI4 bus interface, and comprises M feature-map sub-buffers and C kernel sub-buffers;
the prefetch buffer prefetches, from the feature-map sub-buffers, the feature-map data to be fed into the PE array in parallel;
the PE array is implemented on the FPGA and comprises R*C PE units; each column of PE units is assigned one kernel sub-buffer, and the number of feature-map sub-buffers actually used is determined by the parameters of the layer being computed; the PE array reads the data in the prefetch buffer and the kernel sub-buffers and performs the convolution operations, PE units in different columns computing different output feature maps and PEs in different rows computing different rows of the same output feature map;
the result buffer comprises R result sub-buffers, one per row of PE units, which cache the computation results of each row of PE units;
the state controller governs the accelerator's working states and implements the transitions between them.
2. The FPGA-based computation-optimized convolutional neural network accelerator according to claim 1, characterized in that: each PE unit comprises an input buffer, a weight buffer, an input index area, a weight index area, a PE control unit, a pre-activation unit, and a multiply-accumulate (MAC) unit, wherein the input buffer and the weight buffer store the feature-map data and weight data required by the convolution, and the input index area and the weight index area store the index values used to look up the feature-map data and weight data; the PE control unit governs the PE unit's working state: it reads the index values from the index areas, fetches data from the buffers according to the index values, feeds the data into the MAC unit, configures the MAC unit's mode, and decides whether to enable the pre-activation unit; the pre-activation unit monitors the convolution's partial sum and, if the partial sum drops below 0, stops the computation and outputs 0; the MAC unit performs the convolution and can be configured either for normal multiply-accumulate computation or for the optimization mode that exploits repeated weights.
3. The FPGA-based computation-optimized convolutional neural network accelerator according to claim 2, characterized in that: the PE control unit determines whether the MAC unit's optimization mode is the pre-activation mode or the weight-repeat mode, selecting a different optimization mode for each layer; the mode is determined by a two-bit flag, with a high bit of 0 selecting normal multiply-accumulate computation and 1 selecting the optimization mode that exploits repeated weights, and a low bit of 0 disabling pre-activation and 1 enabling the pre-activation mode.
4. The FPGA-based computation-optimized convolutional neural network accelerator according to claim 2, characterized in that: the weight index area comprises multiple weight sub-index areas; the weights are written into the weight sub-buffer ordered from positive to negative with zero weights last, and the corresponding input index values and weight index values are written into the index areas in the same order; the sorting of weights and index values is performed offline; during convolution, the weights are read from the weight buffer in sequence according to the weight index values.
5. The FPGA-based computation-optimized convolutional neural network accelerator according to claim 4, characterized in that: each weight index value is a one-bit weight-change flag indicating whether to switch to a new weight; if the flag is 0, the weight is unchanged and is held for another clock cycle; if the flag is 1, the weight changes, and the next clock cycle reads the next weight in the weight sub-buffer in order.
6. The FPGA-based computation-optimized convolutional neural network accelerator according to claim 1, characterized in that: each PE unit supports two computation-optimization modes, namely the pre-activation mode and the weight-repeat mode; the pre-activation mode monitors the sign of the convolution's partial sum in real time, terminating the computation if it becomes negative and directly outputting a ReLU result of zero, and continuing the convolution otherwise; the weight-repeat mode handles convolution operations that share the same weight by first adding the corresponding feature-map data together and then multiplying the sum by the weight, reducing the number of multiplications and the number of accesses to weight data.
7. The FPGA-based computation-optimized convolutional neural network accelerator according to claim 6, characterized in that: in the weight-repeat mode, while the weight-change flag is 0, the input feature data are accumulated and the accumulation result is kept in a register; when the weight-change flag is 1, the accumulation is completed, the accumulated partial sum is sent to the multiplication unit to be multiplied by the weight, and the result is stored in a register.
8. The FPGA-based computation-optimized convolutional neural network accelerator according to claim 1, characterized in that: the state controller comprises 7 states: wait, write feature map, write input index, write convolution kernel, write weight index, convolution computation, and send results; each state sends the corresponding control signals to the relevant submodules to complete the corresponding function.
9. The FPGA-based computation-optimized convolutional neural network accelerator according to claim 1, characterized in that: the AXI4 bus interface splices multiple data into one wide data word for transmission.
CN201811493592.XA 2018-12-07 2018-12-07 Convolutional neural network accelerator based on FPGA (field programmable Gate array) for calculation optimization Active CN109598338B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811493592.XA CN109598338B (en) 2018-12-07 2018-12-07 Convolutional neural network accelerator based on FPGA (field programmable Gate array) for calculation optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811493592.XA CN109598338B (en) 2018-12-07 2018-12-07 Convolutional neural network accelerator based on FPGA (field programmable Gate array) for calculation optimization

Publications (2)

Publication Number Publication Date
CN109598338A (en) 2019-04-09
CN109598338B CN109598338B (en) 2023-05-19

Family

ID=65961420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811493592.XA Active CN109598338B (en) 2018-12-07 2018-12-07 Convolutional neural network accelerator based on FPGA (field programmable Gate array) for calculation optimization

Country Status (1)

Country Link
CN (1) CN109598338B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100076915A1 (en) * 2008-09-25 2010-03-25 Microsoft Corporation Field-Programmable Gate Array Based Accelerator System
US20180032859A1 (en) * 2016-07-27 2018-02-01 Samsung Electronics Co., Ltd. Accelerator in convolutional neural network and method for operating the same
CN108241890A (en) * 2018-01-29 2018-07-03 清华大学 A kind of restructural neural network accelerated method and framework
CN108537334A (en) * 2018-04-26 2018-09-14 济南浪潮高新科技投资发展有限公司 A kind of acceleration array design methodology for CNN convolutional layer operations
CN108805272A (en) * 2018-05-03 2018-11-13 东南大学 A kind of general convolutional neural networks accelerator based on FPGA
CN108665059A (en) * 2018-05-22 2018-10-16 中国科学技术大学苏州研究院 Convolutional neural networks acceleration system based on field programmable gate array

Cited By (51)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097174A (en) * 2019-04-22 2019-08-06 西安交通大学 Preferential convolutional neural networks implementation method, system and device are exported based on FPGA and row
CN110097174B (en) * 2019-04-22 2021-04-20 西安交通大学 Method, system and device for realizing convolutional neural network based on FPGA and row output priority
CN110222835A (en) * 2019-05-13 2019-09-10 西安交通大学 A kind of convolutional neural networks hardware system and operation method based on zero value detection
CN110163295A (en) * 2019-05-29 2019-08-23 四川智盈科技有限公司 It is a kind of based on the image recognition reasoning accelerated method terminated in advance
CN110059808A (en) * 2019-06-24 2019-07-26 深兰人工智能芯片研究院(江苏)有限公司 A kind of method for reading data and reading data device of convolutional neural networks
CN110390383B (en) * 2019-06-25 2021-04-06 东南大学 Deep neural network hardware accelerator based on power exponent quantization
WO2020258527A1 (en) * 2019-06-25 2020-12-30 东南大学 Deep neural network hardware accelerator based on power exponent quantisation
WO2020258528A1 (en) * 2019-06-25 2020-12-30 东南大学 Configurable universal convolutional neural network accelerator
WO2020258841A1 (en) * 2019-06-25 2020-12-30 东南大学 Deep neural network hardware accelerator based on power exponent quantisation
CN110390384A (en) * 2019-06-25 2019-10-29 东南大学 A kind of configurable general convolutional neural networks accelerator
CN110390384B (en) * 2019-06-25 2021-07-06 东南大学 Configurable general convolutional neural network accelerator
CN110390383A (en) * 2019-06-25 2019-10-29 东南大学 A kind of deep neural network hardware accelerator based on power exponent quantization
CN110399883A (en) * 2019-06-28 2019-11-01 苏州浪潮智能科技有限公司 Image characteristic extracting method, device, equipment and computer readable storage medium
CN110378468A (en) * 2019-07-08 2019-10-25 浙江大学 A kind of neural network accelerator quantified based on structuring beta pruning and low bit
WO2021004366A1 (en) * 2019-07-08 2021-01-14 浙江大学 Neural network accelerator based on structured pruning and low-bit quantization, and method
CN110414677A (en) * 2019-07-11 2019-11-05 东南大学 It is a kind of to deposit interior counting circuit suitable for connect binaryzation neural network entirely
WO2021031154A1 (en) * 2019-08-21 2021-02-25 深圳市大疆创新科技有限公司 Method and device for loading feature map of neural network
CN110673786A (en) * 2019-09-03 2020-01-10 浪潮电子信息产业股份有限公司 Data caching method and device
CN110673786B (en) * 2019-09-03 2020-11-10 浪潮电子信息产业股份有限公司 Data caching method and device
CN110705687A (en) * 2019-09-05 2020-01-17 北京三快在线科技有限公司 Convolution neural network hardware computing device and method
CN110738312A (en) * 2019-10-15 2020-01-31 百度在线网络技术(北京)有限公司 Method, system, device and computer readable storage medium for data processing
CN110910434A (en) * 2019-11-05 2020-03-24 东南大学 Method for realizing deep learning parallax estimation algorithm based on FPGA (field programmable Gate array) high energy efficiency
CN110910434B (en) * 2019-11-05 2023-05-12 东南大学 Method for realizing deep learning parallax estimation algorithm based on FPGA (field programmable Gate array) high energy efficiency
CN111062472A (en) * 2019-12-11 2020-04-24 浙江大学 Sparse neural network accelerator based on structured pruning and acceleration method thereof
CN111178519B (en) * 2019-12-27 2022-08-02 华中科技大学 Convolutional neural network acceleration engine, convolutional neural network acceleration system and method
CN111178519A (en) * 2019-12-27 2020-05-19 华中科技大学 Convolutional neural network acceleration engine, convolutional neural network acceleration system and method
CN113095471A (en) * 2020-01-09 2021-07-09 北京君正集成电路股份有限公司 Method for improving efficiency of detection model
CN113095471B (en) * 2020-01-09 2024-05-07 北京君正集成电路股份有限公司 Method for improving efficiency of detection model
CN113111995A (en) * 2020-01-09 2021-07-13 北京君正集成电路股份有限公司 Method for shortening model reasoning and model post-processing operation time
CN111414994A (en) * 2020-03-03 2020-07-14 哈尔滨工业大学 FPGA-based Yolov3 network computing acceleration system and acceleration method thereof
CN111416743A (en) * 2020-03-19 2020-07-14 华中科技大学 Convolutional network accelerator, configuration method and computer readable storage medium
CN111340198B (en) * 2020-03-26 2023-05-05 上海大学 Neural network accelerator for data high multiplexing based on FPGA
CN111340198A (en) * 2020-03-26 2020-06-26 上海大学 Neural network accelerator with highly-multiplexed data based on FPGA (field programmable Gate array)
CN111898733B (en) * 2020-07-02 2022-10-25 西安交通大学 Deep separable convolutional neural network accelerator architecture
CN111898733A (en) * 2020-07-02 2020-11-06 西安交通大学 Deep separable convolutional neural network accelerator architecture
CN111984548A (en) * 2020-07-22 2020-11-24 深圳云天励飞技术有限公司 Neural network computing device
CN111984548B (en) * 2020-07-22 2024-04-02 深圳云天励飞技术股份有限公司 Neural network computing device
CN112149814A (en) * 2020-09-23 2020-12-29 哈尔滨理工大学 Convolutional neural network acceleration system based on FPGA
CN112187954A (en) * 2020-10-15 2021-01-05 中国电子科技集团公司第五十四研究所 Flow control method of offline file in measurement and control data link transmission
CN112580793B (en) * 2020-12-24 2022-08-12 清华大学 Neural network accelerator based on time domain memory computing and acceleration method
CN112580793A (en) * 2020-12-24 2021-03-30 清华大学 Neural network accelerator based on time domain memory computing and acceleration method
WO2022134688A1 (en) * 2020-12-25 2022-06-30 中科寒武纪科技股份有限公司 Data processing circuit, data processing method, and related products
CN112668708B (en) * 2020-12-28 2022-10-14 中国电子科技集团公司第五十二研究所 Convolution operation device for improving data utilization rate
CN112668708A (en) * 2020-12-28 2021-04-16 中国电子科技集团公司第五十二研究所 Convolution operation device for improving data utilization rate
CN113094118A (en) * 2021-04-26 2021-07-09 深圳思谋信息科技有限公司 Data processing system, method, apparatus, computer device and storage medium
CN113780529B (en) * 2021-09-08 2023-09-12 北京航空航天大学杭州创新研究院 FPGA-oriented sparse convolutional neural network multi-stage storage computing system
CN113780529A (en) * 2021-09-08 2021-12-10 北京航空航天大学杭州创新研究院 FPGA-oriented sparse convolution neural network multi-level storage computing system
CN113869494A (en) * 2021-09-28 2021-12-31 天津大学 Neural network convolution FPGA embedded hardware accelerator based on high-level synthesis
CN114780910A (en) * 2022-06-16 2022-07-22 千芯半导体科技(北京)有限公司 Hardware system and calculation method for sparse convolution calculation
CN115311536A (en) * 2022-10-11 2022-11-08 绍兴埃瓦科技有限公司 Sparse convolution processing method and device in image processing
CN116187408A (en) * 2023-04-23 2023-05-30 成都甄识科技有限公司 Sparse acceleration unit, calculation method and sparse neural network hardware acceleration system

Also Published As

Publication number Publication date
CN109598338B (en) 2023-05-19

Similar Documents

Publication Publication Date Title
CN109598338A (en) A kind of convolutional neural networks accelerator of the calculation optimization based on FPGA
Lu et al. An efficient hardware accelerator for sparse convolutional neural networks on FPGAs
CN111242289B (en) Convolutional neural network acceleration system and method with expandable scale
CN107169563B (en) Processing system and method applied to two-value weight convolutional network
CN110390383A (en) A kind of deep neural network hardware accelerator based on power exponent quantization
CN108805272A (en) A kind of general convolutional neural networks accelerator based on FPGA
CN110390384A (en) A kind of configurable general convolutional neural networks accelerator
CN108665059A (en) Convolutional neural networks acceleration system based on field programmable gate array
CN107437110A (en) The piecemeal convolution optimization method and device of convolutional neural networks
CN108197705A (en) Convolutional neural networks hardware accelerator and convolutional calculation method and storage medium
CN103699360B (en) A kind of vector processor and carry out vector data access, mutual method
CN108537331A (en) A kind of restructural convolutional neural networks accelerating circuit based on asynchronous logic
CN112487750B (en) Convolution acceleration computing system and method based on in-memory computing
CN109325591A (en) Neural network processor towards Winograd convolution
CN110163355A (en) A kind of computing device and method
Liu et al. FPGA-NHAP: A general FPGA-based neuromorphic hardware acceleration platform with high speed and low power
Zhang et al. Implementation and optimization of the accelerator based on FPGA hardware for LSTM network
Duan et al. Energy-efficient architecture for FPGA-based deep convolutional neural networks with binary weights
Liu et al. CASSANN-v2: A high-performance CNN accelerator architecture with on-chip memory self-adaptive tuning
CN109240644A (en) A kind of local search approach and circuit for Yi Xin chip
Tao et al. Hima: A fast and scalable history-based memory access engine for differentiable neural computer
Zhou et al. Mat: Processing in-memory acceleration for long-sequence attention
CN116822600A (en) Neural network search chip based on RISC-V architecture
Feng et al. An Efficient Model-Compressed EEGNet Accelerator for Generalized Brain-Computer Interfaces With Near Sensor Intelligence
CN103365821A (en) Address generator of heterogeneous multi-core processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant