CN109598338A - A computation-optimized convolutional neural network accelerator based on FPGA - Google Patents
A computation-optimized convolutional neural network accelerator based on FPGA
- Publication number: CN109598338A (application number CN201811493592.XA)
- Authority
- CN
- China
- Prior art keywords
- weight
- data
- buffer area
- area
- unit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The present invention discloses a computation-optimized convolutional neural network accelerator based on FPGA, comprising an AXI4 bus interface, a data buffer area, a prefetch data area, a result cache area, a state controller, and a PE array. The data buffer area caches the feature-map data, convolution-kernel data, and index values read from the external memory DDR through the AXI4 bus interface; the prefetch data area prefetches, from the feature-map sub-buffers, the feature-map data that must be fed into the PE array in parallel; the result cache area caches the calculation results of each PE row; the state controller controls the accelerator's working states and implements the transitions between them; and the PE array reads the data in the prefetch data area and the convolution-kernel sub-buffers and performs the convolution operations. By exploiting parameter sparsity, repeated weight values, and the properties of the ReLU activation function, the accelerator terminates redundant computation early, reducing the amount of calculation, and lowers energy consumption by reducing the number of memory accesses.
Description
Technical field
The invention belongs to the fields of electronic information and deep learning, and in particular relates to the hardware architecture of a computation-optimized convolutional neural network accelerator based on an FPGA (Field Programmable Gate Array).
Background art
In recent years, deep neural networks have developed rapidly and have had a significant impact on the world economy and on social life. Deep convolutional neural network (DCNN) technology has attracted wide attention in many machine-learning fields, including speech recognition, natural language processing, and intelligent image processing; in image recognition in particular, deep convolutional neural networks have achieved remarkable results and can in some cases surpass human accuracy. Their strength lies in their ability to learn statistically from massive amounts of data and extract high-level features from raw input.
Deep convolutional neural networks are notoriously computation-intensive: convolution operations account for more than 90% of the total operation count. Using the runtime information and algorithmic structure of the convolution computation to reduce this workload, that is, to reduce the work required for inference, has therefore become a major new research direction.
The high accuracy of deep convolutional neural networks comes at the cost of high computational complexity. Besides being computation-intensive, a convolutional neural network may need to store millions or even close to a hundred million parameters. Networks of this scale pose throughput and energy-efficiency challenges for the underlying acceleration hardware.
At present, various accelerators based on FPGAs, GPUs (Graphics Processing Units), and ASICs (Application Specific Integrated Circuits) have been proposed to improve the performance of deep convolutional neural networks. FPGA-based accelerators are widely studied because of their good performance, high energy efficiency, short development cycle, and strong reconfigurability. Unlike general-purpose architectures, an FPGA allows the user to customize the function of the hardware to suit different resource budgets and data-usage patterns.
The present invention arises from the above analysis and from the problem that convolution computation in the prior art involves an excessive amount of redundant calculation.
Summary of the invention
The purpose of the present invention is to provide a computation-optimized convolutional neural network accelerator based on FPGA that exploits parameter sparsity, repeated weight values, and the properties of the ReLU activation function to terminate redundant computation early, reduce the amount of calculation, and lower energy consumption by reducing the number of memory accesses.
To achieve the above objective, the solution of the invention is as follows:
A computation-optimized convolutional neural network accelerator based on FPGA comprises an AXI4 bus interface, a data buffer area, a prefetch data area, a result cache area, a state controller, and a PE array.
The AXI4 bus interface is a generic bus interface; the accelerator can be mounted on and operate in any bus device that uses the AXI4 protocol.
The data buffer area caches the feature-map data, convolution-kernel data, and index values read from the external memory DDR through the AXI4 bus interface. It comprises M feature-map sub-buffers and C convolution-kernel sub-buffers; each PE column is assigned one convolution-kernel sub-buffer, and the number of feature-map sub-buffers actually used is determined by the parameters of the layer currently being computed. The number of feature-map sub-buffers M is determined by the current layer's convolution-kernel size, output feature-map size, and convolution-window offset.
The prefetch data area prefetches, from the feature-map sub-buffers, the feature-map data that must be fed into the PE array in parallel.
The result cache area comprises R result sub-buffers, one per PE row, and caches the calculation results of each PE row.
The state controller controls the accelerator's working states and implements the transitions between them.
The PE array is implemented on the FPGA. It reads the data in the prefetch data area and the convolution-kernel sub-buffers and performs the convolution operations: PEs in different columns compute different output feature maps, and PEs in different rows compute different rows of the same output feature map. The array contains R*C PE units, and each PE unit supports two calculation-optimization modes: pre-activation mode and weight-repeat mode.
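As a rough sketch of this row/column partitioning (the modulo assignment and the function name below are illustrative assumptions, not taken from the patent):

```python
# Illustrative sketch: columns of the R x C PE array compute different
# output feature maps (channels); rows compute different output rows of
# the same map. The modulo assignment is an assumption for illustration.
R, C = 16, 16  # PE array size used in the patent's embodiment

def pe_for(out_channel, out_row):
    """Return the (array row, array column) of the PE that would compute
    one output element under this assumed mapping."""
    return (out_row % R, out_channel % C)
```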
The above PE unit comprises an input buffer, a weight buffer, an input index area, a weight index area, a PE control unit, a pre-activation unit, and a configurable multiply-accumulate unit. The input buffer and weight buffer store the feature-map data and weight data required for the convolution computation; the input index area and weight index area store the index values used to look up the feature-map data and weight data. The PE control unit controls the PE unit's working state: it reads index values from the index areas, reads the corresponding data from the buffers according to those index values, feeds the data into the multiply-accumulate unit, configures the multiply-accumulate unit's mode, and decides whether to enable the pre-activation unit. The pre-activation unit monitors the partial sum of the convolution computation; if the partial sum is less than 0, it stops the computation and outputs 0. The multiply-accumulate unit performs the convolution computation and can be configured either for normal multiply-accumulate operation or for the optimized mode that exploits repeated weights.
The above PE control unit determines whether the multiply-accumulate unit's optimization mode is pre-activation mode or weight-repeat mode, selecting a different optimization mode for each layer. The mode is determined by a two-bit mode flag: a high bit of 0 means normal multiply-accumulate computation; a high bit of 1 means the optimized mode that exploits repeated weights; a low bit of 0 means no pre-activation; a low bit of 1 means pre-activation mode.
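A minimal sketch of decoding this two-bit flag (the dictionary keys are assumed names, only the bit meanings come from the text):

```python
# Sketch of the two-bit mode flag: the high bit selects the weight-repeat
# MAC mode, the low bit enables the pre-activation unit.
def decode_mode(s1, s0):
    return {
        "weight_repeat": s1 == 1,  # high bit 1: exploit repeated weights
        "pre_activate": s0 == 1,   # low bit 1: early-exit on negative sums
    }
```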
The above weight index area stores multiple weight indices. The weights are written into the weight sub-buffer ordered from positive to negative, with zero-valued weights last, and the corresponding input index values and weight index values are written into the index areas in the same order. Sorting the weights and index values is done offline. During convolution computation, the weights are read from the weight buffer in sequence according to the weight index values.
The above weight index value is a one-bit weight-change flag indicating whether the weight used in the computation must be replaced. When the flag is 0, the weight is unchanged and is reused for another clock cycle; when the flag is 1, the weight changes, and the next clock cycle reads the next weight, in order, from the weight sub-buffer.
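A sketch of this offline preprocessing step, under two stated assumptions: equal weights are grouped by sorting on value within each sign group, and flag 1 marks the last operand of a run of equal weights (the exact flag timing in hardware may differ):

```python
# Offline preprocessing sketch: order operand positions so positive weights
# come first, then negative weights, with zero weights last; store each
# distinct run of equal weights once; emit a per-operand change flag
# (1 = the weight changes after this operand, an assumed convention) and
# keep the original position of each operand as its input index.
def preprocess(weights):
    # Sort by (is-zero, is-negative, -value) so zeros go last, positives
    # come first, and equal weights become adjacent within each group.
    order = sorted(range(len(weights)),
                   key=lambda i: (weights[i] == 0, weights[i] < 0, -weights[i]))
    stored, flags = [], []
    for k, i in enumerate(order):
        w = weights[i]
        if not stored or stored[-1] != w:
            stored.append(w)          # weight sub-buffer holds distinct runs
        nxt = weights[order[k + 1]] if k + 1 < len(order) else None
        flags.append(0 if nxt == w else 1)
    return stored, flags, order       # `order` doubles as the input index
```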
The above PE unit supports two calculation-optimization modes: pre-activation mode and weight-repeat mode. Pre-activation mode monitors the sign of the partial sum in real time; if it becomes negative, the PE control unit is notified to terminate the computation and the ReLU result zero is output directly, while if it stays positive the convolution computation continues. Weight-repeat mode applies to convolution operations that share the same weight: the feature-map data corresponding to the same weight are added first and the sum is then multiplied by that weight, reducing both the number of multiplications and the number of accesses to weight data.
In the above weight-repeat mode, while the weight-change flag is 0 the input feature-map values are accumulated and the running sum is kept in a register. When the weight-change flag is 1, the accumulation is completed, the accumulated partial sum is fed into the multiplication unit to be multiplied by the weight, and the result is stored in a register.
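A behavioral sketch of the weight-repeat mode, assuming the flag convention that 1 marks the last operand of a run of equal weights:

```python
# Sketch of the weight-repeat mode: while the change flag is 0 the inputs
# are only accumulated; when it is 1 the accumulated sum is multiplied by
# the current stored weight once, the accumulator is cleared, and the next
# stored weight is selected.
def weight_repeat_mac(inputs, stored_weights, flags):
    acc = total = 0
    w_ptr = 0
    for x, f in zip(inputs, flags):
        acc += x                                  # add first ...
        if f:
            total += acc * stored_weights[w_ptr]  # ... one multiply per run
            acc, w_ptr = 0, w_ptr + 1
    return total
```

For four inputs sharing one weight this costs one multiplication instead of four, which is the saving the text describes.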
The above state controller consists of 7 states: wait, write feature map, write input index, write convolution kernel, write weight index, convolution computation, and send calculation results. Each state sends the corresponding control signals to the corresponding submodules to complete the corresponding function.
The data width of the above AXI4 bus interface is greater than that of a single weight or feature-map value, so multiple data items are spliced into one wide word before being sent, improving the data-transfer speed.
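A sketch of this splicing for the 64-bit case used later in the embodiment, where four 16-bit results share one bus word (the little-endian lane order is an assumption):

```python
# Splice four 16-bit values into one 64-bit AXI4 beat, and the inverse.
def pack4x16(vals):
    word = 0
    for lane, v in enumerate(vals):
        word |= (v & 0xFFFF) << (16 * lane)  # lane 0 in the low 16 bits
    return word

def unpack4x16(word):
    return [(word >> (16 * lane)) & 0xFFFF for lane in range(4)]
```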
With the above scheme, the present invention uses the runtime information and algorithmic structure of the convolution computation to eliminate redundant calculations and unnecessary parameter reads, and accelerates the convolutional neural network on an FPGA hardware platform. This improves the real-time performance of the DCNN, achieves higher computational performance, and reduces energy consumption.
Brief description of the drawings
Fig. 1 is a structural schematic diagram of the invention;
Fig. 2 is a schematic diagram of the PE structure of the invention;
Fig. 3 is a schematic diagram of input-index and weight-index operation;
Fig. 4 is a schematic diagram of the operation of the pre-activation unit.
Specific embodiment
The technical solution and beneficial effects of the present invention are described in detail below with reference to the accompanying drawings.
As shown in Fig. 1, the convolutional neural network accelerator hardware architecture designed by the present invention works as follows, taking a PE array of size 16*16, a convolution kernel of size 3*3, and a convolution stride of 1 as an example:
The PC partitions and caches the data in the external memory DDR through a PCI-E interface. The data buffer area reads the feature-map data through the AXI4 bus interface and caches it row by row in 3 feature-map sub-buffers; the input index values are cached in the feature-map sub-buffers in the same manner. The weight data read through the AXI4 bus interface are cached in turn in 16 convolution-kernel sub-buffers, and the weight index values are cached in the convolution-kernel sub-buffers in the same manner. The prefetch buffer reads the 3 feature-map sub-buffers sequentially by row, reading 3*18 16-bit feature-map values in total; each clock cycle it outputs 16 feature-map values in parallel and feeds in 3 feature-map values in parallel. The output of the prefetch buffer is fed into the first PE of each PE row and passed on in turn to the adjacent PEs in the row; the input index values are fed into the PE array in the same manner. The input feature-map data are cached in each PE's input sub-buffer, and the input index values in its input index area. The weight data and weight index values pass through the 16 convolution-kernel sub-buffers, are fed in parallel into the first PE of each PE column, and are passed on in turn to the adjacent PEs in the column, finally being cached in the weight buffer and weight index area inside each PE. According to its configured optimization mode and the index values, each PE unit reads data from its input sub-buffer and weight sub-buffer, performs the convolution computation, and feeds the accumulated results in parallel into the 16 result sub-buffers; the calculation results of each PE row are stored in the same result sub-buffer.
As shown in Fig. 2, a PE unit can be configured into the two calculation-optimization modes, pre-activation mode and weight-repeat mode, through the two-bit mode flag S1S0. When S1S0 is configured as 01, the mode is pre-activation: the pre-activation unit is started to monitor the partial-sum result of the multiply-accumulate operation, and if the partial-sum value is negative, the ReLU result 0 is output in advance and the computation of the current convolution window is stopped. When S1S0 is configured as 10, the mode is weight-repeat: the input summing unit is started and multiplications sharing the same weight are handled by addition first; the input data are accumulated into a register until the weight changes, at which point the accumulated result is fed into the multiply-accumulate unit for the multiply-accumulate operation. When a weight is 0, the PE unit shuts down the computing unit and outputs the partial sum directly as the result.
Referring to Fig. 3, the PE control unit fetches feature-map data from the input sub-buffer in turn according to the input index values and feeds it into the computing unit. The weight index value is a one-bit weight-change flag: if the weight index value is 0 the weight is unchanged, and if it is 1 the next weight is read in order. The weights and index values are sorted from positive to negative with zero-valued weights placed last; this work is completed offline. As shown in Fig. 3, the first four input values share the same weight x, the middle two input values share the same weight y, and the last three input values share the same weight z.
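The Fig. 3 example can be expressed as a flag stream (x, y, z stand for the three stored weight values; flag 1 meaning "advance to the next weight after this operand" is an assumed convention consistent with the description):

```python
# Four operands share weight x, two share y, three share z.
change_flags = [0, 0, 0, 1, 0, 1, 0, 0, 1]
stored_weights = ["x", "y", "z"]

fetched, ptr = [], 0
for f in change_flags:
    fetched.append(stored_weights[ptr])  # reuse the current weight
    if f:
        ptr += 1                         # flag 1: next cycle reads a new weight
```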
Referring to Fig. 4, once the pre-activation unit is enabled, it compares the partial-sum value with zero. If the partial sum is greater than zero, the computation continues and the final result is output; if the partial sum is less than zero, a terminate-computation signal is sent to the PE control unit, which shuts down the computation and directly outputs the post-ReLU result of zero.
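A behavioral sketch of this early termination. Note one assumption beyond the text: the shortcut is exact when the weights are processed positive-first (as in the offline sort) and the inputs are nonnegative, e.g. outputs of a previous ReLU layer, because the partial sum then only decreases once it has gone negative; the patent itself simply monitors the sign:

```python
# Pre-activation sketch: watch the running partial sum and, once it goes
# negative, stop and output ReLU's zero immediately.
def pre_activated_mac(pairs):
    """pairs: (input, weight) operands, weights sorted positive-first."""
    partial = 0
    for x, w in pairs:
        partial += x * w
        if partial < 0:          # early-termination signal to the control unit
            return 0             # post-ReLU result
    return max(partial, 0)       # ordinary ReLU on the completed sum
```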
The convolution operation is unrolled into vector multiply-accumulate operations so that the network structure and the hardware architecture match better; the computation is simplified according to runtime information and algorithmic structure, improving computational efficiency and reducing energy consumption. The specific state-transition process of this embodiment is as follows:
After initialization, the accelerator enters the wait state, in which the state controller waits for the status signal sent by the AXI4 bus interface. When the status is 00001, it enters the write-convolution-kernel state; when the status is 00010, the write-weight-index state; when the status is 00100, the write-feature-map state; when the status is 01000, the write-input-index state. After the data have been received, it waits for status 10000 and enters the convolution-computation state. When the computation finishes, it jumps automatically to the send-calculation-results state, and after the transfer completes it jumps back to the wait state.
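The transitions above can be sketched as a small table-driven state machine (state names are paraphrased; only the one-hot status codes and the transition order come from the text):

```python
# One-hot bus status codes from the description.
STATUS_TO_STATE = {
    0b00001: "WRITE_KERNEL",
    0b00010: "WRITE_WEIGHT_INDEX",
    0b00100: "WRITE_FEATURE_MAP",
    0b01000: "WRITE_INPUT_INDEX",
    0b10000: "COMPUTE",
}

def next_state(state, bus_status=0):
    if state == "WAIT":
        return STATUS_TO_STATE.get(bus_status, "WAIT")
    if state == "COMPUTE":
        return "SEND_RESULTS"  # jumps automatically when the layer finishes
    return "WAIT"              # write states and SEND_RESULTS return to WAIT
```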
Write feature map: after entering this state, the accelerator waits for the AXI4 bus interface's data-valid signal to be raised while enabling the 3 feature-map sub-buffers in turn. The first sub-buffer stores the first row of the feature map, the second sub-buffer the second row, and the third sub-buffer the third row; the fourth row wraps around and is stored in the first sub-buffer, and so on. After the feature-map data have been stored in this order, the first clock cycle takes the first values of feature-map rows one, two, and three from the three sub-buffers and feeds them into the prefetch buffer; the second clock cycle takes the first values of rows four, five, and six, and so on. After the rows of the feature map have been traversed, the second, third, ... values are taken in the same order and fed into the prefetch buffer. The prefetch buffer holds 3*18 feature-map values; once storage is complete, 16 feature-map values are output in parallel each clock cycle, fed into the first PE of each PE row, passed on in turn to the adjacent PEs in the row, and finally stored in the PEs' input sub-buffers.
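The row-interleaved storage described above can be sketched as follows (a minimal model; in hardware each sub-buffer is a separate memory so the three rows of a window can be read in parallel):

```python
# Row r of the input feature map goes to sub-buffer r % 3, so a 3x3
# convolution window's three rows live in three different sub-buffers.
def fill_sub_buffers(feature_map, num_buffers=3):
    bufs = [[] for _ in range(num_buffers)]
    for r, row in enumerate(feature_map):
        bufs[r % num_buffers].append(row)
    return bufs
```

For a 6-row feature map, sub-buffer 0 then holds rows 0 and 3, sub-buffer 1 rows 1 and 4, and sub-buffer 2 rows 2 and 5.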
Write input index: after entering this state, the data are stored following the feature-map storage pattern and end up in the PEs' input index areas.
Write convolution kernel: after entering this state, the accelerator waits for the AXI4 bus interface's data-valid signal to be raised while enabling the 16 convolution-kernel sub-buffers in turn. The first sub-buffer stores the convolution-kernel values for the first output channel, the second sub-buffer those for the second output channel, and so on. After the 16 sub-buffers have been filled, each sub-buffer outputs one value per clock cycle; the 16 weight values are fed in parallel into the first PE of each PE column, passed on in turn to the adjacent PEs in the column, and finally cached in the weight buffers inside the PE units.
Write weight index: after entering this state, the data are stored following the convolution-kernel storage pattern and end up in the PEs' weight index areas.
Convolution computation: after entering this state, the PE control unit configures the PE unit's optimization mode according to the mode flag S1S0, reads data from the weight sub-buffer and input sub-buffer according to the weight index values and input index values, and feeds the data into the multiply-accumulate unit. After 3*3*(number of input channels) multiply-accumulate operations have been carried out, all data have been computed, and the next clock cycle jumps to the send-calculation-results state.
Send calculation results: after entering this state, the calculation results are read sequentially from the 16 result sub-buffers. The data for the first output channel are taken from each result sub-buffer, and every four values are spliced into one 64-bit output word and sent to the external memory DDR through the AXI4 bus interface. Once the data of all 16 output channels have been sent to the external memory DDR, the accelerator jumps back to the wait state.
Parameters can be modified through the state controller, supporting run-time changes to the image size, convolution-kernel size, stride, output feature-map size, and number of output channels. By using runtime state and algorithmic structure, redundant computation is skipped; unnecessary calculation and memory accesses are therefore reduced, the efficiency of the convolutional neural network accelerator is improved, and energy consumption is lowered.
The above embodiment merely illustrates the technical idea of the present invention and does not limit its scope of protection; any change made on the basis of the technical solution according to the technical idea provided by the invention falls within the scope of protection of the present invention.
Claims (9)
1. A computation-optimized convolutional neural network accelerator based on FPGA, characterized by comprising an AXI4 bus interface, a data buffer area, a prefetch data area, a result cache area, a state controller, and a PE array;
the data buffer area caches the feature-map data, convolution-kernel data, and index values read from the external memory DDR through the AXI4 bus interface, and comprises M feature-map sub-buffers and C convolution-kernel sub-buffers;
the prefetch data area prefetches, from the feature-map sub-buffers, the feature-map data that must be fed into the PE array in parallel;
the PE array is implemented on the FPGA and comprises R*C PE units, each PE column being assigned one convolution-kernel sub-buffer, with the number of feature-map sub-buffers actually used determined by the parameters of the layer being computed; the PE array reads the data in the prefetch data area and the convolution-kernel sub-buffers and performs the convolution operations, PE units in different columns computing different output feature maps and PEs in different rows computing different rows of the same output feature map;
the result cache area comprises R result sub-buffers, one per PE row, and caches the calculation results of each PE row;
the state controller controls the accelerator's working states and implements the transitions between them.
2. The computation-optimized convolutional neural network accelerator based on FPGA according to claim 1, characterized in that: the PE unit comprises an input buffer, a weight buffer, an input index area, a weight index area, a PE control unit, a pre-activation unit, and a multiply-accumulate unit, wherein the input buffer and weight buffer store the feature-map data and weight data required for the convolution computation, and the input index area and weight index area store the index values used to look up the feature-map data and weight data; the PE control unit controls the PE unit's working state, reads index values from the index areas, reads data from the buffers according to the index values, feeds the data into the multiply-accumulate unit, configures the multiply-accumulate unit's mode, and decides whether to enable the pre-activation unit; the pre-activation unit monitors the partial sum of the convolution computation and, if the partial sum is less than 0, stops the computation and outputs 0; the multiply-accumulate unit performs the convolution computation and can be configured either for normal multiply-accumulate operation or for the optimized mode that exploits repeated weights.
3. The computation-optimized convolutional neural network accelerator based on FPGA according to claim 2, characterized in that: the PE control unit determines whether the multiply-accumulate unit's optimization mode is pre-activation mode or weight-repeat mode, selecting a different optimization mode for each layer; the mode is determined by a two-bit mode flag, where a high bit of 0 means normal multiply-accumulate computation, a high bit of 1 means the optimized mode that exploits repeated weights, a low bit of 0 means no pre-activation, and a low bit of 1 means pre-activation mode.
4. The computation-optimized convolutional neural network accelerator based on FPGA according to claim 2, characterized in that: the weight index area stores multiple weight indices; the weights are written into the weight sub-buffer ordered from positive to negative with zero-valued weights last, and the corresponding input index values and weight index values are written into the index areas in the same order; sorting the weights and index values is done offline; during convolution computation, the weights are read from the weight buffer in sequence according to the weight index values.
5. The computation-optimized convolutional neural network accelerator based on FPGA according to claim 4, characterized in that: the weight index value is a one-bit weight-change flag indicating whether the weight must be replaced; when the flag is 0, the weight is unchanged and is reused for another clock cycle; when the flag is 1, the weight changes, and the next clock cycle reads the next weight, in order, from the weight sub-buffer.
6. The computation-optimized convolutional neural network accelerator based on FPGA according to claim 1, characterized in that: the PE unit supports two calculation-optimization modes, pre-activation mode and weight-repeat mode; pre-activation mode monitors the sign of the partial sum in real time, terminating the computation and directly outputting the ReLU result zero if it becomes negative, and continuing the convolution computation if it stays positive; weight-repeat mode applies to convolution operations sharing the same weight, adding the feature-map data corresponding to the same weight first and then multiplying the sum by that weight, reducing the number of multiplications and the number of accesses to weight data.
7. The computation-optimized convolutional neural network accelerator based on FPGA according to claim 6, characterized in that: in the weight-repeat mode, while the weight-change flag is 0 the input feature-map values are accumulated and the running sum is kept in a register; when the weight-change flag is 1, the accumulation is completed, the accumulated partial sum is fed into the multiplication unit to be multiplied by the weight, and the result is stored in a register.
8. The computation-optimized convolutional neural network accelerator based on FPGA according to claim 1, characterized in that: the state controller consists of 7 states: wait, write feature map, write input index, write convolution kernel, write weight index, convolution computation, and send calculation results; each state sends the corresponding control signals to the corresponding submodules to complete the corresponding function.
9. The computation-optimized convolutional neural network accelerator based on FPGA according to claim 1, characterized in that: the AXI4 bus interface splices multiple data items into one wide word before sending them.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811493592.XA CN109598338B (en) | 2018-12-07 | 2018-12-07 | Convolutional neural network accelerator based on FPGA (field programmable Gate array) for calculation optimization |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109598338A true CN109598338A (en) | 2019-04-09 |
CN109598338B CN109598338B (en) | 2023-05-19 |
Family
ID=65961420
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811493592.XA Active CN109598338B (en) | 2018-12-07 | 2018-12-07 | Convolutional neural network accelerator based on FPGA (field programmable Gate array) for calculation optimization |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109598338B (en) |
Cited By (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110059808A (en) * | 2019-06-24 | 2019-07-26 | 深兰人工智能芯片研究院(江苏)有限公司 | A kind of method for reading data and reading data device of convolutional neural networks |
CN110097174A (en) * | 2019-04-22 | 2019-08-06 | 西安交通大学 | Preferential convolutional neural networks implementation method, system and device are exported based on FPGA and row |
CN110163295A (en) * | 2019-05-29 | 2019-08-23 | 四川智盈科技有限公司 | Image recognition inference acceleration method based on early termination |
CN110222835A (en) * | 2019-05-13 | 2019-09-10 | 西安交通大学 | A kind of convolutional neural networks hardware system and operation method based on zero value detection |
CN110378468A (en) * | 2019-07-08 | 2019-10-25 | 浙江大学 | Neural network accelerator based on structured pruning and low-bit quantization |
CN110390383A (en) * | 2019-06-25 | 2019-10-29 | 东南大学 | A kind of deep neural network hardware accelerator based on power exponent quantization |
CN110390384A (en) * | 2019-06-25 | 2019-10-29 | 东南大学 | A kind of configurable general convolutional neural networks accelerator |
CN110399883A (en) * | 2019-06-28 | 2019-11-01 | 苏州浪潮智能科技有限公司 | Image characteristic extracting method, device, equipment and computer readable storage medium |
CN110414677A (en) * | 2019-07-11 | 2019-11-05 | 东南大学 | In-memory computing circuit suitable for fully-connected binarized neural networks |
CN110673786A (en) * | 2019-09-03 | 2020-01-10 | 浪潮电子信息产业股份有限公司 | Data caching method and device |
CN110705687A (en) * | 2019-09-05 | 2020-01-17 | 北京三快在线科技有限公司 | Convolution neural network hardware computing device and method |
CN110738312A (en) * | 2019-10-15 | 2020-01-31 | 百度在线网络技术(北京)有限公司 | Method, system, device and computer readable storage medium for data processing |
CN110910434A (en) * | 2019-11-05 | 2020-03-24 | 东南大学 | Method for realizing deep learning parallax estimation algorithm based on FPGA (field programmable Gate array) high energy efficiency |
CN111062472A (en) * | 2019-12-11 | 2020-04-24 | 浙江大学 | Sparse neural network accelerator based on structured pruning and acceleration method thereof |
CN111178519A (en) * | 2019-12-27 | 2020-05-19 | 华中科技大学 | Convolutional neural network acceleration engine, convolutional neural network acceleration system and method |
CN111340198A (en) * | 2020-03-26 | 2020-06-26 | 上海大学 | Neural network accelerator with highly-multiplexed data based on FPGA (field programmable Gate array) |
CN111414994A (en) * | 2020-03-03 | 2020-07-14 | 哈尔滨工业大学 | FPGA-based Yolov3 network computing acceleration system and acceleration method thereof |
CN111416743A (en) * | 2020-03-19 | 2020-07-14 | 华中科技大学 | Convolutional network accelerator, configuration method and computer readable storage medium |
CN111898733A (en) * | 2020-07-02 | 2020-11-06 | 西安交通大学 | Deep separable convolutional neural network accelerator architecture |
CN111984548A (en) * | 2020-07-22 | 2020-11-24 | 深圳云天励飞技术有限公司 | Neural network computing device |
CN112149814A (en) * | 2020-09-23 | 2020-12-29 | 哈尔滨理工大学 | Convolutional neural network acceleration system based on FPGA |
CN112187954A (en) * | 2020-10-15 | 2021-01-05 | 中国电子科技集团公司第五十四研究所 | Flow control method of offline file in measurement and control data link transmission |
WO2021031154A1 (en) * | 2019-08-21 | 2021-02-25 | 深圳市大疆创新科技有限公司 | Method and device for loading feature map of neural network |
CN112580793A (en) * | 2020-12-24 | 2021-03-30 | 清华大学 | Neural network accelerator based on time domain memory computing and acceleration method |
CN112668708A (en) * | 2020-12-28 | 2021-04-16 | 中国电子科技集团公司第五十二研究所 | Convolution operation device for improving data utilization rate |
CN113094118A (en) * | 2021-04-26 | 2021-07-09 | 深圳思谋信息科技有限公司 | Data processing system, method, apparatus, computer device and storage medium |
CN113095471A (en) * | 2020-01-09 | 2021-07-09 | 北京君正集成电路股份有限公司 | Method for improving efficiency of detection model |
CN113111995A (en) * | 2020-01-09 | 2021-07-13 | 北京君正集成电路股份有限公司 | Method for shortening model reasoning and model post-processing operation time |
CN113780529A (en) * | 2021-09-08 | 2021-12-10 | 北京航空航天大学杭州创新研究院 | FPGA-oriented sparse convolution neural network multi-level storage computing system |
CN113869494A (en) * | 2021-09-28 | 2021-12-31 | 天津大学 | Neural network convolution FPGA embedded hardware accelerator based on high-level synthesis |
WO2022134688A1 (en) * | 2020-12-25 | 2022-06-30 | 中科寒武纪科技股份有限公司 | Data processing circuit, data processing method, and related products |
CN114780910A (en) * | 2022-06-16 | 2022-07-22 | 千芯半导体科技(北京)有限公司 | Hardware system and calculation method for sparse convolution calculation |
CN115311536A (en) * | 2022-10-11 | 2022-11-08 | 绍兴埃瓦科技有限公司 | Sparse convolution processing method and device in image processing |
CN116187408A (en) * | 2023-04-23 | 2023-05-30 | 成都甄识科技有限公司 | Sparse acceleration unit, calculation method and sparse neural network hardware acceleration system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100076915A1 (en) * | 2008-09-25 | 2010-03-25 | Microsoft Corporation | Field-Programmable Gate Array Based Accelerator System |
US20180032859A1 (en) * | 2016-07-27 | 2018-02-01 | Samsung Electronics Co., Ltd. | Accelerator in convolutional neural network and method for operating the same |
CN108241890A (en) * | 2018-01-29 | 2018-07-03 | 清华大学 | Reconfigurable neural network acceleration method and architecture |
CN108537334A (en) * | 2018-04-26 | 2018-09-14 | 济南浪潮高新科技投资发展有限公司 | A kind of acceleration array design methodology for CNN convolutional layer operations |
CN108665059A (en) * | 2018-05-22 | 2018-10-16 | 中国科学技术大学苏州研究院 | Convolutional neural networks acceleration system based on field programmable gate array |
CN108805272A (en) * | 2018-05-03 | 2018-11-13 | 东南大学 | A kind of general convolutional neural networks accelerator based on FPGA |
2018
- 2018-12-07 CN CN201811493592.XA patent/CN109598338B/en active Active
Cited By (51)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110097174A (en) * | 2019-04-22 | 2019-08-06 | 西安交通大学 | Convolutional neural network implementation method, system and device based on FPGA and row-output priority |
CN110097174B (en) * | 2019-04-22 | 2021-04-20 | 西安交通大学 | Method, system and device for realizing convolutional neural network based on FPGA and row output priority |
CN110222835A (en) * | 2019-05-13 | 2019-09-10 | 西安交通大学 | A kind of convolutional neural networks hardware system and operation method based on zero value detection |
CN110163295A (en) * | 2019-05-29 | 2019-08-23 | 四川智盈科技有限公司 | Image recognition inference acceleration method based on early termination |
CN110059808A (en) * | 2019-06-24 | 2019-07-26 | 深兰人工智能芯片研究院(江苏)有限公司 | Data reading method and data reading device for a convolutional neural network |
CN110390383B (en) * | 2019-06-25 | 2021-04-06 | 东南大学 | Deep neural network hardware accelerator based on power exponent quantization |
WO2020258527A1 (en) * | 2019-06-25 | 2020-12-30 | 东南大学 | Deep neural network hardware accelerator based on power exponent quantisation |
WO2020258528A1 (en) * | 2019-06-25 | 2020-12-30 | 东南大学 | Configurable universal convolutional neural network accelerator |
WO2020258841A1 (en) * | 2019-06-25 | 2020-12-30 | 东南大学 | Deep neural network hardware accelerator based on power exponent quantisation |
CN110390384A (en) * | 2019-06-25 | 2019-10-29 | 东南大学 | A kind of configurable general convolutional neural networks accelerator |
CN110390384B (en) * | 2019-06-25 | 2021-07-06 | 东南大学 | Configurable general convolutional neural network accelerator |
CN110390383A (en) * | 2019-06-25 | 2019-10-29 | 东南大学 | A kind of deep neural network hardware accelerator based on power exponent quantization |
CN110399883A (en) * | 2019-06-28 | 2019-11-01 | 苏州浪潮智能科技有限公司 | Image characteristic extracting method, device, equipment and computer readable storage medium |
CN110378468A (en) * | 2019-07-08 | 2019-10-25 | 浙江大学 | Neural network accelerator based on structured pruning and low-bit quantization |
WO2021004366A1 (en) * | 2019-07-08 | 2021-01-14 | 浙江大学 | Neural network accelerator based on structured pruning and low-bit quantization, and method |
CN110414677A (en) * | 2019-07-11 | 2019-11-05 | 东南大学 | In-memory computing circuit suitable for fully-connected binarized neural networks |
WO2021031154A1 (en) * | 2019-08-21 | 2021-02-25 | 深圳市大疆创新科技有限公司 | Method and device for loading feature map of neural network |
CN110673786A (en) * | 2019-09-03 | 2020-01-10 | 浪潮电子信息产业股份有限公司 | Data caching method and device |
CN110673786B (en) * | 2019-09-03 | 2020-11-10 | 浪潮电子信息产业股份有限公司 | Data caching method and device |
CN110705687A (en) * | 2019-09-05 | 2020-01-17 | 北京三快在线科技有限公司 | Convolution neural network hardware computing device and method |
CN110738312A (en) * | 2019-10-15 | 2020-01-31 | 百度在线网络技术(北京)有限公司 | Method, system, device and computer readable storage medium for data processing |
CN110910434A (en) * | 2019-11-05 | 2020-03-24 | 东南大学 | Method for realizing deep learning parallax estimation algorithm based on FPGA (field programmable Gate array) high energy efficiency |
CN110910434B (en) * | 2019-11-05 | 2023-05-12 | 东南大学 | Method for realizing deep learning parallax estimation algorithm based on FPGA (field programmable Gate array) high energy efficiency |
CN111062472A (en) * | 2019-12-11 | 2020-04-24 | 浙江大学 | Sparse neural network accelerator based on structured pruning and acceleration method thereof |
CN111178519B (en) * | 2019-12-27 | 2022-08-02 | 华中科技大学 | Convolutional neural network acceleration engine, convolutional neural network acceleration system and method |
CN111178519A (en) * | 2019-12-27 | 2020-05-19 | 华中科技大学 | Convolutional neural network acceleration engine, convolutional neural network acceleration system and method |
CN113095471A (en) * | 2020-01-09 | 2021-07-09 | 北京君正集成电路股份有限公司 | Method for improving efficiency of detection model |
CN113095471B (en) * | 2020-01-09 | 2024-05-07 | 北京君正集成电路股份有限公司 | Method for improving efficiency of detection model |
CN113111995A (en) * | 2020-01-09 | 2021-07-13 | 北京君正集成电路股份有限公司 | Method for shortening model reasoning and model post-processing operation time |
CN111414994A (en) * | 2020-03-03 | 2020-07-14 | 哈尔滨工业大学 | FPGA-based Yolov3 network computing acceleration system and acceleration method thereof |
CN111416743A (en) * | 2020-03-19 | 2020-07-14 | 华中科技大学 | Convolutional network accelerator, configuration method and computer readable storage medium |
CN111340198B (en) * | 2020-03-26 | 2023-05-05 | 上海大学 | Neural network accelerator for data high multiplexing based on FPGA |
CN111340198A (en) * | 2020-03-26 | 2020-06-26 | 上海大学 | Neural network accelerator with highly-multiplexed data based on FPGA (field programmable Gate array) |
CN111898733B (en) * | 2020-07-02 | 2022-10-25 | 西安交通大学 | Deep separable convolutional neural network accelerator architecture |
CN111898733A (en) * | 2020-07-02 | 2020-11-06 | 西安交通大学 | Deep separable convolutional neural network accelerator architecture |
CN111984548A (en) * | 2020-07-22 | 2020-11-24 | 深圳云天励飞技术有限公司 | Neural network computing device |
CN111984548B (en) * | 2020-07-22 | 2024-04-02 | 深圳云天励飞技术股份有限公司 | Neural network computing device |
CN112149814A (en) * | 2020-09-23 | 2020-12-29 | 哈尔滨理工大学 | Convolutional neural network acceleration system based on FPGA |
CN112187954A (en) * | 2020-10-15 | 2021-01-05 | 中国电子科技集团公司第五十四研究所 | Flow control method of offline file in measurement and control data link transmission |
CN112580793B (en) * | 2020-12-24 | 2022-08-12 | 清华大学 | Neural network accelerator based on time domain memory computing and acceleration method |
CN112580793A (en) * | 2020-12-24 | 2021-03-30 | 清华大学 | Neural network accelerator based on time domain memory computing and acceleration method |
WO2022134688A1 (en) * | 2020-12-25 | 2022-06-30 | 中科寒武纪科技股份有限公司 | Data processing circuit, data processing method, and related products |
CN112668708B (en) * | 2020-12-28 | 2022-10-14 | 中国电子科技集团公司第五十二研究所 | Convolution operation device for improving data utilization rate |
CN112668708A (en) * | 2020-12-28 | 2021-04-16 | 中国电子科技集团公司第五十二研究所 | Convolution operation device for improving data utilization rate |
CN113094118A (en) * | 2021-04-26 | 2021-07-09 | 深圳思谋信息科技有限公司 | Data processing system, method, apparatus, computer device and storage medium |
CN113780529B (en) * | 2021-09-08 | 2023-09-12 | 北京航空航天大学杭州创新研究院 | FPGA-oriented sparse convolutional neural network multi-stage storage computing system |
CN113780529A (en) * | 2021-09-08 | 2021-12-10 | 北京航空航天大学杭州创新研究院 | FPGA-oriented sparse convolution neural network multi-level storage computing system |
CN113869494A (en) * | 2021-09-28 | 2021-12-31 | 天津大学 | Neural network convolution FPGA embedded hardware accelerator based on high-level synthesis |
CN114780910A (en) * | 2022-06-16 | 2022-07-22 | 千芯半导体科技(北京)有限公司 | Hardware system and calculation method for sparse convolution calculation |
CN115311536A (en) * | 2022-10-11 | 2022-11-08 | 绍兴埃瓦科技有限公司 | Sparse convolution processing method and device in image processing |
CN116187408A (en) * | 2023-04-23 | 2023-05-30 | 成都甄识科技有限公司 | Sparse acceleration unit, calculation method and sparse neural network hardware acceleration system |
Also Published As
Publication number | Publication date |
---|---|
CN109598338B (en) | 2023-05-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109598338A (en) | A kind of convolutional neural networks accelerator of the calculation optimization based on FPGA | |
Lu et al. | An efficient hardware accelerator for sparse convolutional neural networks on FPGAs | |
CN111242289B (en) | Convolutional neural network acceleration system and method with expandable scale | |
CN107169563B (en) | Processing system and method applied to two-value weight convolutional network | |
CN110390383A (en) | A kind of deep neural network hardware accelerator based on power exponent quantization | |
CN108805272A (en) | A kind of general convolutional neural networks accelerator based on FPGA | |
CN110390384A (en) | A kind of configurable general convolutional neural networks accelerator | |
CN108665059A (en) | Convolutional neural networks acceleration system based on field programmable gate array | |
CN107437110A (en) | The piecemeal convolution optimization method and device of convolutional neural networks | |
CN108197705A (en) | Convolutional neural networks hardware accelerator and convolutional calculation method and storage medium | |
CN103699360B (en) | Vector processor and method for vector data access and interaction | |
CN108537331A (en) | Reconfigurable convolutional neural network acceleration circuit based on asynchronous logic | |
CN112487750B (en) | Convolution acceleration computing system and method based on in-memory computing | |
CN109325591A (en) | Neural network processor for Winograd convolution | |
CN110163355A (en) | A kind of computing device and method | |
Liu et al. | FPGA-NHAP: A general FPGA-based neuromorphic hardware acceleration platform with high speed and low power | |
Zhang et al. | Implementation and optimization of the accelerator based on FPGA hardware for LSTM network | |
Duan et al. | Energy-efficient architecture for FPGA-based deep convolutional neural networks with binary weights | |
Liu et al. | CASSANN-v2: A high-performance CNN accelerator architecture with on-chip memory self-adaptive tuning | |
CN109240644A (en) | Local search method and circuit for an Ising chip | |
Tao et al. | Hima: A fast and scalable history-based memory access engine for differentiable neural computer | |
Zhou et al. | Mat: Processing in-memory acceleration for long-sequence attention | |
CN116822600A (en) | Neural network search chip based on RISC-V architecture | |
Feng et al. | An Efficient Model-Compressed EEGNet Accelerator for Generalized Brain-Computer Interfaces With Near Sensor Intelligence | |
CN103365821A (en) | Address generator of heterogeneous multi-core processor |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |