CN106228238B - Method and system for accelerating deep learning algorithms on a field programmable gate array platform - Google Patents
- Publication number: CN106228238B
- Application number: CN201610596159.3A
- Authority: CN (China)
- Prior art keywords: data, hardware, module, dma, programmable gate
- Legal status: Active (the listed status is an assumption, not a legal conclusion)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- G06N3/08—Learning methods
Abstract
The invention discloses a method for accelerating deep learning algorithms on a field programmable gate array platform. The platform comprises a general-purpose processor, a field programmable gate array (FPGA) and a memory module. The method comprises the following steps: according to the deep learning prediction process and training process, and combining deep neural networks and convolutional neural networks, determine the general-purpose computation parts suitable for running on the field programmable gate array platform; according to the identified general-purpose computation parts, determine the software-hardware co-computation scheme; and according to the logic resources and bandwidth of the FPGA, determine the number and types of IP cores to be solidified, so that the hardware computation units perform the acceleration on the field programmable gate array platform. Hardware processing elements for accelerating deep learning algorithms can thus be designed quickly according to the available hardware resources, and the processing elements offer higher performance and lower power consumption than a general-purpose processor.
Description
Technical field
The present invention relates to the field of computer hardware acceleration, and more particularly to a method and system for accelerating deep learning algorithms on a field programmable gate array platform.
Background technique
Deep learning has achieved remarkable results on high-level abstract cognitive problems and has lifted machine learning to a new level. It has not only great scientific value but also strong practicality, and it has attracted wide attention from both academia and industry. However, in order to solve more abstract and more complex learning problems, the network scale of deep learning keeps growing, and the complexity of the computation and data grows sharply with it; for example, the Google Cat network has about one billion neurons. Accelerating deep learning algorithms with high performance and low energy consumption has therefore become a research hotspot for scientific and commercial institutions.
Computing tasks generally appear in two forms: on a general-purpose processor, a task is usually presented as software code and is called a software task; on a dedicated hardware circuit, the inherent speed of hardware replaces the software task, and it is called a hardware task. Common hardware-acceleration technologies include the application-specific integrated circuit (ASIC, Application Specific Integrated Circuit), the field programmable gate array (FPGA, Field Programmable Gate Array) and the graphics processing unit (GPU, Graphics Processing Unit). An ASIC is an integrated circuit chip developed for a special purpose, characterized by high performance, low power consumption and small area. Compared with an FPGA, an ASIC usually runs faster, consumes less power, and is cheaper in volume production. Although an FPGA uses more transistors than an ASIC for the same function, FPGA logic design is simpler and its design cycle is much shorter than that of an ASIC. In addition, the mask cost of producing an ASIC is very high, and it grows exponentially as the feature size shrinks; the FPGA, as a programmable general-purpose component applicable to different functions, avoids such high development costs and retains a degree of flexibility. The GPU suits massively parallel computation on large amounts of data, with high bandwidth, high clock frequency and high concurrency, and the CUDA (Compute Unified Device Architecture) general-purpose parallel computing framework lets developers design high-performance solutions conveniently and quickly. However, the power consumption of the GPU is high: a single GPU often consumes more power than a mainstream CPU, and typically tens or even hundreds of times the energy of an FPGA.
Summary of the invention
In view of this, the object of the present invention is to provide a method and system for accelerating deep learning algorithms on a field programmable gate array platform, so that hardware processing elements for accelerating deep learning algorithms can be designed quickly according to the available hardware resources, the processing elements offering higher performance and lower power consumption than a general-purpose processor.
The technical scheme of the present invention is as follows:
Accelerate the method for deep learning algorithm on a kind of field programmable gate array platform, which is characterized in that scene can compile
Journey gate array platform includes general processor, field programmable gate array and memory module, comprising the following steps:
S01: predicting process and training process according to deep learning, and combine deep neural network and convolutional neural networks,
Determine the general-purpose computations part for being suitable for running on field programmable gate array platform;
S02: according to the general-purpose computations part of confirmation, software-hardware synergism calculation is determined;
S03: it according to calculating logic resource, the bandwidth situation of FPGA, determines the cured value volume and range of product of IP kernel, utilizes hardware
Arithmetic element is accelerated on programmable gate array platform at the scene.
In a preferred technical scheme, the general-purpose computation parts include a forward computation module, for matrix multiplication and activation function computation, and a weight update module, for vector computation.
In a preferred technical scheme, step S02 comprises the following steps:
preparing the data on the software side;
converting the convolution computation of the convolutional layers of the convolutional neural network into matrix multiplication;
using direct memory access (DMA) as the data path of the software-hardware co-computation.
In a preferred technical scheme, determining the number and types of IP cores to be solidified in step S03 comprises: determining the types of computation units to be solidified on the FPGA according to the hardware tasks to be executed; and determining the number of processing elements for those hardware tasks according to the logic resources and bandwidth of the FPGA.
In a preferred technical scheme, the forward computation module is designed with tiling: each row of the node matrix is split into tiles of a given tile size, and each column of the weight parameter matrix is split into tiles of the same size; each tile of a node-matrix row is dot-multiplied with the corresponding tile of a weight-matrix column, and after a whole row has been computed, the intermediate values are accumulated to obtain the final result.
In a preferred technical scheme, the tile size is a power of two and is consistent with the parallel granularity of the computation unit.
The present invention further discloses an FPGA structure for accelerating deep learning algorithms, characterized by comprising:
a tiling structure, which splits the node data matrix and the weight parameter matrix of the forward computation module into tiles and time-multiplexes the hardware logic;
a piecewise-linear activation structure, for producing arbitrary activation functions;
a parameter configuration module, for configuring the parameters of the processing elements;
a forward computation module, comprising a forward computation hardware structure in which a single DMA caches the weights and a forward computation hardware structure in which two DMAs read in parallel, used for the forward computation of deep neural networks, the forward computation of the convolutional and classification layers of convolutional neural networks, and matrix multiplication, pipelined and optimized for maximum throughput;
a weight update module, for vector computation.
In a preferred technical scheme, the parameter configuration module configures the processing elements by transferring configuration parameter data over DMA, including: the working mode configuration and data scale configuration of the forward computation module, the data scale configuration comprising the node data scale, input neuron scale and output neuron scale; and the data scale, working mode and computation parameter configuration of the weight update module.
In a preferred technical scheme, the forward computation hardware structure with a single weight-caching DMA comprises:
a single DMA, responsible for reading data and writing results back;
a pair of register buffers that alternately read data and feed the parallel computation, and a BRAM group that caches the data and guarantees parallel reads;
a floating-point multiplier array of the same size as the tile;
a binary adder tree whose input width equals the tile size;
a cyclic accumulator whose intermediate values are stored in on-chip BRAM;
an activation function computation module that realizes the activation function by piecewise linear approximation, with the coefficients buffered in on-chip BRAM.
The forward computation hardware structure with two parallel DMAs comprises:
a neuron data read module with a DMA and FIFO buffer, responsible for reading the input neuron node data;
a weight parameter read module with a DMA and FIFO buffer, responsible for reading the weight parameter data;
a floating-point multiplier array of the same size as the tile;
a binary adder tree whose input width equals the tile size;
a cyclic accumulator whose intermediate values are stored in on-chip BRAM;
an activation function computation module that realizes the activation function by piecewise linear approximation, with the coefficients buffered in on-chip BRAM.
In a preferred technical scheme, the weight update module is used for weight update computation and output-layer error computation, pipelined and optimized for maximum throughput, and comprises: a vector A read module and a vector B read module, each with a DMA and FIFO buffer, which read the two groups of vector values to be computed; a computation module, which performs the vector computation selected by the configuration information; and a result write-back module with a DMA and FIFO buffer, which writes the computed results back to host memory.
Compared with the prior art, the present invention has the following advantages: it can effectively accelerate deep learning algorithms, including the prediction process and the training process; hardware processing elements for accelerating deep learning algorithms can be designed quickly according to the available hardware resources; and the processing elements offer higher performance and lower power consumption than a general-purpose processor.
Description of the drawings
The invention will be further described below with reference to the accompanying drawings and embodiments:
Fig. 1 is a flow chart of the method for accelerating deep learning on the field programmable gate array platform of the embodiment of the present invention;
Fig. 2 is a schematic diagram of the computation of a convolutional layer in a convolutional neural network;
Fig. 3 is a schematic diagram of how the forward computation hardware processing element on the field programmable gate array platform of the embodiment converts the convolutional layer computation;
Fig. 4 is a schematic diagram of how the weight update processing element on the field programmable gate array platform of the embodiment converts a data matrix into vectors;
Fig. 5 is a schematic structural diagram of the software-hardware co-computation on the field programmable gate array platform of the embodiment;
Fig. 6 is a schematic diagram of the hardware processing element resource usage, the field programmable gate array platform resources, and the number and types of solidified IP cores of the embodiment;
Fig. 7 is a schematic diagram of the data tiling of the forward computation processing element of the embodiment;
Fig. 8 is a schematic diagram of realizing the activation function by piecewise linear approximation in the embodiment;
Fig. 9 is a structural diagram of the forward computation hardware processing element in which a single DMA pre-stores the weight matrix in the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 10 is a structural diagram of the accumulation processing in the forward computation hardware processing element in the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 11 is a structural diagram of the piecewise approximation of the sigmoid function in the forward computation hardware processing element in the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 12 is a data processing flow chart of the forward computation hardware processing element in which a single DMA pre-stores the weight matrix in the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 13 is a structural diagram of the forward computation hardware processing element with two DMAs reading data in parallel in the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 14 is a data processing flow chart of the forward computation hardware processing element with two DMAs reading data in parallel in the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 15 is a structural diagram of the weight update hardware processing element in the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 16 is a data processing flow chart of the weight update hardware processing element in the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 17 is a schematic diagram of a possible application scenario and framework of the deep learning accelerator in the heterogeneous multi-core reconfigurable computing platform of the embodiment.
Specific embodiment
The above scheme is described further below in conjunction with specific embodiments. It should be understood that these embodiments are intended to illustrate the present invention, not to limit its scope. The implementation conditions used in the embodiments may be further adjusted according to specific circumstances; conditions that are not specified are the usual conditions of routine practice.
Embodiment:
The field programmable gate array platform in the embodiment of the present invention refers to a computing system whose chip integrates both a general purpose processor (General Purpose Processor, "GPP") and a field programmable gate array (Field Programmable Gate Array, "FPGA"), where the data path between the FPGA and the GPP may use the PCI-E bus protocol, the AXI bus protocol, etc. The drawings of the embodiment illustrate the data path with the AXI bus protocol, but the present invention is not limited thereto.
Fig. 1 is a flow chart of the method 100 by which the field programmable gate array platform of the embodiment accelerates a deep learning algorithm. The method 100 comprises:
S110: according to the deep learning prediction process and training process, where the training process includes a local pre-training process and a global training process, and combining deep neural networks and convolutional neural networks, determining the general-purpose computation parts suitable for running on the field programmable gate array platform;
S120: according to the identified general-purpose hardware computation modules, determining the software-hardware co-computation scheme;
S130: according to the logic resources and bandwidth of the field programmable gate array, determining the number and types of IP cores to be solidified.
The method by which the embodiment accelerates the general-purpose computation parts of deep learning is described in detail below in conjunction with Fig. 2 to Fig. 4.
Fig. 2 is a schematic diagram of convolutional layer computation. Suppose there are 4 input feature maps and the convolution kernel size is 3x3; the results of the 4 convolutions are accumulated and summed, and then processed by the activation function to obtain the value of the output feature map. Viewed as a whole, the basic computation pattern of a convolutional layer is similar to that of a deep neural network hidden layer: by adjusting the ordering of the convolution kernel parameters, the convolution computation can be turned into a dot-product computation. The specific adjustment is as follows: 1) fill each input feature map, row by row from top to bottom, into one row, as shown on the left of Fig. 3; 2) rotate each convolution kernel by 180 degrees, then write it row by row, from top to bottom, into a column of the weight matrix, as shown in the middle column of Fig. 3 — the original kernels a through d, after the 180-degree rotation, become a9~a1, b9~b1, ..., d9~d1, filled in order into one column. Thus, for the convolutional layer prediction process, the basic computation can be converted into the same pattern as a deep neural network hidden layer, namely matrix multiplication plus activation function processing, at the extra cost of the data conversion.
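The equivalence described above can be checked with a minimal Python sketch. The values and helper names below are illustrative, not from the patent; the sketch verifies, for a single output position, that direct 2-D convolution equals the dot product of the row-flattened input patch with the row-flattened 180-degree-rotated kernel.

```python
def conv_at(patch, kernel):
    """Direct 2-D convolution of a KxK patch with a KxK kernel (one output value)."""
    k = len(kernel)
    return sum(patch[i][j] * kernel[k - 1 - i][k - 1 - j]
               for i in range(k) for j in range(k))

def flatten_row_major(m):
    # Fill the matrix row by row, from top to bottom, into one row (Fig. 3, left).
    return [v for row in m for v in row]

def rotate_180(m):
    # Rotate the kernel by 180 degrees (reverse rows, then each row).
    return [row[::-1] for row in m[::-1]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

patch  = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]

direct    = conv_at(patch, kernel)
as_matmul = dot(flatten_row_major(patch), flatten_row_major(rotate_180(kernel)))
assert direct == as_matmul   # convolution reduced to a dot product
```

With multiple input feature maps, the flattened maps are concatenated into one longer row and the rotated kernels into one longer column, and the accumulation over maps falls out of the same dot product.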
In the deep learning training process, besides a large amount of matrix multiplication, a large amount of vector computation is also needed, and for the vector computation the matrix data must be converted into vector data: as shown in Fig. 4, each row of the data is formed, in order, into a vector on which the vector computation is performed.
Therefore, combining Fig. 2 to Fig. 4, this embodiment reduces the general-purpose computation parts of the deep learning prediction and training processes to matrix multiplication, activation function computation, and a large amount of vector computation.
Fig. 5 is the structural framework 200 of the software-hardware co-computation used in this embodiment. The structure comprises:
a Processing System (PS) 210, including the CPU and memory, serving as the control end of the whole system. The CPU acts as the host: it runs the software-side code and offloads the tasks to be accelerated to the PL side. In addition, the CPU controls the working state of each IP core on the PL side (an intellectual property core; here each represents one hardware computation unit) and reads its data, etc.;
Programmable Logic (PL) 220, the FPGA chip that is the hardware-acceleration component of the whole system. IP cores can be solidified on the FPGA chip according to the different acceleration tasks to accelerate the algorithm. The PS side schedules, according to the specific algorithm, which IP cores compute in parallel, and host-side software tasks can also run in parallel with FPGA-side hardware tasks;
a data bus (Data Bus) 230, responsible for data transfer between the PS side and the PL side;
a control signal bus (Control Bus) 240, responsible for control-signal transfer between the PS side and the PL side.
Fig. 6 is the overall accelerator structure 2000 based on the FPGA design. The structure comprises:
a system controller 2100, responsible for controlling the execution state, data transfer and program scheduling of each hardware computation unit, and for running the non-general-purpose computation parts of deep learning, the data initialization, and the initialization of the hardware computation units (IP cores);
a memory 2200, responsible for storing the deep learning network parameters and the original input data; the physical addresses of the data stored here must be contiguous, so that the DMA can transfer the data conveniently;
a data bus protocol 2300: the AXI-Stream protocol allows unrestricted data burst transfers and is a high-performance data transfer protocol;
a control bus protocol 2400: AXI-Lite is a lightweight, address-mapped single-transfer protocol, suitable for transferring the control signals of the hardware computation units;
a data interconnect 2500, interconnecting the data paths;
a control interconnect 2600, interconnecting the control signal lines;
direct memory access (DMA) 2700, responsible for the data transfer between the accelerator and the memory; each hardware processing element is equipped with one DMA to read data in parallel;
PEs (Processing Elements) 2800, the computation units of the accelerator; each can internally solidify one forward computation unit, one weight update unit, or both. Since the FPGA is programmable and reconfigurable, the number of PEs can be configured dynamically according to the resources and bandwidth of the specific FPGA chip, so that, without changing the hardware design of the computation unit, the hardware computation resources are fully used and the hardware delivers its peak performance.
The method by which the embodiment of the present invention accelerates deep learning algorithms has been described in detail above in conjunction with Fig. 1 to Fig. 6; the hardware structures of the embodiment are introduced below.
Fig. 7 shows the design of the forward computation unit with a tiled computation pattern. Suppose the tile size is 16: each row of the node matrix is split into tiles of 16 values, and each column of the weight parameter matrix is split into tiles of 16 elements. Each group of 16 values of a node-matrix row is dot-multiplied with the corresponding 16 values of a weight-matrix column, and after a whole row has been computed, the intermediate values are accumulated to obtain the final result. This method exploits data locality, reduces the resources needed to solidify parallel execution units, and reduces the data bandwidth required by the hardware, allowing a single computation unit to perform matrix multiplication of arbitrary scale.
To maintain high throughput, the tile size should match the internal design of the computation unit and stay consistent with its parallel granularity. In matrix multiplication the tile size can be set to a power of two, to take full advantage of the binary-tree accumulation. Since the tile size is tied to the parallel granularity, in theory the larger the tile, the higher the parallelism and the better the performance of the computation unit, so the largest 2^n that the hardware resources and bandwidth permit is chosen as the tile size.
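The tiled dot-product scheme can be sketched in Python as follows. This is an illustrative software model, not the hardware design itself: `TILE` is 4 here for brevity (the patent's example uses 16), per-tile products are summed by a pairwise binary-tree reduction mirroring the adder tree, and inputs whose length is not a multiple of the tile size are zero-padded, as the text later describes.

```python
TILE = 4  # tile size, a power of two matching the parallel granularity

def tree_sum(vals):
    """Pairwise (binary-tree) reduction, mirroring the adder-tree hardware."""
    while len(vals) > 1:
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]

def tiled_dot(row, col):
    """Dot product of a node-matrix row with a weight-matrix column, by tiles."""
    pad = (-len(row)) % TILE          # zero-pad the last, partial tile
    row = row + [0.0] * pad
    col = col + [0.0] * pad
    acc = 0.0
    for t in range(0, len(row), TILE):
        products = [row[t + i] * col[t + i] for i in range(TILE)]
        acc += tree_sum(products)     # accumulate the per-tile partial sums
    return acc

row = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # length 6: the last tile is padded
col = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
assert tiled_dot(row, col) == sum(r * c for r, c in zip(row, col))
```

Because only one tile of each operand is live at a time, the model reflects how a fixed-width multiplier array and adder tree can handle matrices of arbitrary scale.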
Fig. 8 is a schematic diagram of the hardware implementation of the activation function in this embodiment. The embodiment uses piecewise linear approximation to realize S-shaped activation functions: the function is divided along the X-axis into several equal intervals, and within each interval it is linearly approximated as Y = a_i*X + b_i, X ∈ [x_i, x_(i+1)), where x_(i+1) - x_i is the approximation interval size. Whenever the activation function must be computed, the interval containing the X value is found first, the offsets of its a_i and b_i relative to the base address are computed, and after a multiply-add the approximate Y value is obtained. This implementation has two benefits: 1) any S-shaped activation function or linear function can be realized without changing any hardware design — only the stored values of the coefficients a and b need to be replaced; 2) the error is very small: as the approximation interval shrinks, the error becomes negligible, at the cost only of more BRAM to store the coefficients a and b. Moreover, deep learning itself does not demand very high data accuracy; in other words, a certain degree of precision loss does not affect the result.
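A small Python model of this table-driven scheme is given below. The interval size and range are assumptions for illustration; the lists `a` and `b` stand in for the on-chip BRAM coefficient buffers, and swapping their contents changes the activation function without changing the lookup logic, as described above.

```python
import math

X_MIN, X_MAX, STEP = -8.0, 8.0, 0.125   # assumed range and interval size

def build_tables(f):
    """Precompute per-interval coefficients a_i, b_i for Y = a_i*X + b_i."""
    a, b = [], []
    x = X_MIN
    while x < X_MAX - 1e-9:
        y0, y1 = f(x), f(x + STEP)
        slope = (y1 - y0) / STEP
        a.append(slope)
        b.append(y0 - slope * x)        # chosen so a_i*x + b_i == f(x) at x
        x += STEP
    return a, b

def pwl(x, a, b):
    """Piecewise-linear lookup: find the interval, then one multiply-add."""
    x = min(max(x, X_MIN), X_MAX - 1e-9)     # clamp outside the table range
    i = int((x - X_MIN) / STEP)              # offset into the coefficient tables
    return a[i] * x + b[i]

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
a, b = build_tables(sigmoid)
for x in (-3.0, -0.4, 0.0, 1.7, 5.0):
    assert abs(pwl(x, a, b) - sigmoid(x)) < 1e-3
```

With a 0.125 interval the worst-case interpolation error for the sigmoid is on the order of 1e-4, consistent with the claim that the error can be made negligible by shrinking the interval.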
Fig. 9 is the schematic block diagram 3000 of the hardware structure in which a single DMA pre-stores the weight matrix on the field programmable gate array platform of the embodiment. This structure suits the case where the BRAM resources inside the FPGA are sufficient to cache the weight matrix data in advance, so that the forward computation reads from on-chip BRAM. The structure comprises:
a data read module 3100, with a DMA and FIFO buffer and a data width of 32 bits, responsible for reading the weight parameters into the on-chip BRAM buffer and for reading the neuron node data;
on-chip BRAM 3200, caching the weight parameter data. Taking a tile size of 16 as an example, the rows of the weight matrix are stored cyclically across 16 different BRAMs, i.e. i%16 plus the BRAM base address is the addressing scheme, which guarantees that the 16 parallel multiplications read their data from different BRAMs in parallel;
a pair of register buffers 3300, each consisting of 16 registers that store input neuron data, alternating between buffering data and feeding the parallel computation. Note that the time needed to fill a buffer must be lower than the time needed to compute on its data; this guarantees that the buffer reads are hidden by the computation time and that the result is correct;
parallel floating-point multiplication 3400, which multiplies the weight parameter data and the neuron data in parallel. The floating-point computation is realized with DSPs; after pipeline optimization, 16 floating-point multiplications are processed in parallel per clock cycle, the tile size here being 16. Since the number of input neurons is not necessarily divisible by 16, the last tile of a dot product may hold fewer than 16 values, in which case the computation unit fills the missing part with zeros before the parallel multiplication;
a binary floating-point adder tree 3500, which accumulates the floating-point results produced by structure 3400; the binary adder tree computes in parallel and eliminates the read-write dependence of accumulation, reducing the time complexity of accumulation from O(n) to O(log n);
accumulation 3600: since the forward computation processing element computes on tiles, the results produced by the adder tree 3500 must be further accumulated, in a cyclic fashion with a period equal to the number of output neurons;
activation function computation 3700, realizing the activation function by piecewise linear approximation, with the coefficients buffered in on-chip BRAM;
a data write-back module 3800, with a DMA and FIFO buffer and a data width of 32 bits, responsible for writing the computed results back to host memory.
This hardware structure supports parameter configuration and can support neural network computation at different scales. The configuration parameters are:
Data_size: the scale of the input neuron data;
Input_size: the number of input neurons; since the weight matrix data is cached in advance, it must be less than the maximum input neuron count Max_input whose weight parameters the on-chip BRAM can cache;
Output_size: the number of output neurons; since the weight matrix data is cached in advance, it must be less than the maximum output neuron count Max_output whose weight parameters the on-chip BRAM can cache;
Work_mode: 0 means matrix multiplication only; 1 means matrix multiplication plus activation function computation.
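The constraints these parameters imply can be sketched as a small validation helper. Everything here is hypothetical: `validate_config` and the limit constants are illustrative names, and the actual Max_input/Max_output values depend on the BRAM capacity of the specific chip.

```python
# Assumed, chip-dependent BRAM-derived limits (placeholders, not from the patent).
MAX_INPUT, MAX_OUTPUT = 1024, 1024

def validate_config(data_size, input_size, output_size, work_mode):
    """Check a configuration against the limits described in the text."""
    assert 0 < input_size <= MAX_INPUT, "input neurons exceed cacheable weights"
    assert 0 < output_size <= MAX_OUTPUT, "output neurons exceed cacheable weights"
    assert work_mode in (0, 1), "0 = matmul only, 1 = matmul + activation"
    return {"data_size": data_size, "input_size": input_size,
            "output_size": output_size, "work_mode": work_mode}

cfg = validate_config(data_size=256, input_size=784, output_size=100, work_mode=1)
assert cfg["work_mode"] == 1
```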
Figure 10 is a schematic diagram 3600 of the hardware structure that performs accumulation on the field programmable gate array platform of the embodiment of the present invention. The structure includes:
Floating-point addition 3610: because tiling is used, the intermediate dot-product results must be accumulated. The intermediate data stream is accumulated with a period of N, the number of output neurons (equivalently, the number of columns of the second matrix), and the sums are output in order once accumulation finishes.
Intermediate-value storage BRAM 3620: N storage units are allocated inside the FPGA to hold temporary data; each value in the streaming data is cyclically added into its corresponding BRAM unit, and the ratio of the input neuron count to the tile size determines when accumulation ends. Because the number of intermediate-value storage units cannot be resized dynamically once the FPGA design is fixed, the arithmetic unit is designed with a maximum supported accumulation count MAX; accumulation proceeds normally only when the number of output neurons does not exceed MAX.
This stage is likewise pipeline-optimized, with the initiation interval optimized to one clock cycle, so that intermediate values are consumed at the same rate as they are produced.
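The cyclic accumulation of structure 3600 can be modeled in a few lines. This is a software sketch: partial sum i of the stream lands in slot i mod N, standing in for the N BRAM storage units (the function name is illustrative).

```python
def cyclic_accumulate(partials, n_outputs):
    """Model of structure 3600: add element i of the partial-sum
    stream into slot i % n_outputs, mimicking the N BRAM units that
    hold one running total per output neuron."""
    slots = [0.0] * n_outputs
    for i, value in enumerate(partials):
        slots[i % n_outputs] += value
    return slots
```

For example, with three output neurons and two tiles per dot product, the stream `[1, 2, 3, 10, 20, 30]` yields the totals `[11, 22, 33]`.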
Figure 11 shows the hardware structure diagram 3700 of the piecewise linear approximation used to implement the activation function on the field programmable gate array platform of the embodiment of the present invention.
The activation function is implemented by piecewise linear approximation, as detailed in Figure 11. Unlike Figure 8, a bypass path passing X directly to Y is added, which allows the forward computation unit to execute only the matrix multiplication without applying the activation function; this mainly serves the matrix multiplications used during error computation in the training process. Because an S-shaped activation function is essentially symmetric about some point (the sigmoid function, for example, is symmetric about (0, 0.5)), the value for x < 0 is computed as 1 - f(-x), which lets the hardware logic be reused and reduces resource usage. Moreover, f(8) = 0.999665 and f(x) approaches 1 ever more closely beyond that, so for x > 8 the result is simply assigned the value 1.
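The symmetry trick and saturation cutoff can be shown with a small software model. The segment table below is an assumption for illustration (one linear segment per unit interval on [0, 8], coefficients that the hardware would keep in BRAM); the patent does not specify the segment boundaries.

```python
import math

# Hypothetical segment table: a piecewise-linear fit of sigmoid on
# [0, 8), one segment per unit interval. In hardware these slope and
# intercept coefficients would be cached in on-chip BRAM.
SEGMENTS = []
for k in range(8):
    y0 = 1.0 / (1.0 + math.exp(-float(k)))
    y1 = 1.0 / (1.0 + math.exp(-float(k + 1)))
    slope = y1 - y0                      # interval width is 1
    SEGMENTS.append((slope, y0 - slope * k))  # y = slope*x + intercept

def sigmoid_pwl(x: float) -> float:
    if x < 0.0:                      # exploit symmetry about (0, 0.5):
        return 1.0 - sigmoid_pwl(-x)  # reuse the same logic for x < 0
    if x >= 8.0:                     # saturation: f(8) = 0.999665
        return 1.0
    slope, intercept = SEGMENTS[int(x)]
    return slope * x + intercept
```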
Figure 12 is the computation flow chart of the forward computation hardware unit with a single DMA and pre-stored weight parameters on the field programmable gate array platform of the embodiment of the present invention.
Configuration data is first read from the DMA, and node data is then read according to the configuration information. When reading node data, register group a is filled first and the flag is set to 0; thereafter, the value of flag % 2 selects whether register group a or register group b receives the incoming node data values. Likewise, the value of flag % 2 selects which register group is multiplied in parallel against the weight data cached in BRAM; the products are summed by the binary adder tree and then accumulated. When accumulation completes, the result is either processed by the activation function or output directly, according to the operating mode.
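The ping-pong use of register groups a and b can be sketched as follows. This is a behavioral model only, with illustrative names: `process` stands in for the parallel-multiply plus adder-tree stage, and the real hardware overlaps the refill and the computation in time rather than interleaving them sequentially.

```python
def double_buffered(tiles, process):
    """Sketch of the ping-pong register groups a/b of Figure 12: the
    group selected by flag % 2 is refilled from the DMA stream while
    the other group feeds the multipliers."""
    if not tiles:
        return []
    bufs = [None, None]          # register group a (0) and b (1)
    out = []
    for flag, tile in enumerate(tiles):
        bufs[flag % 2] = tile    # refill a or b alternately
        other = 1 - flag % 2     # group currently being computed on
        if bufs[other] is not None:
            out.append(process(bufs[other]))
    out.append(process(bufs[(len(tiles) - 1) % 2]))  # drain last tile
    return out
```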
Figure 13 is the structural schematic diagram 4000 of the forward computation hardware unit with two parallel DMA reads on the field programmable gate array platform of the embodiment of the present invention. This hardware structure targets high-bandwidth FPGA chips and uses two DMAs reading in parallel to guarantee high throughput. Taking a tile size of 16 as an example, the structure includes:
Neuron data read module 4100: equipped with a DMA and a FIFO buffer, with a data bit width of 512, it is responsible for reading input neuron node data; each 512-bit word is split by shift operations into 16 single-precision (32-bit) floating-point values. Because the transfer bit width is 512, the data must be address-aligned in host memory. Furthermore, when the number of input neurons is not divisible by 16, the neuron node data matrix must be zero-padded at the host: 16 - Input_size % 16 zeros are appended to the end of each row, where Input_size is the number of input neurons; no padding is needed when Input_size % 16 equals 0. Each data word is reused Output_size times, where Output_size is the number of output neurons.
Weight parameter data read module 4200: equipped with a DMA and a FIFO buffer, with a data bit width of 512, it is responsible for reading weight parameter data, again obtaining 16 single-precision (32-bit) floating-point values per word by shift operations. As before, the 512-bit transfer width requires the data to be address-aligned in host memory, and when the number of input neurons is not divisible by 16 the weight parameter matrix must be zero-padded at the host: 16 - Input_size % 16 zeros are appended to the end of each column, with no padding when Input_size % 16 equals 0. After padding, because DMA transfers require contiguous physical addresses, the storage layout of the weight parameter matrix is rearranged to suit DMA transfer.
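The host-side padding rule for rows of the node matrix (and, analogously, columns of the weight matrix) amounts to rounding the length up to a multiple of the 16-float DMA beat; the appended zeros contribute nothing to the dot products. A minimal sketch (the function name is illustrative):

```python
def pad_row(row, tile=16):
    """Zero-pad one row so its length is a multiple of the DMA beat
    (16 floats = 512 bits / 32 bits). Appends 16 - len % 16 zeros,
    or nothing when the length is already a multiple of 16."""
    pad = (-len(row)) % tile
    return row + [0.0] * pad
```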
Parallel floating-point multiplication 4300 multiplies the weight parameter data against the neuron data in parallel; the floating-point computation is implemented with DSP blocks, and after pipeline optimization 16 floating-point multiplications are processed in parallel every clock cycle.
Binary floating-point adder tree 4400 accumulates the floating-point products produced by structure 4300. Performing the reduction in parallel with a binary adder tree eliminates the read-write dependency of sequential accumulation and reduces the time complexity of the reduction from O(n) to O(log n).
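The tree reduction can be modeled directly: the values are summed pairwise, level by level, so a width-16 input needs only four addition levels instead of a 15-step sequential chain. A software sketch (illustrative name):

```python
def adder_tree(values):
    """Pairwise (binary-tree) reduction of the products from the
    parallel multipliers: log2(n) levels of additions instead of an
    O(n) sequential accumulation chain."""
    n = len(values)
    assert n and (n & (n - 1)) == 0, "width must be a power of two"
    level = list(values)
    while len(level) > 1:          # one tree level per iteration
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]
```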
Accumulation 4500: because the forward computation unit processes data in tiles, the results of the binary floating-point adder tree 4400 must be accumulated, again cyclically with a period equal to the number of output neurons. This structure is identical to structure 3600 and is not described further.
Activation function calculation 4600 implements the activation function by piecewise linear approximation, with coefficients cached in on-chip BRAM. This structure is identical to structure 3700 and is not described further.
Data write-back module 4700: equipped with a DMA and a FIFO buffer, with a data bit width of 32, it is responsible for writing the computation results back to host memory.
The hardware structure supports parameter configuration and can handle neural network computations of different scales. The configuration parameters are:
Data_size: the number of input neuron data samples;
Input_size: the number of input neurons;
Output_size: the number of output neurons;
Work_mode: 0 means only the matrix multiplication is performed; 1 means matrix multiplication followed by the activation function computation.
Figure 14 is the computation flow chart of the forward computation hardware unit with two parallel DMA reads on the field programmable gate array platform of the embodiment of the present invention.
Configuration information is first read from the node DMA, setting the sizes of the node data and weight data and the operating mode for the arithmetic unit. Then 512-bit words are read from the node DMA and the weight DMA respectively, and parallel shifting yields 16 neuron node values and 16 weight parameter values. Because the accelerator reuses node data, one node word is read every Output_size clock cycles, while one weight word is read every clock cycle. After the data is read, 16 parallel multiplications and a 16-input binary adder tree summation are performed in turn. The partial sums are cyclically accumulated into their designated BRAM storage locations, with a check for the end of accumulation. Once accumulation completes, the operating mode selects either direct output or activation function processing by piecewise approximation.
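The whole data flow of Figure 14 can be condensed into one software model: slice the (pre-padded) input into 16-wide beats, multiply each beat against the matching slice of every output neuron's weight column, tree-sum, and accumulate per output neuron. Names and the list-of-columns weight layout are assumptions for illustration.

```python
def forward_tiled(nodes, weights, tile=16, activate=None):
    """Software model of the tiled forward pass: 'weights' is a list of
    per-output-neuron weight vectors; 'nodes' is assumed pre-padded to
    a multiple of 'tile' (appended zeros do not perturb the result).
    Each node beat is reused once per output neuron, mirroring the
    Output_size-fold data reuse described in the text."""
    n_out = len(weights)
    acc = [0.0] * n_out                   # the per-output BRAM slots
    for base in range(0, len(nodes), tile):
        x = nodes[base:base + tile]       # one 16-float DMA beat
        for j in range(n_out):
            w = weights[j][base:base + tile]
            acc[j] += sum(p * q for p, q in zip(x, w))  # adder tree
    return [activate(v) for v in acc] if activate else acc
```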
Figure 15 is the hardware structural schematic diagram 5000 of the weight update hardware unit on the field programmable gate array platform of the embodiment of the present invention. Two DMAs read in parallel to guarantee high throughput for the vector computation. The structure includes:
Vector A data read module 5100: equipped with a DMA and a FIFO buffer, with a bit width of 32; it is also responsible for reading the configuration parameters.
Vector B data read module 5200: equipped with a DMA and a FIFO buffer, with a bit width of 32.
Computation module 5300 performs the vector computation selected by the configuration information: when the operating mode is 0 it computes a*A + b*B; when the operating mode is 1 it computes (a*A + b*B)*B*(1-B). Here a and b are configuration parameters, and A and B are the two vectors read in.
Result write-back module 5400: equipped with a DMA and a FIFO buffer, with a bit width of 32; it writes the computation results back to host memory.
The hardware structure supports parameter configuration and can handle vector computations of different scales. The configuration parameters are:
Data_size: the length of the input vector data;
a: coefficient value required by the computation;
b: coefficient value required by the computation;
Work_mode: 0 means compute a*A + b*B; 1 means compute (a*A + b*B)*B*(1-B).
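The two operating modes of the weight update unit are plain element-wise vector operations; the B*(1-B) factor in mode 1 is the sigmoid derivative used when propagating errors. A minimal sketch (the function name is illustrative):

```python
def weight_update(A, B, a, b, work_mode):
    """The two vector operations of structure 5000: mode 0 computes
    a*A + b*B element-wise; mode 1 additionally multiplies each
    element by B*(1-B), the sigmoid derivative factor."""
    out = [a * x + b * y for x, y in zip(A, B)]
    if work_mode == 1:
        out = [v * y * (1.0 - y) for v, y in zip(out, B)]
    return out
```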
Figure 16 is the computation flow chart of the weight update hardware unit on the field programmable gate array platform of the embodiment of the present invention.
Configuration information is first read from DMA A; then, according to Data_size, the vector values are read from DMA A and DMA B respectively, multiplied in parallel by the configuration parameters a and b, and summed. Finally, the operating mode determines whether the sum is further multiplied by B*(1-B), and the result is written back to host memory through DMA A.
Figure 17 is a schematic diagram of one possible application scenario and framework of the deep learning accelerator within the heterogeneous multi-core reconfigurable computing platform of the embodiment of the present invention.
The composition of the application system shown here is illustrative, and the present invention is not limited to it. When a user sends an application request to the system, the application system control node dispatches the request through the scheduler to the corresponding compute node. The compute node then offloads the acceleration task to the FPGA according to the specific application request.
The overall framework of each compute node consists of a hardware layer, a driver layer, a library layer, a service layer, and an application layer. The hardware layer consists of the FPGA, memory, and the host-side CPU; the CPU acts as the system controller, managing the operating state and data reads of each hardware processing element inside the FPGA (labeled DL Module in the figure), including the forward computation units and the weight update units. The weight parameter data and neuron data required by the computation are stored only in memory and are transferred between memory and the hardware processing elements by DMA. The driver layer is the hardware driver written for the hardware platform and operating system; the library layer is the application programming interface (API) encapsulated on top of the driver; the service layer provides the deep-learning-related computation acceleration services requested by users; and the application layer refers to concrete applications of deep learning prediction and training algorithms, such as image classification with a convolutional neural network prediction algorithm.
Those of ordinary skill in the art will appreciate that the methods and hardware structures described in connection with the embodiments disclosed herein can be realized with a combination of FPGA and CPU. The number and type of IP cores solidified inside a specific FPGA depend on the concrete application and the resource constraints of the FPGA chip. A skilled practitioner may use different implementations or different degrees of parallelism to realize the described functions for each particular application or FPGA chip, but such implementations should not be considered beyond the scope of the present invention.
In the several embodiments provided in this application, it should be understood that the disclosed methods and hardware structures may be realized in other ways. For example, describing the deep learning applications as deep neural networks and convolutional neural networks is illustrative; the tile size and degree of parallelism in the forward computation unit are likewise illustrative and can be adjusted to the specific situation; and using the AXI bus protocol for data transfer between the field programmable gate array and the general-purpose processor is also illustrative.
The foregoing examples merely illustrate the technical concept and features of the invention; their purpose is to allow those familiar with the art to understand the content of the invention and implement it accordingly, and they are not intended to limit the scope of protection of the invention. All equivalent transformations or modifications made according to the spirit and essence of the present invention shall be covered by the scope of protection of the present invention.
Claims (9)
1. A method for accelerating a deep learning algorithm on a field programmable gate array platform, the field programmable gate array platform comprising a general-purpose processor, a field programmable gate array, and a memory module, characterized in that the method comprises the following steps:
S01: according to the deep learning prediction process and training process, and with reference to deep neural networks and convolutional neural networks, determining the general-purpose computation parts suitable for running on the field programmable gate array platform;
S02: according to the identified general-purpose computation parts, determining a software-hardware co-computation scheme;
S03: according to the computational logic resources and bandwidth of the FPGA, determining the number and type of IP cores to solidify, and performing acceleration on the field programmable gate array platform with the hardware computation units;
wherein the general-purpose computation parts include a forward computation module, and the forward computation module includes a forward computation hardware structure with a single DMA caching the weights and a forward computation hardware structure with two parallel DMA reads; the forward computation hardware structure with a single DMA caching the weights includes:
a single DMA responsible for reading and writing back data;
a pair of register buffers that alternately read data and perform parallel computation; a BRAM group that caches data and guarantees parallel reads;
floating-point multipliers equal in number to the tile size;
a binary adder tree with as many inputs as the tile size;
a cyclic accumulator whose intermediate sums are kept in on-chip BRAM;
an activation function computation module that implements the activation function by piecewise linear approximation, with coefficients cached in on-chip BRAM;
the forward computation hardware structure with two parallel DMA reads includes:
a neuron data read module equipped with a DMA and a FIFO buffer, responsible for reading input neuron node data;
a weight parameter data read module equipped with a DMA and a FIFO buffer, responsible for reading weight parameter data;
floating-point multipliers equal in number to the tile size;
a binary adder tree with as many inputs as the tile size;
a cyclic accumulator whose intermediate sums are kept in on-chip BRAM;
an activation function computation module that implements the activation function by piecewise linear approximation, with coefficients cached in on-chip BRAM.
2. The method for accelerating a deep learning algorithm on a field programmable gate array platform according to claim 1, characterized in that the forward computation module is used for matrix multiplication and activation function computation, and the weight update module is used for vector computation.
3. The method for accelerating a deep learning algorithm on a field programmable gate array platform according to claim 1, characterized in that step S02 comprises the following steps:
performing data preprocessing on the software side;
converting the convolutional-layer convolution computation of the convolutional neural network into matrix multiplication;
using direct memory access as the data path for the software-hardware co-computation.
4. The method for accelerating a deep learning algorithm on a field programmable gate array platform according to claim 1, characterized in that determining the number and type of IP cores to solidify in step S03 comprises: determining, according to the hardware tasks to be processed, the types of arithmetic units solidified on the FPGA; and determining, according to the FPGA's hardware logic resources and bandwidth, the number of processing units for the hardware tasks to be processed.
5. The method for accelerating a deep learning algorithm on a field programmable gate array platform according to claim 2, characterized in that the forward computation module is designed with tiling: each row of the node matrix is divided into tiles of the tile size, and each column of the weight parameter matrix is divided into tiles of the same size; each tile of a row of the node matrix is dot-multiplied with the corresponding tile of a column of the weight parameter matrix, and after a whole row has been processed the intermediate values are accumulated to obtain the final result.
6. The method for accelerating a deep learning algorithm on a field programmable gate array platform according to claim 5, characterized in that the tile size is a power of 2 and is consistent with the degree of parallelism of the arithmetic unit.
7. An FPGA structure for accelerating a deep learning algorithm, characterized by comprising:
a tiling processing structure, which divides the node data matrix and the weight parameter matrix of the forward computation module into tiles and time-multiplexes the hardware logic;
an activation function linear approximation structure, for generating arbitrary activation functions;
a parameter configuration module, for configuring the parameters of the processing units;
a forward computation module, including a forward computation hardware structure with a single DMA caching the weights and a forward computation hardware structure with two parallel DMA reads, used for the forward computation of deep neural networks, the forward computation of the convolutional and classification layers of convolutional neural networks, and matrix multiplication, and pipeline-optimized for maximum throughput;
the forward computation hardware structure with a single DMA caching the weights includes:
a single DMA responsible for reading and writing back data;
a pair of register buffers that alternately read data and perform parallel computation; a BRAM group that caches data and guarantees parallel reads;
floating-point multipliers equal in number to the tile size;
a binary adder tree with as many inputs as the tile size;
a cyclic accumulator whose intermediate sums are kept in on-chip BRAM;
an activation function computation module that implements the activation function by piecewise linear approximation, with coefficients cached in on-chip BRAM;
the forward computation hardware structure with two parallel DMA reads includes:
a neuron data read module equipped with a DMA and a FIFO buffer, responsible for reading input neuron node data;
a weight parameter data read module equipped with a DMA and a FIFO buffer, responsible for reading weight parameter data;
floating-point multipliers equal in number to the tile size;
a binary adder tree with as many inputs as the tile size;
a cyclic accumulator whose intermediate sums are kept in on-chip BRAM;
an activation function computation module that implements the activation function by piecewise linear approximation, with coefficients cached in on-chip BRAM;
and a weight update module, used for vector computation.
8. The FPGA structure for accelerating a deep learning algorithm according to claim 7, characterized in that the parameter configuration module configures the processing units by transferring configuration parameter data over DMA, including: the operating mode configuration and data scale configuration of the forward computation module, where the data scale configuration includes the node data scale, input neuron scale, and output neuron scale; and the data scale, operating mode, and computation parameter configuration of the weight update module.
9. The FPGA structure for accelerating a deep learning algorithm according to claim 7, characterized in that the weight update module is used for weight updating and for computing the output-layer error values, and is pipeline-optimized for maximum throughput, comprising: a vector A data read module and a vector B data read module, each equipped with a DMA and a FIFO buffer, which respectively read the two groups of vector values used in the computation; a computation module that performs the vector computation selected by the configuration information; and a result write-back module equipped with a DMA and a FIFO buffer, which writes the computation results back to host memory.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610596159.3A CN106228238B (en) | 2016-07-27 | 2016-07-27 | Accelerate the method and system of deep learning algorithm on field programmable gate array platform |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106228238A CN106228238A (en) | 2016-12-14 |
CN106228238B true CN106228238B (en) | 2019-03-22 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104112053A (en) * | 2014-07-29 | 2014-10-22 | 中国航天科工集团第三研究院第八三五七研究所 | Design method for an image-processing-oriented reconfigurable architecture platform |
CN104915322A (en) * | 2015-06-09 | 2015-09-16 | 中国人民解放军国防科学技术大学 | Method for accelerating convolutional neural network hardware and AXI bus IP core thereof |
CN105162475A (en) * | 2015-08-19 | 2015-12-16 | 中国人民解放军海军工程大学 | FPGA (Field Programmable Gate Array) based parameterized multi-standard decoder with high throughput rate |
CN105447285A (en) * | 2016-01-20 | 2016-03-30 | 杭州菲数科技有限公司 | Method for improving OpenCL hardware execution efficiency |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4366652B2 (en) * | 2004-04-23 | 2009-11-18 | 横河電機株式会社 | Transmitter and duplexing method thereof |
US20140289445A1 (en) * | 2013-03-22 | 2014-09-25 | Antony Savich | Hardware accelerator system and method |
Application Events

- 2016-07-27: Application CN201610596159.3A filed (CN); granted as patent CN106228238B, status Active
Non-Patent Citations (2)
Title |
---|
Qi Yu et al., "A Deep Learning Prediction Process Accelerator Based FPGA," IEEE, Dec. 2015, pp. 1159-1162, Sections III-V |
Tianshi Chen et al., "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning," ACM, Mar. 5, 2014, pp. 269-283, Abstract and Sections 2-3 |
Also Published As
Publication number | Publication date |
---|---|
CN106228238A (en) | 2016-12-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106228238B (en) | Method and system for accelerating deep learning algorithms on a field programmable gate array platform | |
JP7329533B2 (en) | Method and accelerator apparatus for accelerating operations | |
KR102175044B1 (en) | Apparatus and method for running artificial neural network reverse training | |
JP7358382B2 (en) | Accelerators and systems for accelerating calculations | |
US10902315B2 (en) | Device for implementing artificial neural network with separate computation units | |
US10282659B2 (en) | Device for implementing artificial neural network with multiple instruction units | |
KR101959376B1 (en) | Systems and methods for a multi-core optimized recurrent neural network | |
EP3298547B1 (en) | Batch processing in a neural network processor | |
US20190065958A1 (en) | Apparatus and Methods for Training in Fully Connected Layers of Convolutional Networks | |
KR102203746B1 (en) | Apparatus and method for executing forward computation of artificial neural network | |
JP7078758B2 (en) | Improving machine learning models to improve locality | |
Kästner et al. | Hardware/software codesign for convolutional neural networks exploiting dynamic partial reconfiguration on PYNQ | |
AU2016203619A1 (en) | Layer-based operations scheduling to optimise memory for CNN applications | |
CN112840356A (en) | Operation accelerator, processing method and related equipment | |
CN103870335B (en) | System and method for efficient resource management of signal flow programmed digital signal processor code | |
Stevens et al. | Manna: An accelerator for memory-augmented neural networks | |
CN110414672B (en) | Convolution operation method, device and system | |
CN110377874B (en) | Convolution operation method and system | |
CN113655986B9 (en) | FFT convolution algorithm parallel implementation method and system based on NUMA affinity | |
CN111506520B (en) | Address generation method, related device and storage medium | |
Diamantopoulos et al. | A system-level transprecision FPGA accelerator for BLSTM using on-chip memory reshaping | |
Abdelrazek et al. | A novel architecture using NVIDIA CUDA to speed up simulation of multi-path fast fading channels | |
CN114298329A (en) | Model training method, device, equipment and storage medium | |
Que | Reconfigurable acceleration of recurrent neural networks | |
JP2023006509A (en) | Software generation device and software generation method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | C06 | Publication | |
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | GR01 | Patent grant | |