CN106228238B - Method and system for accelerating deep learning algorithms on a field programmable gate array platform - Google Patents

Method and system for accelerating deep learning algorithms on a field programmable gate array platform

Info

Publication number
CN106228238B
CN106228238B (Application CN201610596159.3A)
Authority
CN
China
Prior art keywords
data
hardware
module
dma
programmable gate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610596159.3A
Other languages
Chinese (zh)
Other versions
CN106228238A (en)
Inventor
周学海
王超
余奇
周徐达
赵洋洋
李曦
陈香兰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Institute for Advanced Study USTC
Original Assignee
Suzhou Institute for Advanced Study USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Institute for Advanced Study USTC filed Critical Suzhou Institute for Advanced Study USTC
Priority to CN201610596159.3A priority Critical patent/CN106228238B/en
Publication of CN106228238A publication Critical patent/CN106228238A/en
Application granted granted Critical
Publication of CN106228238B publication Critical patent/CN106228238B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a method for accelerating deep learning algorithms on a field programmable gate array (FPGA) platform, where the platform comprises a general-purpose processor, a field programmable gate array and a memory module. The method includes the following steps: analysing the deep learning prediction process and training process, together with deep neural networks and convolutional neural networks, to determine the general-purpose computation parts suitable for execution on the FPGA platform; determining the software-hardware co-computation scheme according to the identified general-purpose computation parts; and determining the number and types of IP cores to be solidified according to the logic resources and bandwidth of the FPGA, then performing the acceleration on the FPGA platform with the resulting hardware arithmetic units. Hardware processing units for accelerating deep learning algorithms can be designed quickly according to the available hardware resources, and these processing units offer high performance and low power consumption compared with a general-purpose processor.

Description

Method and system for accelerating deep learning algorithms on a field programmable gate array platform
Technical field
The present invention relates to the field of computer hardware acceleration, and in particular to a method and system for accelerating deep learning algorithms on a field programmable gate array platform.
Background art
Deep learning has achieved remarkable results on high-level abstract cognitive problems and has taken machine learning to a new stage. It has great scientific value as well as strong practical utility, and has attracted wide attention from both academia and industry. However, in order to solve more abstract and more complex learning problems, the network scale of deep learning keeps growing, and the complexity of its computation and data grows with it; the Google Cat network, for example, has around one billion neurons. Accelerating deep learning algorithms with high performance and low energy consumption has therefore become a research hotspot for both scientific and commercial institutions.
Computing tasks are usually divided into two kinds by the way they are expressed: on a general-purpose processor, a task is usually presented in the form of software code and is called a software task; on a dedicated hardware circuit, the inherent speed of hardware is exploited to replace the software task, which is then called a hardware task. Common hardware acceleration technologies include the application-specific integrated circuit (ASIC), the field programmable gate array (FPGA) and the graphics processing unit (GPU). An ASIC is an integrated circuit chip designed and developed for a specific purpose; it offers high performance, low power consumption and small area. Compared with an FPGA, an ASIC usually runs faster, consumes less power, and is cheaper in volume production. Although an FPGA uses more transistors than an ASIC for the same function, FPGA logic design is simpler and its design cycle is much shorter than that of an ASIC. In addition, the mask cost of ASIC production is very high and grows exponentially as the feature size shrinks; the FPGA, as a programmable standard component adaptable to different functions, does not incur such high development costs and retains a degree of flexibility. The GPU is suited to massively parallel computation over large data sets, with high bandwidth, high clock frequency and high concurrency, and with the CUDA (Compute Unified Device Architecture) general-purpose parallel computing framework developers can design high-performance solutions conveniently and quickly. However, GPU power consumption is high: a single GPU often consumes more power than a contemporary mainstream CPU, and typically tens or even hundreds of times the energy of an FPGA.
Summary of the invention
In view of this, the object of the present invention is to provide a method and system for accelerating deep learning algorithms on a field programmable gate array platform, which can quickly design hardware processing units for accelerating deep learning algorithms according to the available hardware resources; compared with a general-purpose processor, these processing units offer high performance and low power consumption.
The technical solution of the present invention is as follows:
A method for accelerating deep learning algorithms on a field programmable gate array platform, wherein the field programmable gate array platform comprises a general-purpose processor, a field programmable gate array and a memory module, comprising the following steps:
S01: analysing the deep learning prediction process and training process, together with deep neural networks and convolutional neural networks, to determine the general-purpose computation parts suitable for execution on the field programmable gate array platform;
S02: determining the software-hardware co-computation scheme according to the identified general-purpose computation parts;
S03: determining the number and types of IP cores to be solidified according to the logic resources and bandwidth of the FPGA, and performing the acceleration on the field programmable gate array platform with the hardware arithmetic units.
In a preferred technical solution, the general-purpose computation parts include a forward computation module, used for matrix multiplication and activation function computation, and a weight update module, used for vector computation.
In a preferred technical solution, step S02 comprises the following steps:
performing data preprocessing on the software side;
converting the convolution computation of the convolutional layers of the convolutional neural network into matrix multiplication;
using direct memory access as the data path of the software-hardware co-computation.
In a preferred technical solution, determining the number and types of IP cores to be solidified in step S03 comprises: determining the types of the arithmetic units to be solidified on the FPGA according to the hardware tasks to be executed; and determining the number of processing units for those hardware tasks according to the hardware logic resources and bandwidth of the FPGA.
In a preferred technical solution, the forward computation module adopts a tiled design: each row of the node matrix is split into tiles of the tile size, and each column of the weight parameter matrix is split into tiles of the tile size; each tile of a node-matrix row is dot-multiplied with the corresponding tile of a weight-matrix column, and after a whole row has been processed the intermediate values are accumulated to obtain the final result.
In a preferred technical solution, the tile size is a power of 2 and is consistent with the parallel granularity of the arithmetic unit.
The present invention further discloses an FPGA structure for accelerating deep learning algorithms, comprising:
a tiling structure, which splits the node data matrix and the weight parameter matrix of the forward computation module into tiles and time-multiplexes the hardware logic;
an activation function piecewise-linear approximation structure, for generating arbitrary activation functions;
a parameter configuration module, for configuring the parameters of the processing units;
a forward computation module, comprising a forward computation hardware structure with a single DMA caching the weights and a forward computation hardware structure with two DMAs reading in parallel, used for the forward computation of deep neural networks, the forward computation of the convolutional and classification layers of convolutional neural networks, and matrix multiplication, pipelined and optimised for maximum throughput;
a weight update module, used for vector computation.
In a preferred technical solution, the parameter configuration module configures the processing units by transferring configuration parameter data over DMA, including: the working mode configuration and data scale configuration of the forward computation module, where the data scale configuration includes the node data scale configuration, the input neuron scale configuration and the output neuron scale configuration; and the data scale configuration, working mode configuration and computation parameter configuration of the weight update module.
In a preferred technical solution, the forward computation hardware structure with a single DMA caching the weights includes:
a single DMA responsible for reading data and writing results back;
a pair of register buffers that alternately read data and perform parallel computation, and a BRAM group that caches data and guarantees parallel reads;
floating-point multipliers equal in number to the tile size;
a binary adder tree whose number of inputs equals the tile size;
a cyclic accumulator, whose accumulated intermediate values are stored in on-chip BRAM;
an activation function computation module that implements the activation function by piecewise linear approximation, with the coefficients cached in on-chip BRAM.
The forward computation hardware structure with two DMAs reading in parallel includes:
a neuron data read module with a DMA and a FIFO buffer, responsible for reading the input neuron node data;
a weight parameter data read module with a DMA and a FIFO buffer, responsible for reading the weight parameter data;
floating-point multipliers equal in number to the tile size;
a binary adder tree whose number of inputs equals the tile size;
a cyclic accumulator, whose accumulated intermediate values are stored in on-chip BRAM;
an activation function computation module that implements the activation function by piecewise linear approximation, with the coefficients cached in on-chip BRAM.
In a preferred technical solution, the weight update module is used for the weight update computation and the computation of the output layer error values, pipelined and optimised for maximum throughput, and comprises: a vector A data read module and a vector B data read module, each with a DMA and a FIFO buffer, which read the two groups of vector values used in the computation; a computation module, which performs the corresponding vector computation according to the configuration information; and a result write-back module with a DMA and a FIFO buffer, which writes the computation results back to host memory.
Compared with the prior art, the present invention has the following advantages:
The present invention can effectively accelerate deep learning algorithms, including the learning prediction process and the training process; hardware processing units for accelerating deep learning algorithms can be designed quickly according to the available hardware resources, and these processing units offer high performance and low power consumption compared with a general-purpose processor.
Description of the drawings
The invention will be further described with reference to the accompanying drawings and embodiments:
Fig. 1 is a flowchart of the method for accelerating deep learning on the field programmable gate array platform of an embodiment of the present invention;
Fig. 2 is a schematic diagram of the computation of a convolutional layer in a convolutional neural network;
Fig. 3 is a schematic diagram of how the forward computation hardware processing unit on the field programmable gate array platform of the embodiment converts the convolutional layer computation;
Fig. 4 is a schematic diagram of how the weight update processing unit on the field programmable gate array platform of the embodiment converts a data matrix into vectors;
Fig. 5 is a schematic structural diagram of the software-hardware co-computation on the field programmable gate array platform of the embodiment;
Fig. 6 is a schematic diagram of the hardware processing unit resource usage, the resources and usage of the field programmable gate array platform, and the number and types of solidified units in the embodiment;
Fig. 7 is a schematic diagram of the data tiling in the forward computation processing unit of the embodiment;
Fig. 8 is a schematic diagram of the piecewise linear implementation of the activation function in the embodiment;
Fig. 9 is a schematic structural diagram of the forward computation hardware processing unit with a single DMA and a pre-stored weight matrix in the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 10 is a schematic structural diagram of the accumulation processing in the forward computation hardware processing unit in the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 11 is a schematic structural diagram of the piecewise approximation of the sigmoid function in the forward computation hardware processing unit in the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 12 is a data processing flowchart of the forward computation hardware processing unit with a single DMA and a pre-stored weight matrix in the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 13 is a schematic structural diagram of the forward computation hardware processing unit with two DMAs reading data in parallel in the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 14 is a data processing flowchart of the forward computation hardware processing unit with two DMAs reading data in parallel in the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 15 is a schematic structural diagram of the weight update hardware processing unit in the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 16 is a data processing flowchart of the weight update hardware processing unit in the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 17 is a schematic diagram of one possible application scenario and framework of the deep learning accelerator in the heterogeneous multi-core reconfigurable computing platform of the embodiment.
Specific embodiments
The above solution is further described below in conjunction with specific embodiments. It should be understood that these embodiments are intended to illustrate the present invention and not to limit its scope. The implementation conditions used in the embodiments can be further adjusted according to the conditions of a specific manufacturer; implementation conditions that are not specified are usually those of routine experiments.
Embodiment:
The field programmable gate array platform in the embodiment of the present invention refers to a computing system that integrates both a general purpose processor (GPP) and a field programmable gate array (FPGA) chip, in which the data path between the FPGA and the GPP may use the PCI-E bus protocol, the AXI bus protocol, etc. The data paths in the accompanying drawings are illustrated with the AXI bus protocol, but the present invention is not limited to this.
Fig. 1 is a flowchart of the method 100 by which the field programmable gate array platform of the embodiment of the present invention accelerates deep learning algorithms. The method 100 includes:
S110: analysing the deep learning prediction process and training process, where the training process includes a local pre-training process and a global training process, together with deep neural networks and convolutional neural networks, to determine the general-purpose computation parts suitable for execution on the field programmable gate array platform;
S120: determining the software-hardware co-computation scheme according to the identified common hardware computation modules;
S130: determining the number and types of IP cores to be solidified according to the logic resources and bandwidth of the field programmable gate array.
The method by which the embodiment of the present invention accelerates the general-purpose computation parts of deep learning is described in detail below in conjunction with Fig. 2 to Fig. 4.
Fig. 2 is a schematic diagram of the convolutional layer computation. Suppose the number of input feature maps is 4 and the convolution kernel size is 3x3; the results of the 4 convolutions are accumulated and then passed through the activation function to obtain the value of the output feature map. Seen from the overall structure of the computation, the basic computation pattern of a convolutional layer is similar to that of a hidden layer of a deep neural network: by rearranging the order of the convolution kernel parameters, the convolution computation used here can be turned into a dot-product computation. The specific rearrangement is as follows: 1) the input feature maps are filled row by row, from top to bottom, into a single row, as shown in the left row of Fig. 3; 2) each convolution kernel matrix is rotated counter-clockwise by 180 degrees and then written row by row, from top to bottom, into one column of the weight matrix, as shown in the middle column of Fig. 3: after the original convolution kernels a to d are each rotated by 180 degrees they become a9~a1, b9~b1, ..., d9~d1 and are filled in order into one column. Thus, for the convolutional layer prediction process, the basic computation can be converted into the same pattern as a hidden layer of a deep neural network, namely matrix multiplication followed by activation function processing, at the extra cost of the data conversion.
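As an illustration of this rearrangement, the following sketch (assuming, for simplicity, one 3x3 patch per input feature map, i.e. a single output position) flattens the patches into one row and the 180-degree-rotated kernels into one weight column, so that the convolution reduces to a dot product:

```python
import numpy as np

def conv_patch_as_dot(patches, kernels):
    """patches: one 3x3 input patch per feature map; kernels: one 3x3 kernel per map.
    Returns one output value, the sum over all maps of the convolution of patch and
    kernel, computed as a single dot product."""
    # fill the input patches row by row into a single row vector (left row of Fig. 3)
    row = np.concatenate([p.ravel() for p in patches])
    # rotate each kernel by 180 degrees and stack them into one weight column (middle of Fig. 3)
    col = np.concatenate([np.rot90(k, 2).ravel() for k in kernels])
    return float(np.dot(row, col))   # the dot product replaces the convolution

# tiny check against the direct definition, assuming 4 input feature maps as in Fig. 2
rng = np.random.default_rng(0)
patches = [rng.standard_normal((3, 3)) for _ in range(4)]
kernels = [rng.standard_normal((3, 3)) for _ in range(4)]
direct = sum((p * np.rot90(k, 2)).sum() for p, k in zip(patches, kernels))
assert np.isclose(conv_patch_as_dot(patches, kernels), direct)
```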
In the deep learning training process, besides a large amount of matrix multiplication, a large amount of vector computation is also required; when performing the vector computation, the matrix data needs to be converted into vector data. As shown in Fig. 4, each row of the data is concatenated in order to form one vector and the vector computation is then performed.
Therefore, in conjunction with Fig. 2 to Fig. 4, this embodiment reduces the general-purpose computation parts of the deep learning prediction process and training process to matrix multiplication, activation function computation and a large amount of vector computation.
Fig. 5 shows the structural framework 200 of the software-hardware co-computation used in this embodiment. The structure includes:
a Processing System (PS) 210, including the CPU and memory, which serves as the control side of the whole system. The CPU acts as the host, runs the software-side code and offloads the tasks to be accelerated to the PL side. In addition, the CPU controls the working state of each IP core on the PL side (an intellectual property core, here representing a hardware arithmetic unit) and reads its data;
Programmable Logic (PL) 220, the FPGA chip that is the hardware acceleration component of the whole system. IP cores can be solidified on the FPGA chip according to the different acceleration tasks so as to accelerate the algorithms. The PS side selects different IP cores for parallel computation according to the scheduling of the specific algorithm, and the host-side software tasks can also be computed in parallel with the FPGA-side hardware tasks;
a data bus (Data Bus) 230, responsible for the data transfer between the PS side and the PL side of the whole system;
a control signal bus (Control Bus) 240, responsible for the transfer of control signals between the PS side and the PL side of the whole system.
Fig. 6 shows the overall accelerator structure 2000 based on the FPGA design. The structure includes:
a system controller 2100, responsible for controlling the execution state of each hardware arithmetic unit, the data transfers and the program scheduling, as well as for running the non-general-purpose computation parts of deep learning, the data initialisation and the initialisation of the hardware arithmetic units (also called IP cores);
a memory 2200, responsible for storing the deep learning network parameters and the original input data; the physical addresses of the data stored here are required to be contiguous so that the DMA can transfer the data conveniently;
a data bus protocol 2300: the AXI-Stream protocol allows unrestricted data burst transfers and is a high-performance data transfer protocol;
a control bus protocol 2400: AXI-Lite is a lightweight, address-mapped, single-transfer protocol, suitable for transferring the control signals of the hardware arithmetic units;
a data interconnect 2500, the interconnection of the data paths;
a control interconnect 2600, the interconnection of the control signal lines;
a direct memory access DMA 2700, responsible for the data transfer between the accelerator and the memory; each hardware processing unit is equipped with a DMA so that it can read data in parallel;
a PE (Processing Element) 2800, the computing unit of each accelerator, which may internally solidify one forward computation arithmetic unit, one weight update arithmetic unit, or both. Since the FPGA is programmable and reconfigurable, the number of PEs can be configured dynamically according to the resource and bandwidth situation of the specific FPGA chip; in this way the computing resources of the hardware can be fully utilised without changing the hardware design of the arithmetic units, and the hardware can deliver its peak performance.
The method by which the embodiment of the present invention accelerates deep learning algorithms has been described in detail above in conjunction with Fig. 1 to Fig. 6; the hardware structures of the embodiment are introduced below.
Fig. 7 shows the design of the forward computation arithmetic unit using a tiled computation scheme. Suppose the tile size is 16: each row of the node matrix is split into tiles of 16 elements, and each column of the weight parameter matrix is split into tiles of 16 elements. Every 16 values of a node-matrix row are dot-multiplied with the corresponding 16 values of a weight-matrix column, and after a whole row has been processed these intermediate values are accumulated to obtain the final result. This approach not only makes full use of data locality, but also reduces the resources needed to solidify the parallel execution units and the data bandwidth required by the hardware, allowing a single arithmetic unit to perform matrix multiplication of arbitrary scale.
In order to maintain high throughput, the tile size should match the internal design of the arithmetic unit and be consistent with its parallel granularity; in matrix multiplication the tile size can be set to a power of 2, so as to exploit fully the accumulation performance of the binary tree. Since the tile size is related to the parallel granularity, in theory the larger the tile, the higher the parallelism and the better the performance of the arithmetic unit, so the largest power of 2 permitted by the hardware resources and bandwidth is chosen as the tile size of the arithmetic unit.
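The tiled matrix multiplication can be sketched in software as follows (a simplified reference model assuming a tile size of 16 and zero-filling of the last partial tile, mirroring the hardware behaviour described below; it is not a description of the RTL itself):

```python
import numpy as np

TILE = 16  # tile size, matching the parallel granularity of the arithmetic unit

def tiled_matmul(nodes, weights):
    """nodes: (rows, n_in) node matrix; weights: (n_in, n_out) weight parameter matrix.
    Each row and column is processed TILE elements at a time; partial tiles are
    padded with zeros."""
    rows, n_in = nodes.shape
    _, n_out = weights.shape
    out = np.zeros((rows, n_out))
    n_tiles = (n_in + TILE - 1) // TILE
    for r in range(rows):
        for c in range(n_out):
            acc = 0.0
            for t in range(n_tiles):
                a = nodes[r, t * TILE:(t + 1) * TILE]
                b = weights[t * TILE:(t + 1) * TILE, c]
                a = np.pad(a, (0, TILE - len(a)))   # zero-fill the last tile
                b = np.pad(b, (0, TILE - len(b)))
                acc += np.dot(a, b)                 # 16-wide multiply plus adder tree
            out[r, c] = acc                         # accumulated intermediate values
    return out

# quick check against the reference result
x = np.random.rand(3, 37); w = np.random.rand(37, 5)
assert np.allclose(tiled_matmul(x, w), x @ w)
```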
Fig. 8 is a schematic diagram of the hardware implementation of the activation function in this embodiment. This embodiment uses piecewise linear approximation to implement S-shaped activation functions: the function is divided into several equal intervals along the X axis, and within each interval it is approximated linearly as Y = a_i*X + b_i, X ∈ [x_i, x_{i+1}), where x_{i+1} - x_i is the size of the approximation interval. Whenever the activation function needs to be computed, the interval in which the X value falls is found first, the offsets of the corresponding a_i and b_i relative to the base address are computed, and one multiply-add operation then yields the approximate Y value. This implementation has two benefits: 1) any S-shaped activation function or linear function can be realised without changing any hardware design; only the stored values of the coefficients a and b need to be replaced; 2) the error is very small: as the approximation interval is reduced, the error becomes negligible, and the only cost is the extra BRAM for storing the coefficients a and b. Moreover, deep learning itself does not have very high requirements on data accuracy; in other words, a certain degree of precision loss does not affect the result.
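A software sketch of this table-driven approximation is given below (the approximation range and interval width are illustrative assumptions, not values fixed by the embodiment); note that swapping the coefficient table is all that is needed to change the activation function:

```python
import numpy as np

X_MIN, X_MAX, STEP = -8.0, 8.0, 0.25              # assumed approximation range and interval

def fit_coefficients(f):
    """Fit one (a_i, b_i) pair per interval so that Y = a_i*X + b_i matches f at the
    interval endpoints; replacing the table changes the activation function."""
    xs = np.arange(X_MIN, X_MAX, STEP)
    a = (f(xs + STEP) - f(xs)) / STEP
    b = f(xs) - a * xs
    return a, b

def pwl_eval(x, a, b):
    """Find the interval from X, read a_i and b_i at that offset, do one multiply-add."""
    i = int((np.clip(x, X_MIN, X_MAX - 1e-9) - X_MIN) // STEP)
    return float(a[i] * x + b[i])

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
a, b = fit_coefficients(sigmoid)                  # same hardware, sigmoid coefficients
print(pwl_eval(0.7, a, b), sigmoid(0.7))          # agree to within the interval error
a, b = fit_coefficients(np.tanh)                  # replace the table: now it computes tanh
print(pwl_eval(0.7, a, b), np.tanh(0.7))
```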
Fig. 9 is a schematic block diagram 3000 of the hardware structure with a single DMA and a pre-stored weight matrix on the field programmable gate array platform of the embodiment of the present invention. This structure is used when the BRAM resources inside the FPGA are relatively abundant: the weight matrix data is cached in advance in on-chip BRAM and the forward computation is performed from there. The structure includes:
a data read module 3100, with a DMA and a FIFO buffer and a data width of 32 bits, responsible for reading the weight parameters into the on-chip BRAM cache and for reading the neuron node data;
on-chip BRAM 3200, which caches the weight parameter data. Taking a tile size of 16 as an example, the rows of the weight matrix are stored cyclically across 16 different BRAMs, i.e. i%16 plus the BRAM base address is used as the addressing scheme, so as to guarantee that data is read from different BRAMs in parallel when the 16 parallel multiplications are performed;
a double register buffer 3300, in which each register group contains 16 registers storing the input neuron data; the two groups alternate between caching data and performing parallel computation. Note, however, that the time needed to fill a buffer must be less than the time needed to compute on its data; only then is the buffer-filling time covered by the computation time and the correctness of the result guaranteed;
a parallel floating-point multiplier 3400, which multiplies the weight parameter data and the neuron data in parallel. The floating-point computation is implemented with DSPs, and after pipeline optimisation 16 floating-point multiplications can be processed in parallel per clock cycle (the tile size here is 16). Since the number of input neurons is not necessarily divisible by 16, the last tile of a dot-product computation may contain fewer than 16 values, in which case the arithmetic unit fills the missing part with zeros before performing the parallel multiplication;
a binary floating-point adder tree 3500, which accumulates the floating-point results produced by the parallel floating-point multiplier 3400. Using a binary adder tree for the parallel computation removes the read-write dependences of the accumulation and reduces the time complexity of the accumulation from O(n) to O(log n) (a software sketch of this tree reduction is given after this module list);
an accumulation unit 3600: since the forward computation processing unit uses tiled computation, the results produced by the binary floating-point adder tree 3500 need to be accumulated, and the accumulation is cyclic with a period equal to the number of output neurons;
an activation function unit 3700, which implements the activation function by piecewise linear approximation, with the coefficients cached in on-chip BRAM;
a data write-back module 3800, with a DMA and a FIFO buffer and a data width of 32 bits, responsible for writing the computed results back to host memory.
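The O(log n) reduction performed by the binary adder tree (module 3500 above) can be sketched as follows (a behavioural model assuming a power-of-two number of inputs; the real tree is a pipelined hardware structure):

```python
def adder_tree(values):
    """Sum a list whose length is a power of two in log2(n) levels,
    pairing neighbours at each level as the hardware tree does."""
    level = list(values)
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

assert adder_tree([1, 2, 3, 4, 5, 6, 7, 8]) == 36   # 3 levels instead of 7 serial additions
```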
This hardware structure supports parameter configuration and can therefore support neural network computation of different scales. The configuration parameters are:
Data_size: the scale of the input neuron data;
Input_size: the number of input neurons; since the weight matrix data is cached in advance, this must be less than the maximum number of input neurons Max_input for which the on-chip BRAM can cache the weight parameters;
Output_size: the number of output neurons; since the weight matrix data is cached in advance, this must be less than the maximum number of output neurons Max_output for which the on-chip BRAM can cache the weight parameters;
Work_mode: 0 means that only matrix multiplication is performed; 1 means that matrix multiplication and the activation function are performed.
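On the host side such a configuration might be assembled and checked before being streamed to the unit over DMA, as in the following sketch (the field packing order and the Max_input/Max_output values are illustrative assumptions):

```python
from dataclasses import dataclass

MAX_INPUT, MAX_OUTPUT = 1024, 1024   # assumed BRAM-imposed limits of this unit

@dataclass
class ForwardConfig:
    data_size: int      # scale of the input neuron data
    input_size: int     # number of input neurons
    output_size: int    # number of output neurons
    work_mode: int      # 0: matrix multiply only, 1: multiply + activation

    def to_words(self):
        """Validate against the cached-weight limits and pack as 32-bit words for DMA."""
        assert self.input_size <= MAX_INPUT and self.output_size <= MAX_OUTPUT
        assert self.work_mode in (0, 1)
        return [self.data_size, self.input_size, self.output_size, self.work_mode]

cfg = ForwardConfig(data_size=256, input_size=784, output_size=100, work_mode=1)
words = cfg.to_words()   # stream these words to the unit before the data
```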
Fig. 10 is a schematic diagram 3600 of the hardware structure that performs the accumulation on the field programmable gate array platform of the embodiment of the present invention. The structure includes:
a floating-point addition unit 3610: because of the tiled scheme, the intermediate values produced by the dot products need to be accumulated. The stream of intermediate values is accumulated with a period of N, the number of output neurons (i.e. the number of columns of the second matrix), and the results are output in order once the accumulation is complete;
an intermediate value BRAM 3620: N storage units are provided inside the FPGA for the temporary data; the streamed data is added cyclically into the corresponding BRAM storage unit, and whether the accumulation has finished is judged from the relationship between the number of input neurons and the tile size. Since the number of stored intermediate values cannot be changed dynamically once the FPGA is designed, a maximum supported accumulation count MAX is fixed when the arithmetic unit is designed; the accumulation works normally only when the number of output neurons is below MAX.
This process is likewise pipeline-optimised, with the initiation interval optimised to 1 clock cycle, so that the rate at which intermediate values are produced matches the rate at which they are processed.
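The cyclic accumulation over the N output slots can be modelled as follows (a behavioural sketch; the slots stand in for the BRAM storage units and the input stream for the adder-tree outputs, ordered with period N as described above):

```python
def cyclic_accumulate(partial_sums, n_outputs, n_tiles):
    """partial_sums: stream of adder-tree results, one per (output, tile) pair,
    produced with period n_outputs. Returns the n_outputs accumulated results."""
    slots = [0.0] * n_outputs                          # the N BRAM storage units
    for i, v in enumerate(partial_sums):
        slots[i % n_outputs] += v                      # add into the slot with period N
    assert len(partial_sums) == n_outputs * n_tiles    # accumulation has finished
    return slots

# e.g. 3 output neurons, 2 tiles per dot product
print(cyclic_accumulate([1, 2, 3, 10, 20, 30], n_outputs=3, n_tiles=2))  # [11, 22, 33]
```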
Fig. 11 shows the hardware structure 3700 that implements the activation function by piecewise linear approximation on the field programmable gate array platform of the embodiment of the present invention.
The activation function is implemented by piecewise linear approximation, with the details shown in Fig. 11. Unlike Fig. 8, a path is added that passes X directly through to Y, allowing the forward computation arithmetic unit to perform only the matrix multiplication without the activation function processing; this is mainly used for the matrix multiplications in the error value computation of the training process. Since S-shaped activation functions are essentially symmetric about a point (taking the sigmoid function as an example, it is symmetric about (0, 0.5)), the value for x less than 0 is computed as 1 - f(-x), so that the hardware logic can be reused and the use of hardware resources reduced. Furthermore, when x equals 8, f(x) equals 0.999665 and is already extremely close to 1, so when x is greater than 8 the result is directly assigned the value 1.
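Combining the table lookup of Fig. 8 with the symmetry and saturation handling of Fig. 11 gives roughly the following reference behaviour (the interval width of 0.25 is an assumed value; the saturation threshold of 8 follows the text):

```python
import numpy as np

STEP = 0.25
xs = np.arange(0.0, 8.0, STEP)                # positive half only, thanks to the symmetry trick
sig = lambda x: 1.0 / (1.0 + np.exp(-x))
a = (sig(xs + STEP) - sig(xs)) / STEP         # per-interval slope a_i
b = sig(xs) - a * xs                          # per-interval intercept b_i

def sigmoid_hw(x):
    """Sigmoid as the unit evaluates it: saturation, symmetry reuse, then table lookup."""
    if x >= 8.0:                              # f(8) = 0.999665, so assign 1 from 8 upward
        return 1.0
    if x < 0.0:                               # sigmoid(x) = 1 - sigmoid(-x): reuse the logic
        return 1.0 - sigmoid_hw(-x)
    i = int(x // STEP)                        # interval index into the coefficient BRAM
    return float(a[i] * x + b[i])

def forward_output(y, work_mode):
    """Work_mode 0 passes the value straight through; 1 applies the activation (Fig. 11)."""
    return y if work_mode == 0 else sigmoid_hw(y)
```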
Fig. 12 is the computation flowchart of the forward computation hardware arithmetic unit with a single DMA and pre-stored weight parameters on the field programmable gate array platform of the embodiment of the present invention.
The configuration data is first read from the DMA, and the node data is then read according to the configuration information. When the node data is read, register group a is filled first and the flag is set to 0; afterwards, depending on the value of flag%2, the input node data alternately fills register group a or register group b. Likewise, depending on the value of flag%2, the data in one register group and the weight data cached in BRAM are multiplied in parallel, then summed by the binary adder tree and accumulated. After the accumulation, the result is either passed through the activation function or output directly, depending on the working mode.
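The alternation of the two register groups in this flow can be pictured with the following sketch (a functional model only; in hardware the refilling of one group overlaps the computation on the other):

```python
import numpy as np

def single_dma_forward(node_tiles, weight_tiles):
    """Behavioural model of the Fig. 12 flow: node tiles alternately fill register group a
    (flag even) and register group b (flag odd); the group just filled is multiplied against
    the BRAM-cached weight tile and summed by the adder tree."""
    groups = {0: None, 1: None}              # register group a / register group b
    partial_sums = []
    for flag, (n_tile, w_tile) in enumerate(zip(node_tiles, weight_tiles)):
        groups[flag % 2] = n_tile            # refill one group while the other computes (in HW)
        partial_sums.append(float(np.dot(groups[flag % 2], w_tile)))  # 16 multiplies + adder tree
    return partial_sums

tiles = [np.random.rand(16) for _ in range(4)]
weights = [np.random.rand(16) for _ in range(4)]
print(single_dma_forward(tiles, weights))    # partial sums, ready for the cyclic accumulation
```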
Fig. 13 is the structural schematic diagram 4000 of the forward computation hardware arithmetic unit with two DMAs reading in parallel on the field programmable gate array platform of the embodiment of the present invention. This hardware structure targets high-bandwidth FPGA chips and uses two DMAs reading in parallel to guarantee high throughput. With a tile size of 16, the structure includes:
a neuron data read module 4100, with a DMA and a FIFO buffer and a data width of 512 bits, responsible for reading the input neuron node data; 16 32-bit single-precision floating-point values are obtained by shifting. Since the transfer width is 512 bits, the data must be address-aligned in host memory. Furthermore, when the number of input neurons is not divisible by 16, the neuron node data matrix must be zero-padded on the host side: 16 - Input_size%16 zeros are appended to the end of each row, where Input_size is the number of input neurons; no padding is needed when Input_size%16 equals 0. Each data item is reused Output_size times here, where Output_size is the number of output neurons;
a weight parameter data read module 4200, with a DMA and a FIFO buffer and a data width of 512 bits, responsible for reading the weight parameter data; 16 32-bit single-precision floating-point values are obtained by shifting. Again, because the transfer width is 512 bits, the data must be address-aligned in host memory. When the number of input neurons is not divisible by 16, the weight parameter data matrix must also be zero-padded on the host side: 16 - Input_size%16 zeros are appended to the end of each column; likewise no padding is needed when Input_size%16 equals 0. After padding, since DMA transfers require contiguous physical addresses, the storage layout of the weight parameter matrix needs to be rearranged to suit the DMA transfer (the host-side padding for both read modules is sketched after this module list);
a parallel floating-point multiplier 4300, which multiplies the weight parameter data and the neuron data in parallel; the floating-point computation is implemented with DSPs, and after pipeline optimisation 16 floating-point multiplications can be processed in parallel per clock cycle;
a binary floating-point adder tree 4400, which accumulates the floating-point results produced by the parallel floating-point multiplier 4300; using a binary adder tree for the parallel computation removes the read-write dependences of the accumulation and reduces the time complexity of the accumulation from O(n) to O(log n);
an accumulation unit 4500: since the forward computation processing unit uses tiled computation, the results produced by the binary floating-point adder tree 4400 need to be accumulated cyclically with a period equal to the number of output neurons; this structure is identical to structure 3600 and is not described further;
an activation function unit 4600, which implements the activation function by piecewise linear approximation, with the coefficients cached in on-chip BRAM; this structure is identical to structure 3700 and is not described further;
a data write-back module 4700, with a DMA and a FIFO buffer and a data width of 32 bits, responsible for writing the computed results back to host memory.
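The host-side padding and layout adjustment described for the two read modules above can be sketched as follows (the column-major layout chosen for the padded weight matrix is an assumption; the requirement stated above is only that the DMA sees contiguous physical addresses):

```python
import numpy as np

TILE = 16

def pad_nodes(nodes):
    """Append 16 - Input_size%16 zeros to the end of every row of the node matrix."""
    pad = (-nodes.shape[1]) % TILE
    return np.pad(nodes, ((0, 0), (0, pad)))

def pad_and_layout_weights(weights):
    """Append zeros to the end of every column, then store the matrix column by column
    so that each weight column is contiguous for the DMA transfer."""
    pad = (-weights.shape[0]) % TILE
    padded = np.pad(weights, ((0, pad), (0, 0)))
    return np.ascontiguousarray(padded.T).ravel()   # one contiguous buffer, column-major

nodes = np.random.rand(4, 20)                 # Input_size = 20 -> 12 zeros appended per row
weights = np.random.rand(20, 5)
print(pad_nodes(nodes).shape)                 # (4, 32)
print(pad_and_layout_weights(weights).shape)  # (160,) = 32 padded rows * 5 columns
```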
This hardware structure supports parameter configuration and can therefore support neural network computation of different scales. The configuration parameters are:
Data_size: the scale of the input neuron data;
Input_size: the number of input neurons;
Output_size: the number of output neurons;
Work_mode: 0 means that only matrix multiplication is performed; 1 means that matrix multiplication and the activation function are performed.
Fig. 14 is the computation flowchart of the forward computation hardware arithmetic unit with two DMAs reading in parallel on the field programmable gate array platform of the embodiment of the present invention.
The configuration information is first read from the node DMA, configuring the scale of the node data and weight data to be read by the arithmetic unit as well as the working mode. Then 512-bit words are read from the node DMA and the weight DMA respectively, and 16 neuron node values and 16 weight parameter values are obtained by parallel shifting. Because the accelerator reuses the node data, one node data item is read every Output_size clock cycles, while one weight parameter data item is read every clock cycle. After the data is read, the 16 parallel multiplications and the 16-input binary adder tree summation are performed in turn. The summed results are added cyclically into the designated BRAM storage locations, and whether the accumulation has finished is judged. After the accumulation, the result is either output directly or passed through the piecewise-approximated activation function, depending on the working mode.
Fig. 15 is the hardware structural schematic diagram 5000 of the weight update hardware arithmetic unit on the field programmable gate array platform of the embodiment of the present invention. Two DMAs read in parallel to guarantee high throughput for the vector computation. The structure includes:
a vector A data read module 5100, with a DMA and a FIFO buffer and a width of 32 bits, which is also responsible for reading the configuration parameters;
a vector B data read module 5200, with a DMA and a FIFO buffer and a width of 32 bits;
a computation module 5300, which performs the corresponding vector computation according to the configuration information: when the working mode is 0, a*A + b*B is computed; when the working mode is 1, (a*A + b*B)*B*(1-B) is computed, where a and b are configuration parameters and A and B are the two vectors read in;
a result write-back module 5400, with a DMA and a FIFO buffer and a width of 32 bits, which writes the computation results back to host memory.
This hardware structure supports parameter configuration and can support vector computation of different scales. The configuration parameters are:
Data_size: the scale of the input vector data;
a: a coefficient value required by the computation;
b: a coefficient value required by the computation;
Work_mode: 0 means that a*A + b*B is computed; 1 means that (a*A + b*B)*B*(1-B) is computed.
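A reference model of the two working modes might read as follows (a behavioural sketch only; in the unit the vector elements stream through the DMAs and FIFOs rather than arriving as whole arrays, and the use cases named in the comments are illustrative):

```python
import numpy as np

def weight_update_unit(A, B, a, b, work_mode):
    """Mode 0: a*A + b*B  (e.g. a weight update from weights and gradients).
    Mode 1: (a*A + b*B) * B * (1 - B)  (e.g. an error term scaled by the sigmoid derivative)."""
    y = a * A + b * B
    if work_mode == 1:
        y = y * B * (1.0 - B)
    return y

W = np.random.rand(8); dW = np.random.rand(8)
new_W = weight_update_unit(W, dW, a=1.0, b=-0.01, work_mode=0)   # simple gradient step
```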
Fig. 16 is the computation flowchart of the weight update hardware arithmetic unit on the field programmable gate array platform of the embodiment of the present invention.
The configuration information is first read from DMA A; then, according to the configuration parameter Data_size, the vector values are read from DMA A and DMA B respectively, multiplied in parallel by the configuration parameters a and b and summed; finally, depending on the working mode, the result is optionally multiplied by B*(1-B) and written back to host memory through DMA A.
Fig. 17 shows one possible application scenario and framework of the deep learning accelerator in the heterogeneous multi-core reconfigurable computing platform of the embodiment of the present invention.
The composition of the application system here is illustrative, and the present invention is not limited to it. When a user sends an application request to the system, the control node of the application system assigns the request to the corresponding compute node through the scheduler; the compute node then offloads the tasks to be accelerated to the FPGA according to the specific application request.
The overall framework of each compute node consists of a hardware layer, a driver layer, a library layer, a service layer and an application layer. The hardware layer consists of the FPGA, the memory and the host-side CPU; the CPU acts as the controller of the system and controls the operating state and data reading of each hardware processing unit inside the FPGA (referred to as DL Modules in the figure), including the forward computation arithmetic units and the weight update units. The weight parameter data and neuron data required by the system are stored only in memory and are transferred between the memory and the hardware processing units by DMA. The driver layer is the hardware driver written for the hardware platform and operating system; the library layer is the application programming interface (API) encapsulated on top of the driver; the service layer provides the deep-learning-related computation acceleration services requested by users; and the application layer refers to the concrete applications of the deep learning prediction and training algorithms, such as image classification with a convolutional neural network prediction algorithm.
Those of ordinary skill in the art will appreciate that the methods and hardware structures described in conjunction with the embodiments disclosed herein can be realised with a combination of FPGA and CPU. The number and types of IP cores solidified inside a specific FPGA depend on the concrete application and on the resource constraints of the FPGA chip. Skilled practitioners may use different approaches or different degrees of parallelism to realise the described functions for each specific application or specific FPGA chip, but such implementations should not be considered as going beyond the scope of the present invention.
In the several embodiments provided in this application, it should be understood that the disclosed methods and hardware structures can be realised in other ways. For example, the deep learning applications described above, deep neural networks and convolutional neural networks, are illustrative; the tile size and parallel granularity in the forward computation arithmetic unit are likewise illustrative and can be adjusted to the specific situation; and the use of the AXI bus protocol for data transfer between the field programmable gate array and the general-purpose processor is also illustrative.
The above examples merely illustrate the technical concept and features of the present invention; their purpose is to enable persons skilled in the art to understand the content of the present invention and implement it accordingly, and they are not intended to limit the scope of protection of the present invention. All equivalent transformations or modifications made according to the spirit of the present invention shall be covered by the scope of protection of the present invention.

Claims (9)

1. A method for accelerating deep learning algorithms on a field programmable gate array platform, wherein the field programmable gate array platform comprises a general-purpose processor, a field programmable gate array and a memory module, comprising the following steps:
S01: analysing the deep learning prediction process and training process, together with deep neural networks and convolutional neural networks, to determine the general-purpose computation parts suitable for execution on the field programmable gate array platform;
S02: determining the software-hardware co-computation scheme according to the identified general-purpose computation parts;
S03: determining the number and types of IP cores to be solidified according to the logic resources and bandwidth of the FPGA, and performing the acceleration on the field programmable gate array platform with the hardware arithmetic units;
wherein the general-purpose computation parts include a forward computation module, the forward computation module comprising a forward computation hardware structure with a single DMA caching the weights and a forward computation hardware structure with two DMAs reading in parallel; the forward computation hardware structure with the single DMA caching the weights includes:
a single DMA responsible for reading data and writing results back;
a pair of register buffers that alternately read data and perform parallel computation, and a BRAM group that caches data and guarantees parallel reads;
floating-point multipliers equal in number to the tile size;
a binary adder tree whose number of inputs equals the tile size;
a cyclic accumulator, whose accumulated intermediate values are stored in on-chip BRAM;
an activation function computation module that implements the activation function by piecewise linear approximation, with the coefficients cached in on-chip BRAM;
the forward computation hardware structure with two DMAs reading in parallel includes:
a neuron data read module with a DMA and a FIFO buffer, responsible for reading the input neuron node data;
a weight parameter data read module with a DMA and a FIFO buffer, responsible for reading the weight parameter data;
floating-point multipliers equal in number to the tile size;
a binary adder tree whose number of inputs equals the tile size;
a cyclic accumulator, whose accumulated intermediate values are stored in on-chip BRAM;
an activation function computation module that implements the activation function by piecewise linear approximation, with the coefficients cached in on-chip BRAM.
2. The method for accelerating deep learning algorithms on a field programmable gate array platform according to claim 1, wherein the forward computation module is used for matrix multiplication and activation function computation, and a weight update module is used for vector computation.
3. The method for accelerating deep learning algorithms on a field programmable gate array platform according to claim 1, wherein step S02 comprises the following steps:
performing data preprocessing on the software side;
converting the convolution computation of the convolutional layers of the convolutional neural network into matrix multiplication;
using direct memory access as the data path of the software-hardware co-computation.
4. The method for accelerating deep learning algorithms on a field programmable gate array platform according to claim 1, wherein determining the number and types of IP cores to be solidified in step S03 comprises: determining the types of the arithmetic units to be solidified on the FPGA according to the hardware tasks to be executed; and determining the number of processing units for those hardware tasks according to the hardware logic resources and bandwidth of the FPGA.
5. The method for accelerating deep learning algorithms on a field programmable gate array platform according to claim 2, wherein the forward computation module adopts a tiled design: each row of the node matrix is split into tiles of the tile size and each column of the weight parameter matrix is split into tiles of the tile size; each tile of a node-matrix row is dot-multiplied with the corresponding tile of a weight-matrix column, and after a whole row has been processed the intermediate values are accumulated to obtain the final result.
6. The method for accelerating deep learning algorithms on a field programmable gate array platform according to claim 5, wherein the tile size is a power of 2 and is consistent with the parallel granularity of the arithmetic unit.
7. An FPGA structure for accelerating deep learning algorithms, comprising:
a tiling structure, which splits the node data matrix and the weight parameter matrix of the forward computation module into tiles and time-multiplexes the hardware logic;
an activation function piecewise-linear approximation structure, for generating arbitrary activation functions;
a parameter configuration module, for configuring the parameters of the processing units;
a forward computation module, comprising a forward computation hardware structure with a single DMA caching the weights and a forward computation hardware structure with two DMAs reading in parallel, used for the forward computation of deep neural networks, the forward computation of the convolutional and classification layers of convolutional neural networks, and matrix multiplication, pipelined and optimised for maximum throughput;
the forward computation hardware structure with the single DMA caching the weights comprising:
a single DMA responsible for reading data and writing results back;
a pair of register buffers that alternately read data and perform parallel computation, and a BRAM group that caches data and guarantees parallel reads;
floating-point multipliers equal in number to the tile size;
a binary adder tree whose number of inputs equals the tile size;
a cyclic accumulator, whose accumulated intermediate values are stored in on-chip BRAM;
an activation function computation module that implements the activation function by piecewise linear approximation, with the coefficients cached in on-chip BRAM;
the forward computation hardware structure with two DMAs reading in parallel comprising:
a neuron data read module with a DMA and a FIFO buffer, responsible for reading the input neuron node data;
a weight parameter data read module with a DMA and a FIFO buffer, responsible for reading the weight parameter data;
floating-point multipliers equal in number to the tile size;
a binary adder tree whose number of inputs equals the tile size;
a cyclic accumulator, whose accumulated intermediate values are stored in on-chip BRAM;
an activation function computation module that implements the activation function by piecewise linear approximation, with the coefficients cached in on-chip BRAM;
and a weight update module, used for vector computation.
8. The FPGA structure for accelerating deep learning algorithms according to claim 7, wherein the parameter configuration module configures the processing units by transferring configuration parameter data over DMA, including: the working mode configuration and data scale configuration of the forward computation module, where the data scale configuration includes the node data scale configuration, the input neuron scale configuration and the output neuron scale configuration; and the data scale configuration, working mode configuration and computation parameter configuration of the weight update module.
9. The FPGA structure for accelerating deep learning algorithms according to claim 7, wherein the weight update module is used for the weight update computation and the computation of the output layer error values, pipelined and optimised for maximum throughput, and comprises: a vector A data read module and a vector B data read module, each with a DMA and a FIFO buffer, which read the two groups of vector values used in the computation; a computation module, which performs the corresponding vector computation according to the configuration information; and a result write-back module with a DMA and a FIFO buffer, which writes the computation results back to host memory.
CN201610596159.3A 2016-07-27 2016-07-27 Method and system for accelerating deep learning algorithms on a field programmable gate array platform Active CN106228238B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610596159.3A CN106228238B (en) 2016-07-27 2016-07-27 Method and system for accelerating deep learning algorithms on a field programmable gate array platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610596159.3A CN106228238B (en) 2016-07-27 2016-07-27 Method and system for accelerating deep learning algorithms on a field programmable gate array platform

Publications (2)

Publication Number Publication Date
CN106228238A CN106228238A (en) 2016-12-14
CN106228238B true CN106228238B (en) 2019-03-22

Family

ID=57534278

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610596159.3A Active CN106228238B (en) Method and system for accelerating deep learning algorithms on a field programmable gate array platform

Country Status (1)

Country Link
CN (1) CN106228238B (en)

Families Citing this family (75)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268931B (en) * 2016-12-30 2022-10-25 华为技术有限公司 Data processing method, device and system
US10565492B2 (en) * 2016-12-31 2020-02-18 Via Alliance Semiconductor Co., Ltd. Neural network unit with segmentable array width rotator
US10140252B2 (en) 2017-02-28 2018-11-27 Microsoft Technology Licensing, Llc Hardware node with matrix-vector multiply tiles for neural network processing
US11086967B2 (en) 2017-03-01 2021-08-10 Texas Instruments Incorporated Implementing fundamental computational primitives using a matrix multiplication accelerator (MMA)
CN107633297B (en) * 2017-03-10 2021-04-06 南京风兴科技有限公司 Convolutional neural network hardware accelerator based on parallel fast FIR filter algorithm
CN108629405B (en) * 2017-03-22 2020-09-18 杭州海康威视数字技术股份有限公司 Method and device for improving calculation efficiency of convolutional neural network
CN107145944B (en) * 2017-03-29 2020-10-16 浙江大学 Genetic algorithm and system based on FPGA efficient training
EP3627437B1 (en) * 2017-04-06 2022-11-09 Cambricon (Xi'an) Semiconductor Co., Ltd. Data screening device and method
CN108734288B (en) * 2017-04-21 2021-01-29 上海寒武纪信息科技有限公司 Operation method and device
CN108804974B (en) * 2017-04-27 2021-07-02 深圳鲲云信息科技有限公司 Method and system for estimating and configuring resources of hardware architecture of target detection algorithm
CN107392308B (en) * 2017-06-20 2020-04-03 中国科学院计算技术研究所 Convolutional neural network acceleration method and system based on programmable device
CN107423030A (en) * 2017-07-28 2017-12-01 郑州云海信息技术有限公司 Markov Monte carlo algorithm accelerated method based on FPGA heterogeneous platforms
CN107480782B (en) * 2017-08-14 2020-11-10 电子科技大学 On-chip learning neural network processor
CN107506173A (en) * 2017-08-30 2017-12-22 郑州云海信息技术有限公司 A kind of accelerated method, the apparatus and system of singular value decomposition computing
CN107392309A (en) * 2017-09-11 2017-11-24 东南大学—无锡集成电路技术研究所 A kind of general fixed-point number neutral net convolution accelerator hardware structure based on FPGA
CN107657581B (en) * 2017-09-28 2020-12-22 中国人民解放军国防科技大学 Convolutional neural network CNN hardware accelerator and acceleration method
CN109726809B (en) * 2017-10-30 2020-12-08 赛灵思公司 Hardware implementation circuit of deep learning softmax classifier and control method thereof
CN107862650B (en) * 2017-11-29 2021-07-06 中科亿海微电子科技(苏州)有限公司 Method for accelerating calculation of CNN convolution of two-dimensional image
CN108090496A (en) * 2017-12-22 2018-05-29 银河水滴科技(北京)有限公司 The method and apparatus of image procossing based on convolutional neural networks
CN108231086A (en) * 2017-12-24 2018-06-29 航天恒星科技有限公司 A kind of deep learning voice enhancer and method based on FPGA
CN109993287B (en) * 2017-12-29 2019-12-06 北京中科寒武纪科技有限公司 neural network processing method, computer system, and storage medium
CN108416422B (en) * 2017-12-29 2024-03-01 国民技术股份有限公司 FPGA-based convolutional neural network implementation method and device
CN108280514B (en) * 2018-01-05 2020-10-16 中国科学技术大学 FPGA-based sparse neural network acceleration system and design method
CN108090560A (en) * 2018-01-05 2018-05-29 中国科学技术大学苏州研究院 The design method of LSTM recurrent neural network hardware accelerators based on FPGA
CN108229670B (en) * 2018-01-05 2021-10-08 中国科学技术大学苏州研究院 Deep neural network acceleration platform based on FPGA
CN110018979A (en) * 2018-01-09 2019-07-16 幻视互动(北京)科技有限公司 It is a kind of based on restructing algorithm collection and accelerate handle mixed reality data flow MR intelligent glasses and method
WO2019136755A1 (en) * 2018-01-15 2019-07-18 深圳鲲云信息科技有限公司 Method and system for optimizing design model of artificial intelligence processing device, storage medium, and terminal
WO2019136751A1 (en) * 2018-01-15 2019-07-18 深圳鲲云信息科技有限公司 Artificial intelligence parallel processing method and apparatus, computer readable storage medium, and terminal
CN109496319A (en) * 2018-01-15 2019-03-19 深圳鲲云信息科技有限公司 Artificial intelligence process device hardware optimization method, system, storage medium, terminal
US11874898B2 (en) 2018-01-15 2024-01-16 Shenzhen Corerain Technologies Co., Ltd. Streaming-based artificial intelligence convolution processing method and apparatus, readable storage medium and terminal
CN108229671B (en) * 2018-01-16 2022-03-04 华南理工大学 System and method for reducing storage bandwidth requirement of external data of accelerator
CN108320022A (en) * 2018-01-23 2018-07-24 深圳市易成自动驾驶技术有限公司 Deep learning system constituting method, device, deep learning system and storage medium
US11568232B2 (en) * 2018-02-08 2023-01-31 Quanta Computer Inc. Deep learning FPGA converter
CN110222833B (en) * 2018-03-01 2023-12-19 华为技术有限公司 Data processing circuit for neural network
CN108764466B (en) * 2018-03-07 2022-02-11 东南大学 Convolution neural network hardware based on field programmable gate array and acceleration method thereof
CN110363291B (en) * 2018-03-26 2022-02-08 上海寒武纪信息科技有限公司 Operation method and device of neural network, computer equipment and storage medium
CN110321998B (en) * 2018-03-31 2022-06-14 赛灵思公司 Convolutional neural network implementation method and device, acceleration equipment and storage medium
CN108520297B (en) * 2018-04-02 2020-09-04 周军 Programmable deep neural network processor
CN108710941A (en) * 2018-04-11 2018-10-26 杭州菲数科技有限公司 The hard acceleration method and device of neural network model for electronic equipment
US10657442B2 (en) * 2018-04-19 2020-05-19 International Business Machines Corporation Deep learning accelerator architecture with chunking GEMM
CN108629408A (en) * 2018-04-28 2018-10-09 济南浪潮高新科技投资发展有限公司 A kind of deep learning dynamic model based on FPGA cuts out inference system and method
US11875251B2 (en) * 2018-05-03 2024-01-16 Samsung Electronics Co., Ltd. Neural network method and apparatus
CN108665059A (en) * 2018-05-22 2018-10-16 中国科学技术大学苏州研究院 Convolutional neural networks acceleration system based on field programmable gate array
CN108763159A (en) * 2018-05-22 2018-11-06 中国科学技术大学苏州研究院 To arithmetic accelerator before a kind of LSTM based on FPGA
TWI672643B (en) * 2018-05-23 2019-09-21 倍加科技股份有限公司 Full index operation method for deep neural networks, computer devices, and computer readable recording media
CN110633226A (en) * 2018-06-22 2019-12-31 武汉海康存储技术有限公司 Fusion memory, storage system and deep learning calculation method
CN108920413B (en) * 2018-06-28 2019-08-09 中国人民解放军国防科技大学 Convolutional neural network multi-core parallel computing method facing GPDSP
CN108805277A (en) * 2018-06-29 2018-11-13 中国科学技术大学苏州研究院 Depth belief network based on more FPGA accelerates platform and its design method
CN110738316B (en) * 2018-07-20 2024-05-14 北京三星通信技术研究有限公司 Operation method and device based on neural network and electronic equipment
CN110826707B (en) * 2018-08-10 2023-10-31 北京百度网讯科技有限公司 Acceleration method and hardware accelerator applied to convolutional neural network
CN109359732B (en) * 2018-09-30 2020-06-09 阿里巴巴集团控股有限公司 Chip and data processing method based on chip
CN109344109B (en) * 2018-10-23 2022-07-26 江苏华存电子科技有限公司 System and method for accelerating artificial intelligence calculation in big data based on solid state disk
CN111090503B (en) * 2018-10-24 2023-07-21 上海雪湖信息科技有限公司 High-cost-performance cloud computing service system based on FPGA chip
CN109376332A (en) * 2018-10-30 2019-02-22 南京大学 A kind of arbitrary order Kalman filtering system
TWI696961B (en) 2018-12-12 2020-06-21 財團法人工業技術研究院 Deep neural networks (dnn) hardware accelerator and operation method thereof
CN109523019B (en) * 2018-12-29 2024-05-21 百度在线网络技术(北京)有限公司 Accelerator, accelerating system based on FPGA, control method and CNN network system
CN109740748B (en) * 2019-01-08 2021-01-08 西安邮电大学 Convolutional neural network accelerator based on FPGA
CN109933370B (en) * 2019-02-01 2021-10-15 京微齐力(北京)科技有限公司 System chip for connecting FPGA and artificial intelligence module
CN109816108A (en) * 2019-02-15 2019-05-28 领目科技(上海)有限公司 Deep learning accelerator, device and method
CN110032374B (en) * 2019-03-21 2023-04-07 深兰科技(上海)有限公司 Parameter extraction method, device, equipment and medium
CN110084363B (en) * 2019-05-15 2023-04-25 电科瑞达(成都)科技有限公司 Deep learning model acceleration method based on FPGA platform
CN110135572B (en) * 2019-05-17 2023-05-26 南京航空航天大学 SOC-based trainable flexible CNN system design method
CN112036557B (en) * 2019-06-04 2023-06-27 北京邮电大学 Deep learning system based on multiple FPGA development boards
CN110399979B (en) * 2019-06-17 2022-05-13 深圳大学 Click rate pre-estimation system and method based on field programmable gate array
CN112149047A (en) * 2019-06-27 2020-12-29 深圳市中兴微电子技术有限公司 Data processing method and device, storage medium and electronic device
CN110647983B (en) * 2019-09-30 2023-03-24 南京大学 Self-supervision learning acceleration system and method based on storage and calculation integrated device array
CN110928605B (en) * 2019-11-14 2023-05-02 天津大学 Beam adjustment method hardware accelerator based on Zynq FPGA
CN111176962B (en) * 2019-12-02 2021-09-10 深圳先进技术研究院 FPGA platform, performance evaluation and design optimization method thereof and storage medium
CN111061513B (en) * 2019-12-20 2022-02-01 支付宝(杭州)信息技术有限公司 Method for accelerating modeling of computing device, electronic device and readable storage medium
CN111884952B (en) * 2020-07-06 2021-05-25 华东师范大学 Multichannel calculation accelerating equipment based on FPGA
CN113485762A (en) * 2020-09-19 2021-10-08 广东高云半导体科技股份有限公司 Method and apparatus for offloading computational tasks with configurable devices to improve system performance
CN112433981A (en) * 2020-11-22 2021-03-02 中国人民解放军战略支援部队信息工程大学 Miniaturized software radio platform for high-speed intelligent signal processing
CN113673690B (en) * 2021-07-20 2024-05-28 天津津航计算技术研究所 Underwater noise classification convolutional neural network accelerator
CN115658323A (en) * 2022-11-15 2023-01-31 国网上海能源互联网研究院有限公司 FPGA load flow calculation acceleration architecture and method based on software and hardware cooperation
CN116630709B (en) * 2023-05-25 2024-01-09 中国科学院空天信息创新研究院 Hyperspectral image classification device and method capable of configuring mixed convolutional neural network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4366652B2 (en) * 2004-04-23 2009-11-18 横河電機株式会社 Transmitter and duplexing method thereof
US20140289445A1 (en) * 2013-03-22 2014-09-25 Antony Savich Hardware accelerator system and method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104112053A (en) * 2014-07-29 2014-10-22 中国航天科工集团第三研究院第八三五七研究所 Design method of reconfigurable architecture platform oriented image processing
CN104915322A (en) * 2015-06-09 2015-09-16 中国人民解放军国防科学技术大学 Method for accelerating convolution neutral network hardware and AXI bus IP core thereof
CN105162475A (en) * 2015-08-19 2015-12-16 中国人民解放军海军工程大学 FPGA (Field Programmable Gate Array) based parameterized multi-standard decoder with high throughput rate
CN105447285A (en) * 2016-01-20 2016-03-30 杭州菲数科技有限公司 Method for improving OpenCL hardware execution efficiency

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Deep Learning prediction process accelerator based FPGA; Qi Yu et al.; IEEE; 2015-12-31; pp. 1159-1162, Sections III-V
DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning; Tianshi Chen et al.; ACM; 2014-03-05; pp. 269-283, Abstract, Sections 2-3

Also Published As

Publication number Publication date
CN106228238A (en) 2016-12-14

Similar Documents

Publication Publication Date Title
CN106228238B (en) Accelerate the method and system of deep learning algorithm on field programmable gate array platform
JP7329533B2 (en) Method and accelerator apparatus for accelerating operations
KR102175044B1 (en) Apparatus and method for running artificial neural network reverse training
JP7358382B2 (en) Accelerators and systems for accelerating calculations
US10902315B2 (en) Device for implementing artificial neural network with separate computation units
US10282659B2 (en) Device for implementing artificial neural network with multiple instruction units
KR101959376B1 (en) Systems and methods for a multi-core optimized recurrent neural network
EP3298547B1 (en) Batch processing in a neural network processor
US20190065958A1 (en) Apparatus and Methods for Training in Fully Connected Layers of Convolutional Networks
KR102203746B1 (en) Apparatus and method for executing forward computation of artificial neural network
JP7078758B2 (en) Improving machine learning models to improve locality
Kästner et al. Hardware/software codesign for convolutional neural networks exploiting dynamic partial reconfiguration on PYNQ
AU2016203619A1 (en) Layer-based operations scheduling to optimise memory for CNN applications
CN112840356A (en) Operation accelerator, processing method and related equipment
CN103870335B (en) System and method for efficient resource management of signal flow programmed digital signal processor code
Stevens et al. Manna: An accelerator for memory-augmented neural networks
CN110414672B (en) Convolution operation method, device and system
CN110377874B (en) Convolution operation method and system
CN113655986B9 (en) FFT convolution algorithm parallel implementation method and system based on NUMA affinity
CN111506520B (en) Address generation method, related device and storage medium
Diamantopoulos et al. A system-level transprecision FPGA accelerator for BLSTM using on-chip memory reshaping
Abdelrazek et al. A novel architecture using NVIDIA CUDA to speed up simulation of multi-path fast fading channels
CN114298329A (en) Model training method, device, equipment and storage medium
Que Reconfigurable acceleration of recurrent neural networks
JP2023006509A (en) Software generation device and software generation method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant