CN106228238A — Method and system for accelerating deep learning algorithms on a field programmable gate array platform
Publication number: CN106228238A
Application number: CN201610596159.3A
Authority: CN (China)
Classifications

G — PHYSICS
G06 — COMPUTING; CALCULATING; COUNTING
G06N — COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
G06N3/00 — Computer systems based on biological models
G06N3/02 — Computer systems based on biological models using neural network models
G06N3/06 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
G06N3/063 — Physical realisation using electronic means
G06N3/08 — Learning methods
Description
Technical field
The present invention relates to the field of computer hardware acceleration, and more particularly to a method and system for accelerating deep learning algorithms on a field programmable gate array platform.
Background art
Deep learning has achieved remarkable results on high-level abstract cognitive problems and has brought machine learning to a new stage. It not only has great scientific value but also strong practical utility, and has attracted considerable attention from both academia and industry. However, to solve more abstract and more complex learning problems, the network scale of deep learning keeps growing, and the complexity of the computation and data grows sharply with it; for example, the Google "cat" network has about one billion neurons. Accelerating deep learning algorithms with high performance and low energy consumption has therefore become a research hotspot for scientific and commercial institutions.
Computational tasks generally take two forms. On a general-purpose processor, a task is usually expressed as software code and is called a software task; on a dedicated hardware circuit, the inherent speed of hardware replaces the software, and the task is called a hardware task. Common hardware acceleration technologies include the application-specific integrated circuit ASIC (Application Specific Integrated Circuit), the field programmable gate array FPGA (Field Programmable Gate Array) and the graphics processing unit GPU (Graphics Processing Unit). An ASIC is an integrated circuit chip designed and developed for a special purpose; it offers high performance, low power consumption and small area. Compared with an FPGA, an ASIC generally runs faster, consumes less power, and is cheaper in volume production. Although, for the same given function, an FPGA uses more transistors than an ASIC, the FPGA simplifies logic design and its design cycle is much shorter than that of an ASIC. Moreover, the mask cost of producing an ASIC is very high and grows exponentially as the line width shrinks. The FPGA, as a programmable off-the-shelf component adaptable to different functions, carries no such huge development cost and offers a degree of flexibility. The GPU is suited to massively parallel computation on large data, featuring high bandwidth, high clock frequency and high concurrency, and the CUDA (Compute Unified Device Architecture) general-purpose parallel computing framework makes it convenient for developers to design high-performance solutions quickly. However, GPU power consumption is high: a single GPU often consumes more power than a mainstream contemporary CPU, and compared with an FPGA it can consume tens or even hundreds of times more energy.
Summary of the invention
In view of this, the object of the present invention is to provide a method and system for accelerating deep learning algorithms on a field programmable gate array platform, capable of quickly designing hardware processing elements for deep learning acceleration according to the available hardware resources, the processing elements having high performance and low power consumption relative to a general-purpose processor.
The technical scheme of the present invention is as follows:
A method for accelerating deep learning algorithms on a field programmable gate array platform, characterized in that the field programmable gate array platform includes a general-purpose processor, a field programmable gate array and a memory module, the method comprising the following steps:
S01: according to the deep learning prediction process and training process, and in combination with deep neural networks and convolutional neural networks, determining the general-purpose computation parts suitable to run on the field programmable gate array platform;
S02: according to the general-purpose computation parts so determined, deciding the software-hardware co-computation scheme;
S03: according to the logic resources and bandwidth of the FPGA, determining the number and kinds of IP cores to solidify, and performing the acceleration on the field programmable gate array platform using the hardware arithmetic units.
In a preferred technical scheme, the general-purpose computation parts include a forward calculation module, for matrix multiplication and activation function calculation, and a weight update module, for vector calculation.
In a preferred technical scheme, step S02 comprises the following steps:
preparing the data at the software end;
converting the convolution calculation of the convolutional layers in the convolutional neural network into matrix multiplication;
using direct memory access as the data path for the software-hardware co-computation.
In a preferred technical scheme, step S03 determines the number and kinds of IP cores to solidify by: according to the hardware tasks to be processed, determining the kinds of arithmetic units to solidify on the FPGA; and according to the FPGA hardware logic resources and bandwidth, determining the number of processing elements for the pending hardware tasks.
In a preferred technical scheme, the forward calculation module uses a slicing (tiling) design: each row of the node matrix is divided internally into slices of the slice size, and each column of the weight parameter matrix is likewise divided into slices of the slice size; each slice of a node-matrix row is dot-multiplied with the corresponding slice of a weight-matrix column, and after a whole row has been computed the intermediate values are accumulated to obtain the final result.
In a preferred technical scheme, the slice size is a power of two, kept consistent with the parallel granularity of the arithmetic unit.
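As an illustrative software sketch of this slicing scheme, assuming a slice size of 16 and zero-filling of an incomplete final slice (the function and names here are illustrative):

```python
import numpy as np

SLICE = 16  # slice size: a power of two matched to the unit's parallel granularity

def sliced_matmul(nodes: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Multiply the node matrix (M x K) by the weight matrix (K x N) slice by slice.

    Each node-matrix row and weight-matrix column is consumed SLICE elements at
    a time; partial dot products are accumulated into the result, mirroring the
    time-multiplexed hardware logic (the last slice is zero-filled if needed).
    """
    m, k = nodes.shape
    _, n = weights.shape
    pad = (-k) % SLICE                       # zeros appended when K % SLICE != 0
    a = np.pad(nodes, ((0, 0), (0, pad)))
    b = np.pad(weights, ((0, pad), (0, 0)))
    out = np.zeros((m, n), dtype=np.float32)
    for s in range(0, k + pad, SLICE):       # one slice of multiply-adds per step
        out += a[:, s:s + SLICE] @ b[s:s + SLICE, :]
    return out
```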
The present invention further discloses an FPGA structure for accelerating deep learning algorithms, characterized in that it comprises:
a slice processing structure, which slices the node data matrix and the weight parameter matrix of the forward calculation module and time-multiplexes the hardware logic;
an activation function piecewise linear approximation structure, used to realize an arbitrary activation function;
a parameter configuration module, for configuring the parameters of the processing elements;
a forward calculation module, comprising a single-DMA weight-caching forward calculation hardware structure and a double-DMA parallel-read forward calculation hardware structure, used for the forward calculation of deep neural networks and of the convolutional and classification layers of convolutional neural networks, and for matrix multiplication, pipelined and optimized for maximum throughput;
a weight update module, for vector calculation.
In a preferred technical scheme, the parameter configuration module configures the processing elements by transferring configuration parameter data over DMA, including: the working mode configuration and data scale configuration of the forward calculation module, where the data scale configuration comprises the node data scale, input neuron scale and output neuron scale configurations; and the data scale, working mode and calculation coefficient configurations of the weight update module.
In a preferred technical scheme, the single-DMA weight-caching forward calculation hardware structure includes:
a single DMA, responsible for reading data and writing results back;
a double register buffer, whose two register groups alternate between reading data and feeding the parallel computation; a BRAM group, which caches data and guarantees parallel reads;
floating-point multipliers equal in number to the slice size;
a binary adder tree whose number of inputs equals the slice size;
a cyclic accumulator, whose intermediate values are kept in on-chip BRAM;
an activation function calculation module, which realizes the activation function by piecewise linear approximation, the coefficients being cached in on-chip BRAM.
The double-DMA parallel-read forward calculation hardware structure includes:
a neuron data read module, equipped with a DMA and a FIFO buffer, responsible for reading the input neuron node data;
a weight parameter data read module, equipped with a DMA and a FIFO buffer, responsible for reading the weight parameter data;
floating-point multipliers equal in number to the slice size;
a binary adder tree whose number of inputs equals the slice size;
a cyclic accumulator, whose intermediate values are kept in on-chip BRAM;
an activation function calculation module, which realizes the activation function by piecewise linear approximation, the coefficients being cached in on-chip BRAM.
In a preferred technical scheme, the weight update module is used for the weight update and output-layer error calculations, pipelined and optimized for maximum throughput, and includes: a vector A data read module and a vector B data read module, each equipped with a DMA and a FIFO buffer, which read the two vector operands; a calculation module, which performs the vector calculation selected by the configuration information; and a result write-back module, equipped with a DMA and a FIFO buffer, which writes the calculation results back to host memory.
Compared with the prior art, the present invention has the following advantages:
The present invention can effectively accelerate deep learning algorithms, covering both the prediction process and the training process. It can quickly design hardware processing elements for deep learning acceleration according to the available hardware resources, and the processing elements offer high performance and low power consumption relative to a general-purpose processor.
Brief description of the drawings
The invention will be further described below in conjunction with the accompanying drawings and embodiments:
Fig. 1 is a flow chart of the method for accelerating deep learning on the field programmable gate array platform of an embodiment of the present invention;
Fig. 2 is a schematic diagram of the calculation of a convolutional layer in a convolutional neural network;
Fig. 3 is a schematic diagram of the forward calculation hardware processing element on the field programmable gate array platform of the embodiment converting the convolutional layer calculation;
Fig. 4 is a schematic diagram of the weight update processing element on the field programmable gate array platform of the embodiment converting a data matrix into vectors;
Fig. 5 is a schematic structural diagram of the software-hardware co-computation on the field programmable gate array platform of the embodiment;
Fig. 6 is a schematic diagram of hardware processing element resource usage and of the number and kinds of solidified units for the resources and applications of the field programmable gate array platform of the embodiment;
Fig. 7 is a schematic diagram of the data slicing process of the forward calculation processing element of the embodiment;
Fig. 8 is a schematic diagram of realizing the activation function by piecewise linear approximation in the embodiment;
Fig. 9 is a schematic structural diagram of the single-DMA weight-matrix-prestoring forward calculation hardware processing element on the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 10 is a schematic structural diagram of the accumulation process in the forward calculation hardware processing element on the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 11 is a schematic structural diagram of the piecewise approximation of the sigmoid function in the forward calculation hardware processing element on the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 12 is the data processing flow chart of the single-DMA weight-matrix-prestoring forward calculation hardware processing element on the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 13 is a schematic structural diagram of the double-DMA parallel-read forward calculation hardware processing element on the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 14 is the data processing flow chart of the double-DMA parallel-read forward calculation hardware processing element on the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 15 is a schematic structural diagram of the weight update hardware processing element on the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 16 is the data processing flow chart of the weight update hardware processing element on the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 17 is a schematic diagram of a possible application scenario and framework of the deep learning accelerator on the heterogeneous multi-core reconfigurable computing platform of the embodiment.
Detailed description of the embodiments
The above scheme is further described below in conjunction with specific embodiments. It should be understood that these embodiments are intended to illustrate the present invention and not to limit its scope. The implementation conditions used in the embodiments may be further adjusted according to the conditions of the specific producer, and unmarked implementation conditions are usually those of routine experiments.
Embodiment:
The field programmable gate array platform in the embodiment of the present invention refers to a computing system integrating both a general-purpose processor (General Purpose Processor, "GPP") and a field programmable gate array (Field Programmable Gate Arrays, "FPGA") chip, where the data path between the FPGA and the GPP may use the PCIE bus protocol, the AXI bus protocol, etc. The accompanying drawings of this embodiment illustrate the data path using the AXI bus protocol as an example, but the present invention is not limited thereto.
Fig. 1 is the flow chart of the method 100 of accelerating deep learning algorithms on the field programmable gate array platform of the embodiment of the present invention. The method 100 includes:
S110: according to the deep learning prediction process and training process, where the training process comprises a local pre-training process and a global training process, and in combination with deep neural networks and convolutional neural networks, determining the general-purpose computation parts suitable to run on the field programmable gate array platform;
S120: according to the confirmed general-purpose hardware computation modules, determining the software-hardware co-computation scheme;
S130: according to the logic resources and bandwidth of the field programmable gate array, determining the number and kinds of IP cores to solidify.
The method by which the embodiment of the present invention accelerates the general-purpose computation parts of deep learning is described in detail below in conjunction with Figs. 2 to 4.
Fig. 2 is a schematic diagram of convolutional layer calculation. Suppose there are 4 input feature maps and the convolution kernel size is 3x3; after the results of the 4 convolution kernels are summed, the value of the output feature map is obtained by processing through the activation function. Viewed as a whole, the basic calculation mode of the convolutional layer is similar to that of a deep neural network hidden layer: by merely reordering the convolution kernel parameters, the convolution calculation used here can be turned into a dot product calculation. The concrete adjustment is: 1) fill each input feature map, from top to bottom and row by row, into one row, as shown on the left of Fig. 3; 2) rotate each convolution kernel counterclockwise by 180 degrees and write it, from top to bottom and row by row, into one column of the weight matrix, as shown in the column of Fig. 3: the original kernels a through d, after the 180-degree rotation, become a9–a1, b9–b1, ..., d9–d1, filled in order into one column. In this way the prediction process of a convolutional layer reduces to the same form as a deep neural network hidden layer, namely matrix multiplication plus activation function processing, at the extra cost of the data conversion.
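An illustrative software sketch of this conversion for one output feature map (an im2col-style rearrangement; the exact Fig. 3 layout may differ, and the 180-degree rotation follows the adjustment just described):

```python
import numpy as np

def conv_layer_as_matmul(maps: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    """maps: (C, H, W) input feature maps; kernels: (C, 3, 3) kernels of one output map.

    Builds the node matrix (one row of patch values per output position) and one
    weight column (each kernel rotated 180 degrees, flattened row by row), so the
    convolutional layer reduces to a single matrix-vector product.
    """
    c, h, w = maps.shape
    kh, kw = kernels.shape[1:]
    rows = [maps[:, i:i + kh, j:j + kw].ravel()      # C*kh*kw patch values
            for i in range(h - kh + 1)
            for j in range(w - kw + 1)]
    node = np.stack(rows)                            # (positions, C*kh*kw)
    col = np.concatenate([np.rot90(k, 2).ravel() for k in kernels])
    return (node @ col).reshape(h - kh + 1, w - kw + 1)
```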
During deep learning training, besides a large number of matrix multiplications, a large number of vector calculations are also needed, and the matrix data must be converted into vector data for the vector calculation. As shown in Fig. 4, each row of the data is concatenated in order into one vector on which the vector calculation is performed.
Therefore, with reference to Figs. 2 to 4, this embodiment reduces the general-purpose computation parts of the deep learning prediction and training processes to matrix multiplication calculation, activation function calculation and a large amount of vector calculation.
Fig. 5 shows the structural framework 200 of the software-hardware co-computation used in this embodiment. The structure includes:
a Processing System (PS) 210, acting as the control end of the whole system and comprising a CPU and memory. The CPU, as the host, runs the software-side code and offloads acceleration tasks to the PL side. In addition, the CPU controls the working state of each IP core on the PL side (an intellectual property core here denotes a hardware computation unit) and the reading of data;
an FPGA Programming Logic (PL) 220, the FPGA chip serving as the hardware acceleration component of the whole system. IP cores can be solidified on the FPGA chip according to the different acceleration tasks to accelerate the algorithm. The system, scheduled by the PS side according to the specific algorithm, selects different IP cores for parallel computation; host-side software tasks can also run in parallel with FPGA-side hardware tasks;
a data bus (Data Bus) 230, responsible for data transmission between the PS side and the PL side of the whole system;
a control signal bus (Control Bus) 240, responsible for control signal transmission between the PS side and the PL side of the whole system.
Fig. 6 shows the overall accelerator structure 2000 based on the FPGA design. The structure includes:
a system controller 2100, responsible for controlling the execution state of each hardware computation unit, the data transmission and the program scheduling, and also responsible for running the non-general-purpose computation parts of deep learning, the data initialization, and the initialization tasks of the hardware computation units (also called IP cores);
a memory 2200, responsible for storing the deep learning network parameters and the original input data; the physical addresses of the stored data are required to be contiguous here, to facilitate DMA data transmission;
a data bus protocol 2300: AXI-Stream, which allows unrestricted data burst transmission, a high-performance data transmission protocol;
a control bus protocol 2400: AXI-Lite, a lightweight address-mapped single-transmission protocol, suitable for the control signal transmission of the hardware arithmetic units;
a data interconnect 2500, interconnecting the data paths;
a control interconnect 2600, interconnecting the control signal lines;
a direct memory access DMA 2700, responsible for the data transmission between the accelerator and memory; each hardware processing element is equipped with its own DMA for parallel data reads;
PEs (Processing Elements) 2800, the computation units of the accelerator; each may internally solidify one forward calculation arithmetic unit, one weight update arithmetic unit, or both. Since the FPGA is programmable and reconfigurable, the number of PEs here can be configured dynamically according to the resources and bandwidth of the specific FPGA chip, so that the hardware computing resources are fully utilized without changing the hardware design of the arithmetic units, ensuring that the hardware delivers peak performance.
The method by which the embodiment of the present invention accelerates deep learning algorithms has been described in detail above in conjunction with Figs. 1 to 6; the hardware structures of the embodiment are introduced below.
Fig. 7 shows the forward calculation arithmetic unit designed with the slice computation mode. Suppose the slice size is 16: each row of the node matrix is divided internally into slices of 16, and each column of the weight parameter matrix is divided into slices of 16 elements. Every 16 values of a node-matrix row are dot-multiplied with the corresponding 16 values of a weight-matrix column, and once a whole row has been computed the intermediate values are accumulated to obtain the final result. This method not only takes full advantage of data locality, but also reduces the resources needed to solidify the parallel execution units and the data bandwidth required by the hardware, allowing a single arithmetic unit to perform matrix multiplication of arbitrary scale.
To maintain high throughput, the slice size should match the internal design of the arithmetic unit and stay consistent with the parallel granularity; for matrix multiplication the slice size can be set to a power of two, giving full play to the accumulation performance of the binary adder tree. Since the slice size is tied to the parallel granularity, in theory the larger the slice, the higher the parallelism and the better the performance of the arithmetic unit, so within the limits of the hardware resources and bandwidth the largest 2^n is chosen as the slice size of the arithmetic unit.
Fig. 8 is a schematic diagram of the hardware implementation of the activation function in this embodiment. This embodiment realizes an S-shaped activation function by piecewise linear approximation: the function is divided along the X axis into several equal intervals, and within each interval it is approximated linearly as Y = a_i*X + b_i, X ∈ [x_i, x_{i+1}), where x_{i+1} - x_i is the interval size of the approximation. When the activation function must be evaluated, the interval containing the X value is located first and the offsets of its corresponding a_i and b_i relative to the base address are computed; after one multiply-add operation the approximate Y value is obtained. This implementation has two benefits: 1) any S-shaped activation function or linear function can be realized without changing any hardware design, only the stored values of coefficients a and b need change; 2) the error is minimal — as the approximation interval shrinks, the error becomes negligible, at the sole cost of additional BRAM for storing coefficients a and b. Moreover, deep learning calculation itself does not demand the highest data accuracy, and a certain degree of precision loss does not affect the results.
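An illustrative sketch of building and evaluating such a coefficient table in software (the interval count and range here are illustrative):

```python
import numpy as np

def make_pwl_table(f, x_lo, x_hi, segments):
    """Precompute the per-interval coefficients (a_i, b_i) of Y = a_i*X + b_i."""
    xs = np.linspace(x_lo, x_hi, segments + 1)
    a = (f(xs[1:]) - f(xs[:-1])) / (xs[1:] - xs[:-1])   # slope of each interval
    b = f(xs[:-1]) - a * xs[:-1]                        # intercept of each interval
    return xs, a, b

def pwl_eval(x, xs, a, b):
    """One table lookup plus one multiply-add, as the hardware does."""
    i = int(np.clip(np.searchsorted(xs, x) - 1, 0, len(a) - 1))
    return a[i] * x + b[i]

# e.g. a 64-segment sigmoid table on [-8, 8]:
xs, a, b = make_pwl_table(lambda v: 1.0 / (1.0 + np.exp(-v)), -8.0, 8.0, 64)
```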
Fig. 9 is the schematic block diagram 3000 of the single-DMA weight-matrix-prestoring hardware structure on the field programmable gate array platform of the embodiment. This hardware structure targets the case where the BRAM resources inside the FPGA are relatively plentiful: the weight matrix data is cached in advance in on-chip BRAM and the forward calculation is then carried out. The structure includes:
a data read module 3100, equipped with a DMA and a FIFO buffer with a data width of 32 bits, responsible for reading the weight parameters into on-chip BRAM and for reading the neuron node data.
On-chip BRAM 3200, which caches the weight parameter data. Taking a slice size of 16 as an example, the weight matrix is stored row by row across the different BRAMs cyclically with a period of 16, i.e. element i goes to BRAM i % 16, with the base address of the BRAM added as the addressing scheme; this guarantees that the 16 parallel multiplications read their data from 16 different BRAMs.
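A minimal sketch of this cyclic bank assignment (names are illustrative):

```python
SLICE = 16  # one BRAM bank per parallel multiplier

def bram_bank(i: int) -> int:
    """Weight element i lands in BRAM i % 16, so any 16 consecutive weights
    occupy 16 distinct banks and can all be read in the same cycle."""
    return i % SLICE

def bram_offset(i: int) -> int:
    """Word offset added to the bank's base address."""
    return i // SLICE
```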
A double register buffer 3300, where each register group comprises 16 registers for storing input neuron data; the groups alternate between buffering data and feeding the parallel computation. Note, however, that the time required to fill a buffer must be less than the time required to compute on that data; this guarantees that the buffer-read time is covered by the computation time and that the results are correct.
A parallel floating-point multiplier 3400, which multiplies the weight parameter data and the neuron data in parallel; the floating-point computation is implemented with DSPs and, after pipeline optimization, can process 16 floating-point multiplications in parallel per clock cycle, the slice size again being 16 in this example. Since the number of input neurons is not necessarily divisible by 16, the last slice of a row's dot-product calculation may hold fewer than 16 numbers, in which case the arithmetic unit fills the missing part of the 16 with zeros before the parallel multiplication.
A binary floating-point adder tree 3500, which accumulates the floating-point results produced by the parallel floating-point multiplier 3400; the adder tree computes in parallel, removing the read-write dependency of sequential accumulation and reducing the time complexity of the accumulation from O(n) to O(log n).
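An illustrative sketch of the pairwise tree reduction:

```python
def adder_tree_sum(vals):
    """Pairwise (tree) reduction: log2(n) levels instead of an n-step chain,
    which is what removes the read-write dependency of serial accumulation."""
    vals = list(vals)
    while len(vals) > 1:
        if len(vals) % 2:
            vals.append(0.0)   # pad an odd level, like an unused hardware input
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]
```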
An accumulation calculation 3600: since the forward calculation processing element uses slice processing, the results produced by the adder tree 3500 must themselves be accumulated, cyclically, with a period equal to the number of output neurons.
An activation function calculation 3700, which realizes the activation function by piecewise linear approximation, the coefficients being cached in on-chip BRAM.
A data write-back module 3800, equipped with a DMA and a FIFO buffer with a data width of 32 bits, responsible for writing the calculation results back to host memory.
This hardware structure supports parameter configuration and can support neural network calculations of different scales. The detailed configuration parameters are:
Data_size: the scale of the input neuron data;
Input_size: the number of input neurons; since the weight matrix data is cached in advance, it must not exceed Max_input, the maximum number of input neurons whose weights the on-chip BRAM can cache;
Output_size: the number of output neurons; since the weight matrix data is cached in advance, it must not exceed Max_output, the maximum number of output neurons whose weights the on-chip BRAM can cache;
Work_mode: 0 means only matrix multiplication is performed; 1 means matrix multiplication plus activation function calculation (an illustrative grouping of these parameters follows).
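An illustrative grouping of the parameters, with hypothetical Max_input/Max_output bounds standing in for the chip-specific BRAM limits:

```python
from dataclasses import dataclass

MAX_INPUT = 4096    # illustrative BRAM-imposed limits; real values are chip-specific
MAX_OUTPUT = 4096

@dataclass
class ForwardConfig:
    data_size: int    # scale of the input neuron data
    input_size: int   # input neurons, bounded by the on-chip weight cache
    output_size: int  # output neurons, bounded likewise
    work_mode: int    # 0: matrix multiply only; 1: multiply + activation

    def validate(self) -> None:
        assert 0 < self.input_size <= MAX_INPUT
        assert 0 < self.output_size <= MAX_OUTPUT
        assert self.work_mode in (0, 1)
```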
Fig. 10 is the schematic diagram 3600 of the accumulation hardware structure on the field programmable gate array platform of the embodiment. The structure includes:
a floating-point addition calculation 3610: because the slice scheme is used, the intermediate values produced by the dot products must be accumulated. The intermediate-value data stream is accumulated with a period of N, the number of output neurons (equivalently, the number of columns of the second matrix), and is output in order after the accumulation;
an intermediate-value storage BRAM 3620: N storage cells are arranged inside the FPGA for the temporary data, the streaming data is added cyclically into the corresponding BRAM cell, and whether the accumulation has finished is judged from the relation between the input neuron count and the slice size. Since the number of cells for intermediate values cannot be changed dynamically once the FPGA internal design is fixed, the arithmetic unit is designed with a maximum supported accumulation count MAX; accumulation proceeds normally as long as the number of output neurons does not exceed MAX.
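An illustrative software model of this cyclic accumulation:

```python
def cyclic_accumulate(partials, n_outputs):
    """Fold the adder-tree output stream into n_outputs running sums.

    The stream cycles through the outputs (one partial per output neuron,
    then the next slice's partials), so cell idx % n_outputs is the right
    accumulator. n_outputs must not exceed the MAX cells provisioned on chip.
    """
    acc = [0.0] * n_outputs
    for idx, p in enumerate(partials):
        acc[idx % n_outputs] += p
    return acc
```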
This process is likewise pipeline-optimized, with the initiation interval optimized to one clock cycle, guaranteeing that intermediate values are produced and consumed at the same rate.
Fig. 11 shows the hardware structure 3700 of the piecewise linear approximation of the activation function on the field programmable gate array platform of the embodiment.
The activation function is realized by piecewise linear approximation, with the details shown in Fig. 11. Unlike Fig. 8, a path is added that passes X directly through to Y, so that the forward calculation arithmetic unit can perform matrix multiplication alone, without activation function processing; this serves mainly the matrix multiplications used in error-value calculation during training. Since an S-shaped activation function is essentially symmetric about some point — taking the sigmoid function as an example, it is symmetric about (0, 0.5) — the hardware logic can be reused: when x is less than 0, the result is computed from the positive half as 1 - f(-x), reducing the use of hardware resources. Furthermore, when x equals 8, f(x) equals 0.999665, already infinitely close to 1, so when x is greater than 8 the result is directly assigned the value 1.
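An illustrative sketch of the evaluation rule just described, assuming a piecewise-linear table pwl_pos valid on [0, 8]:

```python
def sigmoid_hw(x, pwl_pos):
    """Evaluate sigmoid the way Fig. 11 does: clamp to 1 beyond x = 8, and
    reuse the positive-half table via the symmetry f(x) = 1 - f(-x)."""
    if x > 8.0:
        return 1.0
    if x < 0.0:
        return 1.0 - sigmoid_hw(-x, pwl_pos)
    return pwl_pos(x)   # piecewise-linear lookup for x in [0, 8]
```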
Fig. 12 is the calculation flow chart of the single-DMA weight-prestoring forward calculation hardware computation unit on the field programmable gate array platform of the embodiment.
Configuration data is first read in sequence from the DMA, and node data is then read according to the configuration information. When node data is first read, register group a is filled and the flag is set to 0; thereafter the value of flag % 2 selects whether incoming node data refills register group a or register group b. Likewise, according to flag % 2, the data of one register group is multiplied in parallel with the weights cached in BRAM, then summed through the binary adder tree and accumulated. After accumulation, depending on the working mode, the result is either processed by the activation function or output directly.
Fig. 13 is the schematic structural diagram 4000 of the double-DMA parallel-read forward calculation hardware arithmetic unit on the field programmable gate array platform of the embodiment. This hardware structure designs the forward calculation module for high-bandwidth FPGA chips, using two DMAs reading in parallel to guarantee high throughput. With a slice size of 16 as the example, the structure includes:
a neuron data read module 4100, equipped with a DMA and a FIFO buffer with a data width of 512 bits, responsible for reading the input neuron node data and obtaining 16 32-bit single-precision floating-point numbers per beat by shifting. Since the transmission width of the data is 512 bits, the data must be address-aligned in host memory. Furthermore, when the input neuron count is not divisible by 16, the host must zero-fill the neuron node data matrix, appending 16 - Input_size % 16 zeros at the end of every row, where Input_size is the number of input neurons; no filling is needed when Input_size % 16 equals 0. Each datum here is reused Output_size times, where Output_size is the output neuron count.
A weight parameter data read module 4200, equipped with a DMA and a FIFO buffer with a data width of 512 bits, responsible for reading the weight parameter data and obtaining 16 32-bit single-precision floating-point numbers by shifting. Again, because the transmission width of the data is 512 bits, the data must be address-aligned in host memory, and when the input neuron count is not divisible by 16 the host must zero-fill the weight parameter data matrix, appending 16 - Input_size % 16 zeros at the end of every column; as before, no filling is needed when Input_size % 16 equals 0. After filling, since DMA transfers require contiguous physical addresses, the storage layout of the weight parameter matrix must be adjusted to suit the DMA.
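An illustrative sketch of the host-side zero-filling for both read modules (axis 1 pads row ends for node data, axis 0 pads column ends for weights):

```python
import numpy as np

def pad_for_dma(mat: np.ndarray, axis: int, slice_size: int = 16) -> np.ndarray:
    """Zero-fill node-data rows (axis=1) or weight columns (axis=0) up to a
    multiple of the 16-float (512-bit) DMA beat, as the host must do."""
    pad = (-mat.shape[axis]) % slice_size    # 16 - size % 16, or 0 when aligned
    widths = [(0, 0), (0, 0)]
    widths[axis] = (0, pad)
    return np.pad(mat, widths)
```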
A parallel floating-point multiplier 4300, which multiplies the weight parameter data and the neuron data in parallel; the floating-point computation uses DSPs and, after pipeline optimization, processes 16 floating-point multiplications in parallel per clock cycle.
A binary floating-point adder tree 4400, which accumulates the floating-point results of the parallel floating-point multiplier 4300 in parallel, removing the read-write dependency of sequential accumulation and reducing the time complexity of the accumulation from O(n) to O(log n).
An accumulation calculation 4500: as the forward calculation processing element uses slice processing, the results of the adder tree 4400 must be accumulated cyclically with a period equal to the output neuron count; this structure is identical to structure 3600 and is not described further.
An activation function calculation 4600, realizing the activation function by piecewise linear approximation with coefficients cached in on-chip BRAM; identical to structure 3700 and not described further.
A data write-back module 4700, equipped with a DMA and a FIFO buffer with a data width of 32 bits, responsible for writing the calculation results back to host memory.
This hardware structure supports parameter configuration and can support neural network calculations of different scales. The detailed configuration parameters are:
Data_size: the scale of the input neuron data;
Input_size: the number of input neurons;
Output_size: the number of output neurons;
Work_mode: 0 means only matrix multiplication is performed; 1 means matrix multiplication plus activation function calculation.
Fig. 14 is the calculation flow chart of the double-DMA parallel-read forward calculation hardware arithmetic unit on the field programmable gate array platform of the embodiment.
Configuration information is first read from the node DMA, configuring the scale of the node data and weight data read by the arithmetic unit and the working mode. Then 512 bits of data are read from the node DMA and the weight DMA respectively, and parallel shifting yields 16 neuron node data and 16 weight parameter data; since the accelerator reuses the node data, node data is read once every Output_size clock cycles while weight parameter data is read every clock cycle. After the data is read, the 16-way parallel multiplication and the 16-input binary adder tree summation are carried out in turn. The summation results are added cyclically into the designated BRAM storage locations, and the end of accumulation is tested. After accumulation ends, according to the working mode, the result is either output directly or processed by the piecewise-approximated activation function.
Fig. 15 is the schematic structural diagram 5000 of the weight update hardware computation unit on the field programmable gate array platform of the embodiment. Two DMAs read in parallel to guarantee high-throughput vector operations. The structure includes:
a vector A data read module 5100, equipped with a DMA and a FIFO buffer, 32 bits wide, which is also responsible for reading the configuration parameters;
a vector B data read module 5200, equipped with a DMA and a FIFO buffer, 32 bits wide;
a calculation module 5300, which performs the vector calculation selected by the configuration information: when the working mode is 0 it computes a*A + b*B, and when the working mode is 1 it computes (a*A + b*B) * B * (1 - B), where a and b are configuration parameters and A and B are the two vectors read in;
a result write-back module 5400, equipped with a DMA and a FIFO buffer, 32 bits wide, which writes the calculation results back to host memory.
This hardware structure supports parameter configuration and can support vector calculations of different scales. The detailed configuration parameters are:
Data_size: the scale of the input vector data;
a: a coefficient needed by the calculation;
b: a coefficient needed by the calculation;
Work_mode: 0 means a*A + b*B is computed; 1 means (a*A + b*B) * B * (1 - B) is computed (an illustrative sketch of the two modes follows).
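An illustrative sketch of the two working modes (our reading of the formulas above):

```python
import numpy as np

def weight_update_unit(a: float, b: float, A: np.ndarray, B: np.ndarray, work_mode: int):
    """Mode 0 gives a*A + b*B (the weight update itself); mode 1 gives
    (a*A + b*B) * B * (1 - B), the output-layer error with the sigmoid
    derivative B*(1-B) folded in."""
    s = a * A + b * B
    return s if work_mode == 0 else s * B * (1.0 - B)
```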
Fig. 16 is the calculation flow chart of the weight update hardware computation unit on the field programmable gate array platform of the embodiment.
Configuration information is first read from DMA A; then, according to the configured Data_size, the vector values are read from DMA A and DMA B respectively, multiplied in parallel by the configuration parameters a and b and summed, and finally, depending on the working mode, optionally multiplied by B * (1 - B); the result is written back to host memory through DMA A.
Fig. 17 is a schematic diagram of a possible application scenario and framework of the deep learning accelerator on the heterogeneous multi-core reconfigurable computing platform of the embodiment.
The composition of the application system here is illustrative, and the invention is not limited to it. When a user sends an application request to the system, the control node of the application system assigns the request through the scheduler to the corresponding computing node. The computing node offloads the acceleration task to the FPGA according to the specific application request.
The overall framework of each computing node consists of a hardware layer, a driver layer, a library layer, a service layer and an application layer. The hardware layer consists of the FPGA, the memory and the host CPU; the CPU, as the controller of the system, controls the running state and data reading of each hardware processing element inside the FPGA (denoted DL Module in the figure), including the forward calculation arithmetic units and the weight update units. The weight parameter data and neuron data required by the system computation are stored only in memory and are transferred between memory and the hardware processing elements by DMA. The driver layer is the hardware driver written for the hardware platform and operating system; the library layer is the application programming interface (API) encapsulated on top of the driver; the service layer provides the deep learning calculation acceleration services for user requests; the application layer refers to the concrete applications of the deep learning prediction and training algorithms, such as picture classification with a convolutional neural network prediction algorithm.
Those of ordinary skill in the art will appreciate that the methods and hardware structures described in the embodiments herein can be implemented in combinations of FPGA and CPU. The number and kinds of IP cores solidified inside a concrete FPGA depend on the specific application and the FPGA chip resource limits. Skilled persons may use different approaches or different degrees of parallelism to realize the described functions for each specific application or specific FPGA chip, but such implementations shall not be considered beyond the scope of the present invention.
In the several embodiments provided in this application, it should be understood that the disclosed methods and hardware structures may be realized in other ways. For example, the deep learning applications described above, deep neural networks and convolutional neural networks, are illustrative; the slice size and parallel granularity in the forward calculation arithmetic unit are illustrative and may be adjusted to the specific situation; and the use of the AXI bus protocol for data transmission between the field programmable gate array and the general-purpose processor is likewise illustrative.
The above embodiments only illustrate the technical concept and features of the present invention, and their purpose is to allow those skilled in the art to understand and implement the present invention accordingly; they do not limit the scope of protection of the present invention. All equivalent transformations or modifications made according to the spirit and essence of the present invention shall fall within the scope of protection of the present invention.
Claims (10)
Priority Applications (1)

CN201610596159.3A — filed 2016-07-27, priority date 2016-07-27 — "Method and system for accelerating deep learning algorithms on a field programmable gate array platform"

Publications (2)

CN106228238A — published 2016-12-14
CN106228238B — granted 2019-03-22

Family ID: 57534278