CN106228238B - Method and system for accelerating deep learning algorithms on a field programmable gate array platform - Google Patents

Method and system for accelerating deep learning algorithms on a field programmable gate array platform

Info

Publication number
CN106228238B
CN106228238B (Application CN201610596159.3A)
Authority
CN
China
Prior art keywords
data
hardware
module
dma
programmable gate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610596159.3A
Other languages
Chinese (zh)
Other versions
CN106228238A (en)
Inventor
周学海
王超
余奇
周徐达
赵洋洋
李曦
陈香兰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Institute for Advanced Study USTC
Original Assignee
Suzhou Institute for Advanced Study USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Institute for Advanced Study USTC filed Critical Suzhou Institute for Advanced Study USTC
Priority to CN201610596159.3A priority Critical patent/CN106228238B/en
Publication of CN106228238A publication Critical patent/CN106228238A/en
Application granted granted Critical
Publication of CN106228238B publication Critical patent/CN106228238B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a method for accelerating deep learning algorithms on a field programmable gate array (FPGA) platform, where the platform comprises a general-purpose processor, a field programmable gate array and a memory module. The method includes the following steps: analysing the deep learning prediction process and training process, together with deep neural networks and convolutional neural networks, to determine the general-purpose computation parts suitable for execution on the FPGA platform; determining the software-hardware co-computation scheme according to the identified general-purpose computation parts; and determining the number and types of IP cores to be solidified according to the logic resources and bandwidth of the FPGA, then performing the acceleration on the FPGA platform with the resulting hardware arithmetic units. Hardware processing units for accelerating deep learning algorithms can be designed quickly according to the available hardware resources, and these processing units offer high performance and low power consumption compared with a general-purpose processor.

Description

Method and system for accelerating deep learning algorithms on a field programmable gate array platform
Technical field
The present invention relates to the field of computer hardware acceleration, and in particular to a method and system for accelerating deep learning algorithms on a field programmable gate array platform.
Background art
Deep learning has achieved remarkable results on high-level abstract cognitive problems and has taken machine learning to a new stage. It has great scientific value as well as strong practical utility, and has attracted wide attention from both academia and industry. However, in order to solve more abstract and more complex learning problems, the network scale of deep learning keeps growing, and the complexity of its computation and data grows with it; the Google Cat network, for example, has around one billion neurons. Accelerating deep learning algorithms with high performance and low energy consumption has therefore become a research hotspot for both scientific and commercial institutions.
Computing tasks are usually divided into two kinds by the way they are expressed: on a general-purpose processor, a task is usually presented in the form of software code and is called a software task; on a dedicated hardware circuit, the inherent speed of hardware is exploited to replace the software task, which is then called a hardware task. Common hardware acceleration technologies include the application-specific integrated circuit (ASIC), the field programmable gate array (FPGA) and the graphics processing unit (GPU). An ASIC is an integrated circuit chip designed and developed for a specific purpose; it offers high performance, low power consumption and small area. Compared with an FPGA, an ASIC usually runs faster, consumes less power, and is cheaper in volume production. Although an FPGA uses more transistors than an ASIC for the same function, FPGA logic design is simpler and its design cycle is much shorter than that of an ASIC. In addition, the mask cost of ASIC production is very high and grows exponentially as the feature size shrinks; the FPGA, as a programmable standard component adaptable to different functions, does not incur such high development costs and retains a degree of flexibility. The GPU is suited to massively parallel computation over large data sets, with high bandwidth, high clock frequency and high concurrency, and with the CUDA (Compute Unified Device Architecture) general-purpose parallel computing framework developers can design high-performance solutions conveniently and quickly. However, GPU power consumption is high: a single GPU often consumes more power than a contemporary mainstream CPU, and typically tens or even hundreds of times the energy of an FPGA.
Summary of the invention
In view of this, the object of the present invention is to provide a method and system for accelerating deep learning algorithms on a field programmable gate array platform, which can quickly design hardware processing units for accelerating deep learning algorithms according to the available hardware resources; compared with a general-purpose processor, these processing units offer high performance and low power consumption.
The technical solution of the present invention is as follows:
A method for accelerating deep learning algorithms on a field programmable gate array platform, wherein the field programmable gate array platform comprises a general-purpose processor, a field programmable gate array and a memory module, comprising the following steps:
S01: analysing the deep learning prediction process and training process, together with deep neural networks and convolutional neural networks, to determine the general-purpose computation parts suitable for execution on the field programmable gate array platform;
S02: determining the software-hardware co-computation scheme according to the identified general-purpose computation parts;
S03: determining the number and types of IP cores to be solidified according to the logic resources and bandwidth of the FPGA, and performing the acceleration on the field programmable gate array platform with the hardware arithmetic units.
In a preferred technical solution, the general-purpose computation parts include a forward computation module, used for matrix multiplication and activation function computation, and a weight update module, used for vector computation.
In a preferred technical solution, step S02 comprises the following steps:
performing data preprocessing on the software side;
converting the convolution computation of the convolutional layers of the convolutional neural network into matrix multiplication;
using direct memory access as the data path of the software-hardware co-computation.
In a preferred technical solution, determining the number and types of IP cores to be solidified in step S03 comprises: determining the types of the arithmetic units to be solidified on the FPGA according to the hardware tasks to be executed; and determining the number of processing units for those hardware tasks according to the hardware logic resources and bandwidth of the FPGA.
In a preferred technical solution, the forward computation module adopts a tiled design: each row of the node matrix is split into tiles of the tile size, and each column of the weight parameter matrix is split into tiles of the tile size; each tile of a node-matrix row is dot-multiplied with the corresponding tile of a weight-matrix column, and after a whole row has been processed the intermediate values are accumulated to obtain the final result.
In a preferred technical solution, the tile size is a power of 2 and is consistent with the parallel granularity of the arithmetic unit.
The present invention further discloses an FPGA structure for accelerating deep learning algorithms, comprising:
a tiling structure, which splits the node data matrix and the weight parameter matrix of the forward computation module into tiles and time-multiplexes the hardware logic;
an activation function piecewise-linear approximation structure, for generating arbitrary activation functions;
a parameter configuration module, for configuring the parameters of the processing units;
a forward computation module, comprising a forward computation hardware structure with a single DMA caching the weights and a forward computation hardware structure with two DMAs reading in parallel, used for the forward computation of deep neural networks, the forward computation of the convolutional and classification layers of convolutional neural networks, and matrix multiplication, pipelined and optimised for maximum throughput;
a weight update module, used for vector computation.
In a preferred technical solution, the parameter configuration module configures the processing units by transferring configuration parameter data over DMA, including: the working mode configuration and data scale configuration of the forward computation module, where the data scale configuration includes the node data scale configuration, the input neuron scale configuration and the output neuron scale configuration; and the data scale configuration, working mode configuration and computation parameter configuration of the weight update module.
In a preferred technical solution, the forward computation hardware structure with a single DMA caching the weights includes:
a single DMA responsible for reading data and writing results back;
a pair of register buffers that alternately read data and perform parallel computation, and a BRAM group that caches data and guarantees parallel reads;
floating-point multipliers equal in number to the tile size;
a binary adder tree whose number of inputs equals the tile size;
a cyclic accumulator, whose accumulated intermediate values are stored in on-chip BRAM;
an activation function computation module that implements the activation function by piecewise linear approximation, with the coefficients cached in on-chip BRAM.
The forward computation hardware structure with two DMAs reading in parallel includes:
a neuron data read module with a DMA and a FIFO buffer, responsible for reading the input neuron node data;
a weight parameter data read module with a DMA and a FIFO buffer, responsible for reading the weight parameter data;
floating-point multipliers equal in number to the tile size;
a binary adder tree whose number of inputs equals the tile size;
a cyclic accumulator, whose accumulated intermediate values are stored in on-chip BRAM;
an activation function computation module that implements the activation function by piecewise linear approximation, with the coefficients cached in on-chip BRAM.
In a preferred technical solution, the weight update module is used for the weight update computation and the computation of the output layer error values, pipelined and optimised for maximum throughput, and comprises: a vector A data read module and a vector B data read module, each with a DMA and a FIFO buffer, which read the two groups of vector values used in the computation; a computation module, which performs the corresponding vector computation according to the configuration information; and a result write-back module with a DMA and a FIFO buffer, which writes the computation results back to host memory.
Compared with the prior art, the present invention has the following advantages:
The present invention can effectively accelerate deep learning algorithms, including the learning prediction process and the training process; hardware processing units for accelerating deep learning algorithms can be designed quickly according to the available hardware resources, and these processing units offer high performance and low power consumption compared with a general-purpose processor.
Description of the drawings
The invention will be further described with reference to the accompanying drawings and embodiments:
Fig. 1 is a flowchart of the method for accelerating deep learning on the field programmable gate array platform of an embodiment of the present invention;
Fig. 2 is a schematic diagram of the computation of a convolutional layer in a convolutional neural network;
Fig. 3 is a schematic diagram of how the forward computation hardware processing unit on the field programmable gate array platform of the embodiment converts the convolutional layer computation;
Fig. 4 is a schematic diagram of how the weight update processing unit on the field programmable gate array platform of the embodiment converts a data matrix into vectors;
Fig. 5 is a schematic structural diagram of the software-hardware co-computation on the field programmable gate array platform of the embodiment;
Fig. 6 is a schematic diagram of the hardware processing unit resource usage, the resources and usage of the field programmable gate array platform, and the number and types of solidified units in the embodiment;
Fig. 7 is a schematic diagram of the data tiling in the forward computation processing unit of the embodiment;
Fig. 8 is a schematic diagram of the piecewise linear implementation of the activation function in the embodiment;
Fig. 9 is a schematic structural diagram of the forward computation hardware processing unit with a single DMA and a pre-stored weight matrix in the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 10 is a schematic structural diagram of the accumulation processing in the forward computation hardware processing unit in the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 11 is a schematic structural diagram of the piecewise approximation of the sigmoid function in the forward computation hardware processing unit in the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 12 is a data processing flowchart of the forward computation hardware processing unit with a single DMA and a pre-stored weight matrix in the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 13 is a schematic structural diagram of the forward computation hardware processing unit with two DMAs reading data in parallel in the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 14 is a data processing flowchart of the forward computation hardware processing unit with two DMAs reading data in parallel in the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 15 is a schematic structural diagram of the weight update hardware processing unit in the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 16 is a data processing flowchart of the weight update hardware processing unit in the heterogeneous multi-core reconfigurable computing platform of the embodiment;
Fig. 17 is a schematic diagram of one possible application scenario and framework of the deep learning accelerator in the heterogeneous multi-core reconfigurable computing platform of the embodiment.
Specific embodiments
The above solution is further described below in conjunction with specific embodiments. It should be understood that these embodiments are intended to illustrate the present invention and not to limit its scope. The implementation conditions used in the embodiments can be further adjusted according to the conditions of a specific manufacturer; implementation conditions that are not specified are usually those of routine experiments.
Embodiment:
The field programmable gate array platform in the embodiment of the present invention refers to a computing system that integrates both a general purpose processor (GPP) and a field programmable gate array (FPGA) chip, in which the data path between the FPGA and the GPP may use the PCI-E bus protocol, the AXI bus protocol, etc. The data paths in the accompanying drawings are illustrated with the AXI bus protocol, but the present invention is not limited to this.
Fig. 1 is a flowchart of the method 100 by which the field programmable gate array platform of the embodiment of the present invention accelerates deep learning algorithms. The method 100 includes:
S110: analysing the deep learning prediction process and training process, where the training process includes a local pre-training process and a global training process, together with deep neural networks and convolutional neural networks, to determine the general-purpose computation parts suitable for execution on the field programmable gate array platform;
S120: determining the software-hardware co-computation scheme according to the identified common hardware computation modules;
S130: determining the number and types of IP cores to be solidified according to the logic resources and bandwidth of the field programmable gate array.
The method by which the embodiment of the present invention accelerates the general-purpose computation parts of deep learning is described in detail below in conjunction with Fig. 2 to Fig. 4.
Fig. 2 is a schematic diagram of the convolutional layer computation. Suppose the number of input feature maps is 4 and the convolution kernel size is 3x3; the results of the 4 convolutions are accumulated and then passed through the activation function to obtain the value of the output feature map. Seen from the overall structure of the computation, the basic computation pattern of a convolutional layer is similar to that of a hidden layer of a deep neural network: by rearranging the order of the convolution kernel parameters, the convolution computation used here can be turned into a dot-product computation. The specific rearrangement is as follows: 1) the input feature maps are filled row by row, from top to bottom, into a single row, as shown in the left row of Fig. 3; 2) each convolution kernel matrix is rotated counter-clockwise by 180 degrees and then written row by row, from top to bottom, into one column of the weight matrix, as shown in the middle column of Fig. 3: after the original convolution kernels a to d are each rotated by 180 degrees they become a9~a1, b9~b1, ..., d9~d1 and are filled in order into one column. Thus, for the convolutional layer prediction process, the basic computation can be converted into the same pattern as a hidden layer of a deep neural network, namely matrix multiplication followed by activation function processing, at the extra cost of the data conversion.
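As an illustration of this rearrangement, the following sketch (assuming, for simplicity, one 3x3 patch per input feature map, i.e. a single output position) flattens the patches into one row and the 180-degree-rotated kernels into one weight column, so that the convolution reduces to a dot product:

```python
import numpy as np

def conv_patch_as_dot(patches, kernels):
    """patches: one 3x3 input patch per feature map; kernels: one 3x3 kernel per map.
    Returns one output value, the sum over all maps of the convolution of patch and
    kernel, computed as a single dot product."""
    # fill the input patches row by row into a single row vector (left row of Fig. 3)
    row = np.concatenate([p.ravel() for p in patches])
    # rotate each kernel by 180 degrees and stack them into one weight column (middle of Fig. 3)
    col = np.concatenate([np.rot90(k, 2).ravel() for k in kernels])
    return float(np.dot(row, col))   # the dot product replaces the convolution

# tiny check against the direct definition, assuming 4 input feature maps as in Fig. 2
rng = np.random.default_rng(0)
patches = [rng.standard_normal((3, 3)) for _ in range(4)]
kernels = [rng.standard_normal((3, 3)) for _ in range(4)]
direct = sum((p * np.rot90(k, 2)).sum() for p, k in zip(patches, kernels))
assert np.isclose(conv_patch_as_dot(patches, kernels), direct)
```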
In the deep learning training process, besides a large amount of matrix multiplication, a large amount of vector computation is also required; when performing the vector computation, the matrix data needs to be converted into vector data. As shown in Fig. 4, each row of the data is concatenated in order to form one vector and the vector computation is then performed.
Therefore, in conjunction with Fig. 2 to Fig. 4, this embodiment reduces the general-purpose computation parts of the deep learning prediction process and training process to matrix multiplication, activation function computation and a large amount of vector computation.
Fig. 5 shows the structural framework 200 of the software-hardware co-computation used in this embodiment. The structure includes:
a Processing System (PS) 210, including the CPU and memory, which serves as the control side of the whole system. The CPU acts as the host, runs the software-side code and offloads the tasks to be accelerated to the PL side. In addition, the CPU controls the working state of each IP core on the PL side (an intellectual property core, here representing a hardware arithmetic unit) and reads its data;
Programmable Logic (PL) 220, the FPGA chip that is the hardware acceleration component of the whole system. IP cores can be solidified on the FPGA chip according to the different acceleration tasks so as to accelerate the algorithms. The PS side selects different IP cores for parallel computation according to the scheduling of the specific algorithm, and the host-side software tasks can also be computed in parallel with the FPGA-side hardware tasks;
a data bus (Data Bus) 230, responsible for the data transfer between the PS side and the PL side of the whole system;
a control signal bus (Control Bus) 240, responsible for the transfer of control signals between the PS side and the PL side of the whole system.
Fig. 6 shows the overall accelerator structure 2000 based on the FPGA design. The structure includes:
a system controller 2100, responsible for controlling the execution state of each hardware arithmetic unit, the data transfers and the program scheduling, as well as for running the non-general-purpose computation parts of deep learning, the data initialisation and the initialisation of the hardware arithmetic units (also called IP cores);
a memory 2200, responsible for storing the deep learning network parameters and the original input data; the physical addresses of the data stored here are required to be contiguous so that the DMA can transfer the data conveniently;
a data bus protocol 2300: the AXI-Stream protocol allows unrestricted data burst transfers and is a high-performance data transfer protocol;
a control bus protocol 2400: AXI-Lite is a lightweight, address-mapped, single-transfer protocol, suitable for transferring the control signals of the hardware arithmetic units;
a data interconnect 2500, the interconnection of the data paths;
a control interconnect 2600, the interconnection of the control signal lines;
a direct memory access DMA 2700, responsible for the data transfer between the accelerator and the memory; each hardware processing unit is equipped with a DMA so that it can read data in parallel;
a PE (Processing Element) 2800, the computing unit of each accelerator, which may internally solidify one forward computation arithmetic unit, one weight update arithmetic unit, or both. Since the FPGA is programmable and reconfigurable, the number of PEs can be configured dynamically according to the resource and bandwidth situation of the specific FPGA chip; in this way the computing resources of the hardware can be fully utilised without changing the hardware design of the arithmetic units, and the hardware can deliver its peak performance.
The method by which the embodiment of the present invention accelerates deep learning algorithms has been described in detail above in conjunction with Fig. 1 to Fig. 6; the hardware structures of the embodiment are introduced below.
Fig. 7 shows the design of the forward computation arithmetic unit using a tiled computation scheme. Suppose the tile size is 16: each row of the node matrix is split into tiles of 16 elements, and each column of the weight parameter matrix is split into tiles of 16 elements. Every 16 values of a node-matrix row are dot-multiplied with the corresponding 16 values of a weight-matrix column, and after a whole row has been processed these intermediate values are accumulated to obtain the final result. This approach not only makes full use of data locality, but also reduces the resources needed to solidify the parallel execution units and the data bandwidth required by the hardware, allowing a single arithmetic unit to perform matrix multiplication of arbitrary scale.
In order to maintain high throughput, the tile size should match the internal design of the arithmetic unit and be consistent with its parallel granularity; in matrix multiplication the tile size can be set to a power of 2, so as to exploit fully the accumulation performance of the binary tree. Since the tile size is related to the parallel granularity, in theory the larger the tile, the higher the parallelism and the better the performance of the arithmetic unit, so the largest power of 2 permitted by the hardware resources and bandwidth is chosen as the tile size of the arithmetic unit.
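The tiled matrix multiplication can be sketched in software as follows (a simplified reference model assuming a tile size of 16 and zero-filling of the last partial tile, mirroring the hardware behaviour described below; it is not a description of the RTL itself):

```python
import numpy as np

TILE = 16  # tile size, matching the parallel granularity of the arithmetic unit

def tiled_matmul(nodes, weights):
    """nodes: (rows, n_in) node matrix; weights: (n_in, n_out) weight parameter matrix.
    Each row and column is processed TILE elements at a time; partial tiles are
    padded with zeros."""
    rows, n_in = nodes.shape
    _, n_out = weights.shape
    out = np.zeros((rows, n_out))
    n_tiles = (n_in + TILE - 1) // TILE
    for r in range(rows):
        for c in range(n_out):
            acc = 0.0
            for t in range(n_tiles):
                a = nodes[r, t * TILE:(t + 1) * TILE]
                b = weights[t * TILE:(t + 1) * TILE, c]
                a = np.pad(a, (0, TILE - len(a)))   # zero-fill the last tile
                b = np.pad(b, (0, TILE - len(b)))
                acc += np.dot(a, b)                 # 16-wide multiply plus adder tree
            out[r, c] = acc                         # accumulated intermediate values
    return out

# quick check against the reference result
x = np.random.rand(3, 37); w = np.random.rand(37, 5)
assert np.allclose(tiled_matmul(x, w), x @ w)
```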
Fig. 8 is a schematic diagram of the hardware implementation of the activation function in this embodiment. This embodiment uses piecewise linear approximation to implement S-shaped activation functions: the function is divided into several equal intervals along the X axis, and within each interval it is approximated linearly as Y = a_i*X + b_i, X ∈ [x_i, x_{i+1}), where x_{i+1} - x_i is the size of the approximation interval. Whenever the activation function needs to be computed, the interval in which the X value falls is found first, the offsets of the corresponding a_i and b_i relative to the base address are computed, and one multiply-add operation then yields the approximate Y value. This implementation has two benefits: 1) any S-shaped activation function or linear function can be realised without changing any hardware design; only the stored values of the coefficients a and b need to be replaced; 2) the error is very small: as the approximation interval is reduced, the error becomes negligible, and the only cost is the extra BRAM for storing the coefficients a and b. Moreover, deep learning itself does not have very high requirements on data accuracy; in other words, a certain degree of precision loss does not affect the result.
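A software sketch of this table-driven approximation is given below (the approximation range and interval width are illustrative assumptions, not values fixed by the embodiment); note that swapping the coefficient table is all that is needed to change the activation function:

```python
import numpy as np

X_MIN, X_MAX, STEP = -8.0, 8.0, 0.25              # assumed approximation range and interval

def fit_coefficients(f):
    """Fit one (a_i, b_i) pair per interval so that Y = a_i*X + b_i matches f at the
    interval endpoints; replacing the table changes the activation function."""
    xs = np.arange(X_MIN, X_MAX, STEP)
    a = (f(xs + STEP) - f(xs)) / STEP
    b = f(xs) - a * xs
    return a, b

def pwl_eval(x, a, b):
    """Find the interval from X, read a_i and b_i at that offset, do one multiply-add."""
    i = int((np.clip(x, X_MIN, X_MAX - 1e-9) - X_MIN) // STEP)
    return float(a[i] * x + b[i])

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
a, b = fit_coefficients(sigmoid)                  # same hardware, sigmoid coefficients
print(pwl_eval(0.7, a, b), sigmoid(0.7))          # agree to within the interval error
a, b = fit_coefficients(np.tanh)                  # replace the table: now it computes tanh
print(pwl_eval(0.7, a, b), np.tanh(0.7))
```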
Fig. 9 is a schematic block diagram 3000 of the hardware structure with a single DMA and a pre-stored weight matrix on the field programmable gate array platform of the embodiment of the present invention. This structure is used when the BRAM resources inside the FPGA are relatively abundant: the weight matrix data is cached in advance in on-chip BRAM and the forward computation is performed from there. The structure includes:
a data read module 3100, with a DMA and a FIFO buffer and a data width of 32 bits, responsible for reading the weight parameters into the on-chip BRAM cache and for reading the neuron node data;
on-chip BRAM 3200, which caches the weight parameter data. Taking a tile size of 16 as an example, the rows of the weight matrix are stored cyclically across 16 different BRAMs, i.e. i%16 plus the BRAM base address is used as the addressing scheme, so as to guarantee that data is read from different BRAMs in parallel when the 16 parallel multiplications are performed;
a double register buffer 3300, in which each register group contains 16 registers storing the input neuron data; the two groups alternate between caching data and performing parallel computation. Note, however, that the time needed to fill a buffer must be less than the time needed to compute on its data; only then is the buffer-filling time covered by the computation time and the correctness of the result guaranteed;
a parallel floating-point multiplier 3400, which multiplies the weight parameter data and the neuron data in parallel. The floating-point computation is implemented with DSPs, and after pipeline optimisation 16 floating-point multiplications can be processed in parallel per clock cycle (the tile size here is 16). Since the number of input neurons is not necessarily divisible by 16, the last tile of a dot-product computation may contain fewer than 16 values, in which case the arithmetic unit fills the missing part with zeros before performing the parallel multiplication;
a binary floating-point adder tree 3500, which accumulates the floating-point results produced by the parallel floating-point multiplier 3400. Using a binary adder tree for the parallel computation removes the read-write dependences of the accumulation and reduces the time complexity of the accumulation from O(n) to O(log n) (a software sketch of this tree reduction is given after this module list);
an accumulation unit 3600: since the forward computation processing unit uses tiled computation, the results produced by the binary floating-point adder tree 3500 need to be accumulated, and the accumulation is cyclic with a period equal to the number of output neurons;
an activation function unit 3700, which implements the activation function by piecewise linear approximation, with the coefficients cached in on-chip BRAM;
a data write-back module 3800, with a DMA and a FIFO buffer and a data width of 32 bits, responsible for writing the computed results back to host memory.
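The O(log n) reduction performed by the binary adder tree (module 3500 above) can be sketched as follows (a behavioural model assuming a power-of-two number of inputs; the real tree is a pipelined hardware structure):

```python
def adder_tree(values):
    """Sum a list whose length is a power of two in log2(n) levels,
    pairing neighbours at each level as the hardware tree does."""
    level = list(values)
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

assert adder_tree([1, 2, 3, 4, 5, 6, 7, 8]) == 36   # 3 levels instead of 7 serial additions
```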
This hardware structure supports parameter configuration and can therefore support neural network computation of different scales. The configuration parameters are:
Data_size: the scale of the input neuron data;
Input_size: the number of input neurons; since the weight matrix data is cached in advance, this must be less than the maximum number of input neurons Max_input for which the on-chip BRAM can cache the weight parameters;
Output_size: the number of output neurons; since the weight matrix data is cached in advance, this must be less than the maximum number of output neurons Max_output for which the on-chip BRAM can cache the weight parameters;
Work_mode: 0 means that only matrix multiplication is performed; 1 means that matrix multiplication and the activation function are performed.
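On the host side such a configuration might be assembled and checked before being streamed to the unit over DMA, as in the following sketch (the field packing order and the Max_input/Max_output values are illustrative assumptions):

```python
from dataclasses import dataclass

MAX_INPUT, MAX_OUTPUT = 1024, 1024   # assumed BRAM-imposed limits of this unit

@dataclass
class ForwardConfig:
    data_size: int      # scale of the input neuron data
    input_size: int     # number of input neurons
    output_size: int    # number of output neurons
    work_mode: int      # 0: matrix multiply only, 1: multiply + activation

    def to_words(self):
        """Validate against the cached-weight limits and pack as 32-bit words for DMA."""
        assert self.input_size <= MAX_INPUT and self.output_size <= MAX_OUTPUT
        assert self.work_mode in (0, 1)
        return [self.data_size, self.input_size, self.output_size, self.work_mode]

cfg = ForwardConfig(data_size=256, input_size=784, output_size=100, work_mode=1)
words = cfg.to_words()   # stream these words to the unit before the data
```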
Fig. 10 is a schematic diagram 3600 of the hardware structure that performs the accumulation on the field programmable gate array platform of the embodiment of the present invention. The structure includes:
a floating-point addition unit 3610: because of the tiled scheme, the intermediate values produced by the dot products need to be accumulated. The stream of intermediate values is accumulated with a period of N, the number of output neurons (i.e. the number of columns of the second matrix), and the results are output in order once the accumulation is complete;
an intermediate value BRAM 3620: N storage units are provided inside the FPGA for the temporary data; the streamed data is added cyclically into the corresponding BRAM storage unit, and whether the accumulation has finished is judged from the relationship between the number of input neurons and the tile size. Since the number of stored intermediate values cannot be changed dynamically once the FPGA is designed, a maximum supported accumulation count MAX is fixed when the arithmetic unit is designed; the accumulation works normally only when the number of output neurons is below MAX.
This process is likewise pipeline-optimised, with the initiation interval optimised to 1 clock cycle, so that the rate at which intermediate values are produced matches the rate at which they are processed.
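The cyclic accumulation over the N output slots can be modelled as follows (a behavioural sketch; the slots stand in for the BRAM storage units and the input stream for the adder-tree outputs, ordered with period N as described above):

```python
def cyclic_accumulate(partial_sums, n_outputs, n_tiles):
    """partial_sums: stream of adder-tree results, one per (output, tile) pair,
    produced with period n_outputs. Returns the n_outputs accumulated results."""
    slots = [0.0] * n_outputs                          # the N BRAM storage units
    for i, v in enumerate(partial_sums):
        slots[i % n_outputs] += v                      # add into the slot with period N
    assert len(partial_sums) == n_outputs * n_tiles    # accumulation has finished
    return slots

# e.g. 3 output neurons, 2 tiles per dot product
print(cyclic_accumulate([1, 2, 3, 10, 20, 30], n_outputs=3, n_tiles=2))  # [11, 22, 33]
```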
Fig. 11 shows the hardware structure 3700 that implements the activation function by piecewise linear approximation on the field programmable gate array platform of the embodiment of the present invention.
The activation function is implemented by piecewise linear approximation, with the details shown in Fig. 11. Unlike Fig. 8, a path is added that passes X directly through to Y, allowing the forward computation arithmetic unit to perform only the matrix multiplication without the activation function processing; this is mainly used for the matrix multiplications in the error value computation of the training process. Since S-shaped activation functions are essentially symmetric about a point (taking the sigmoid function as an example, it is symmetric about (0, 0.5)), the value for x less than 0 is computed as 1 - f(-x), so that the hardware logic can be reused and the use of hardware resources reduced. Furthermore, when x equals 8, f(x) equals 0.999665 and is already extremely close to 1, so when x is greater than 8 the result is directly assigned the value 1.
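Combining the table lookup of Fig. 8 with the symmetry and saturation handling of Fig. 11 gives roughly the following reference behaviour (the interval width of 0.25 is an assumed value; the saturation threshold of 8 follows the text):

```python
import numpy as np

STEP = 0.25
xs = np.arange(0.0, 8.0, STEP)                # positive half only, thanks to the symmetry trick
sig = lambda x: 1.0 / (1.0 + np.exp(-x))
a = (sig(xs + STEP) - sig(xs)) / STEP         # per-interval slope a_i
b = sig(xs) - a * xs                          # per-interval intercept b_i

def sigmoid_hw(x):
    """Sigmoid as the unit evaluates it: saturation, symmetry reuse, then table lookup."""
    if x >= 8.0:                              # f(8) = 0.999665, so assign 1 from 8 upward
        return 1.0
    if x < 0.0:                               # sigmoid(x) = 1 - sigmoid(-x): reuse the logic
        return 1.0 - sigmoid_hw(-x)
    i = int(x // STEP)                        # interval index into the coefficient BRAM
    return float(a[i] * x + b[i])

def forward_output(y, work_mode):
    """Work_mode 0 passes the value straight through; 1 applies the activation (Fig. 11)."""
    return y if work_mode == 0 else sigmoid_hw(y)
```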
Fig. 12 is the computation flowchart of the forward computation hardware arithmetic unit with a single DMA and pre-stored weight parameters on the field programmable gate array platform of the embodiment of the present invention.
The configuration data is first read from the DMA, and the node data is then read according to the configuration information. When the node data is read, register group a is filled first and the flag is set to 0; afterwards, depending on the value of flag%2, the input node data alternately fills register group a or register group b. Likewise, depending on the value of flag%2, the data in one register group and the weight data cached in BRAM are multiplied in parallel, then summed by the binary adder tree and accumulated. After the accumulation, the result is either passed through the activation function or output directly, depending on the working mode.
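The alternation of the two register groups in this flow can be pictured with the following sketch (a functional model only; in hardware the refilling of one group overlaps the computation on the other):

```python
import numpy as np

def single_dma_forward(node_tiles, weight_tiles):
    """Behavioural model of the Fig. 12 flow: node tiles alternately fill register group a
    (flag even) and register group b (flag odd); the group just filled is multiplied against
    the BRAM-cached weight tile and summed by the adder tree."""
    groups = {0: None, 1: None}              # register group a / register group b
    partial_sums = []
    for flag, (n_tile, w_tile) in enumerate(zip(node_tiles, weight_tiles)):
        groups[flag % 2] = n_tile            # refill one group while the other computes (in HW)
        partial_sums.append(float(np.dot(groups[flag % 2], w_tile)))  # 16 multiplies + adder tree
    return partial_sums

tiles = [np.random.rand(16) for _ in range(4)]
weights = [np.random.rand(16) for _ in range(4)]
print(single_dma_forward(tiles, weights))    # partial sums, ready for the cyclic accumulation
```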
Fig. 13 is the structural schematic diagram 4000 of the forward computation hardware arithmetic unit with two DMAs reading in parallel on the field programmable gate array platform of the embodiment of the present invention. This hardware structure targets high-bandwidth FPGA chips and uses two DMAs reading in parallel to guarantee high throughput. With a tile size of 16, the structure includes:
a neuron data read module 4100, with a DMA and a FIFO buffer and a data width of 512 bits, responsible for reading the input neuron node data; 16 32-bit single-precision floating-point values are obtained by shifting. Since the transfer width is 512 bits, the data must be address-aligned in host memory. Furthermore, when the number of input neurons is not divisible by 16, the neuron node data matrix must be zero-padded on the host side: 16 - Input_size%16 zeros are appended to the end of each row, where Input_size is the number of input neurons; no padding is needed when Input_size%16 equals 0. Each data item is reused Output_size times here, where Output_size is the number of output neurons;
a weight parameter data read module 4200, with a DMA and a FIFO buffer and a data width of 512 bits, responsible for reading the weight parameter data; 16 32-bit single-precision floating-point values are obtained by shifting. Again, because the transfer width is 512 bits, the data must be address-aligned in host memory. When the number of input neurons is not divisible by 16, the weight parameter data matrix must also be zero-padded on the host side: 16 - Input_size%16 zeros are appended to the end of each column; likewise no padding is needed when Input_size%16 equals 0. After padding, since DMA transfers require contiguous physical addresses, the storage layout of the weight parameter matrix needs to be rearranged to suit the DMA transfer (the host-side padding for both read modules is sketched after this module list);
a parallel floating-point multiplier 4300, which multiplies the weight parameter data and the neuron data in parallel; the floating-point computation is implemented with DSPs, and after pipeline optimisation 16 floating-point multiplications can be processed in parallel per clock cycle;
a binary floating-point adder tree 4400, which accumulates the floating-point results produced by the parallel floating-point multiplier 4300; using a binary adder tree for the parallel computation removes the read-write dependences of the accumulation and reduces the time complexity of the accumulation from O(n) to O(log n);
an accumulation unit 4500: since the forward computation processing unit uses tiled computation, the results produced by the binary floating-point adder tree 4400 need to be accumulated cyclically with a period equal to the number of output neurons; this structure is identical to structure 3600 and is not described further;
an activation function unit 4600, which implements the activation function by piecewise linear approximation, with the coefficients cached in on-chip BRAM; this structure is identical to structure 3700 and is not described further;
a data write-back module 4700, with a DMA and a FIFO buffer and a data width of 32 bits, responsible for writing the computed results back to host memory.
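The host-side padding and layout adjustment described for the two read modules above can be sketched as follows (the column-major layout chosen for the padded weight matrix is an assumption; the requirement stated above is only that the DMA sees contiguous physical addresses):

```python
import numpy as np

TILE = 16

def pad_nodes(nodes):
    """Append 16 - Input_size%16 zeros to the end of every row of the node matrix."""
    pad = (-nodes.shape[1]) % TILE
    return np.pad(nodes, ((0, 0), (0, pad)))

def pad_and_layout_weights(weights):
    """Append zeros to the end of every column, then store the matrix column by column
    so that each weight column is contiguous for the DMA transfer."""
    pad = (-weights.shape[0]) % TILE
    padded = np.pad(weights, ((0, pad), (0, 0)))
    return np.ascontiguousarray(padded.T).ravel()   # one contiguous buffer, column-major

nodes = np.random.rand(4, 20)                 # Input_size = 20 -> 12 zeros appended per row
weights = np.random.rand(20, 5)
print(pad_nodes(nodes).shape)                 # (4, 32)
print(pad_and_layout_weights(weights).shape)  # (160,) = 32 padded rows * 5 columns
```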
This hardware structure supports parameter configuration and can therefore support neural network computation of different scales. The configuration parameters are:
Data_size: the scale of the input neuron data;
Input_size: the number of input neurons;
Output_size: the number of output neurons;
Work_mode: 0 means that only matrix multiplication is performed; 1 means that matrix multiplication and the activation function are performed.
Fig. 14 is the computation flowchart of the forward computation hardware arithmetic unit with two DMAs reading in parallel on the field programmable gate array platform of the embodiment of the present invention.
The configuration information is first read from the node DMA, configuring the scale of the node data and weight data to be read by the arithmetic unit as well as the working mode. Then 512-bit words are read from the node DMA and the weight DMA respectively, and 16 neuron node values and 16 weight parameter values are obtained by parallel shifting. Because the accelerator reuses the node data, one node data item is read every Output_size clock cycles, while one weight parameter data item is read every clock cycle. After the data is read, the 16 parallel multiplications and the 16-input binary adder tree summation are performed in turn. The summed results are added cyclically into the designated BRAM storage locations, and whether the accumulation has finished is judged. After the accumulation, the result is either output directly or passed through the piecewise-approximated activation function, depending on the working mode.
Fig. 15 is the hardware structural schematic diagram 5000 of the weight update hardware arithmetic unit on the field programmable gate array platform of the embodiment of the present invention. Two DMAs read in parallel to guarantee high throughput for the vector computation. The structure includes:
a vector A data read module 5100, with a DMA and a FIFO buffer and a width of 32 bits, which is also responsible for reading the configuration parameters;
a vector B data read module 5200, with a DMA and a FIFO buffer and a width of 32 bits;
a computation module 5300, which performs the corresponding vector computation according to the configuration information: when the working mode is 0, a*A + b*B is computed; when the working mode is 1, (a*A + b*B)*B*(1-B) is computed, where a and b are configuration parameters and A and B are the two vectors read in;
a result write-back module 5400, with a DMA and a FIFO buffer and a width of 32 bits, which writes the computation results back to host memory.
This hardware structure supports parameter configuration and can support vector computation of different scales. The configuration parameters are:
Data_size: the scale of the input vector data;
a: a coefficient value required by the computation;
b: a coefficient value required by the computation;
Work_mode: 0 means that a*A + b*B is computed; 1 means that (a*A + b*B)*B*(1-B) is computed.
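A reference model of the two working modes might read as follows (a behavioural sketch only; in the unit the vector elements stream through the DMAs and FIFOs rather than arriving as whole arrays, and the use cases named in the comments are illustrative):

```python
import numpy as np

def weight_update_unit(A, B, a, b, work_mode):
    """Mode 0: a*A + b*B  (e.g. a weight update from weights and gradients).
    Mode 1: (a*A + b*B) * B * (1 - B)  (e.g. an error term scaled by the sigmoid derivative)."""
    y = a * A + b * B
    if work_mode == 1:
        y = y * B * (1.0 - B)
    return y

W = np.random.rand(8); dW = np.random.rand(8)
new_W = weight_update_unit(W, dW, a=1.0, b=-0.01, work_mode=0)   # simple gradient step
```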
Fig. 16 is the computation flowchart of the weight update hardware arithmetic unit on the field programmable gate array platform of the embodiment of the present invention.
The configuration information is first read from DMA A; then, according to the configuration parameter Data_size, the vector values are read from DMA A and DMA B respectively, multiplied in parallel by the configuration parameters a and b and summed; finally, depending on the working mode, the result is optionally multiplied by B*(1-B) and written back to host memory through DMA A.
Fig. 17 shows one possible application scenario and framework of the deep learning accelerator in the heterogeneous multi-core reconfigurable computing platform of the embodiment of the present invention.
The composition of the application system here is illustrative, and the present invention is not limited to it. When a user sends an application request to the system, the control node of the application system assigns the request to the corresponding compute node through the scheduler; the compute node then offloads the tasks to be accelerated to the FPGA according to the specific application request.
The overall framework of each compute node consists of a hardware layer, a driver layer, a library layer, a service layer and an application layer. The hardware layer consists of the FPGA, the memory and the host-side CPU; the CPU acts as the controller of the system and controls the operating state and data reading of each hardware processing unit inside the FPGA (referred to as DL Modules in the figure), including the forward computation arithmetic units and the weight update units. The weight parameter data and neuron data required by the system are stored only in memory and are transferred between the memory and the hardware processing units by DMA. The driver layer is the hardware driver written for the hardware platform and operating system; the library layer is the application programming interface (API) encapsulated on top of the driver; the service layer provides the deep-learning-related computation acceleration services requested by users; and the application layer refers to the concrete applications of the deep learning prediction and training algorithms, such as image classification with a convolutional neural network prediction algorithm.
Those of ordinary skill in the art will appreciate that the methods and hardware structures described in conjunction with the embodiments disclosed herein can be realised with a combination of FPGA and CPU. The number and types of IP cores solidified inside a specific FPGA depend on the concrete application and on the resource constraints of the FPGA chip. Skilled practitioners may use different approaches or different degrees of parallelism to realise the described functions for each specific application or specific FPGA chip, but such implementations should not be considered as going beyond the scope of the present invention.
In the several embodiments provided in this application, it should be understood that the disclosed methods and hardware structures can be realised in other ways. For example, the deep learning applications described above, deep neural networks and convolutional neural networks, are illustrative; the tile size and parallel granularity in the forward computation arithmetic unit are likewise illustrative and can be adjusted to the specific situation; and the use of the AXI bus protocol for data transfer between the field programmable gate array and the general-purpose processor is also illustrative.
The above examples merely illustrate the technical concept and features of the present invention; their purpose is to enable persons skilled in the art to understand the content of the present invention and implement it accordingly, and they are not intended to limit the scope of protection of the present invention. All equivalent transformations or modifications made according to the spirit of the present invention shall be covered by the scope of protection of the present invention.

Claims (9)

1. A method for accelerating deep learning algorithms on a field programmable gate array platform, wherein the field programmable gate array platform comprises a general-purpose processor, a field programmable gate array and a memory module, comprising the following steps:
S01: analysing the deep learning prediction process and training process, together with deep neural networks and convolutional neural networks, to determine the general-purpose computation parts suitable for execution on the field programmable gate array platform;
S02: determining the software-hardware co-computation scheme according to the identified general-purpose computation parts;
S03: determining the number and types of IP cores to be solidified according to the logic resources and bandwidth of the FPGA, and performing the acceleration on the field programmable gate array platform with the hardware arithmetic units;
wherein the general-purpose computation parts include a forward computation module, the forward computation module comprising a forward computation hardware structure with a single DMA caching the weights and a forward computation hardware structure with two DMAs reading in parallel; the forward computation hardware structure with the single DMA caching the weights includes:
a single DMA responsible for reading data and writing results back;
a pair of register buffers that alternately read data and perform parallel computation, and a BRAM group that caches data and guarantees parallel reads;
floating-point multipliers equal in number to the tile size;
a binary adder tree whose number of inputs equals the tile size;
a cyclic accumulator, whose accumulated intermediate values are stored in on-chip BRAM;
an activation function computation module that implements the activation function by piecewise linear approximation, with the coefficients cached in on-chip BRAM;
the forward computation hardware structure with two DMAs reading in parallel includes:
a neuron data read module with a DMA and a FIFO buffer, responsible for reading the input neuron node data;
a weight parameter data read module with a DMA and a FIFO buffer, responsible for reading the weight parameter data;
floating-point multipliers equal in number to the tile size;
a binary adder tree whose number of inputs equals the tile size;
a cyclic accumulator, whose accumulated intermediate values are stored in on-chip BRAM;
an activation function computation module that implements the activation function by piecewise linear approximation, with the coefficients cached in on-chip BRAM.
2. The method for accelerating deep learning algorithms on a field programmable gate array platform according to claim 1, wherein the forward computation module is used for matrix multiplication and activation function computation, and a weight update module is used for vector computation.
3. The method for accelerating deep learning algorithms on a field programmable gate array platform according to claim 1, wherein step S02 comprises the following steps:
performing data preprocessing on the software side;
converting the convolution computation of the convolutional layers of the convolutional neural network into matrix multiplication;
using direct memory access as the data path of the software-hardware co-computation.
4. The method for accelerating deep learning algorithms on a field programmable gate array platform according to claim 1, wherein determining the number and types of IP cores to be solidified in step S03 comprises: determining the types of the arithmetic units to be solidified on the FPGA according to the hardware tasks to be executed; and determining the number of processing units for those hardware tasks according to the hardware logic resources and bandwidth of the FPGA.
5. The method for accelerating deep learning algorithms on a field programmable gate array platform according to claim 2, wherein the forward computation module adopts a tiled design: each row of the node matrix is split into tiles of the tile size and each column of the weight parameter matrix is split into tiles of the tile size; each tile of a node-matrix row is dot-multiplied with the corresponding tile of a weight-matrix column, and after a whole row has been processed the intermediate values are accumulated to obtain the final result.
6. The method for accelerating deep learning algorithms on a field programmable gate array platform according to claim 5, wherein the tile size is a power of 2 and is consistent with the parallel granularity of the arithmetic unit.
7. An FPGA structure for accelerating deep learning algorithms, comprising:
a tiling structure, which splits the node data matrix and the weight parameter matrix of the forward computation module into tiles and time-multiplexes the hardware logic;
an activation function piecewise-linear approximation structure, for generating arbitrary activation functions;
a parameter configuration module, for configuring the parameters of the processing units;
a forward computation module, comprising a forward computation hardware structure with a single DMA caching the weights and a forward computation hardware structure with two DMAs reading in parallel, used for the forward computation of deep neural networks, the forward computation of the convolutional and classification layers of convolutional neural networks, and matrix multiplication, pipelined and optimised for maximum throughput;
the forward computation hardware structure with the single DMA caching the weights comprising:
a single DMA responsible for reading data and writing results back;
a pair of register buffers that alternately read data and perform parallel computation, and a BRAM group that caches data and guarantees parallel reads;
floating-point multipliers equal in number to the tile size;
a binary adder tree whose number of inputs equals the tile size;
a cyclic accumulator, whose accumulated intermediate values are stored in on-chip BRAM;
an activation function computation module that implements the activation function by piecewise linear approximation, with the coefficients cached in on-chip BRAM;
the forward computation hardware structure with two DMAs reading in parallel comprising:
a neuron data read module with a DMA and a FIFO buffer, responsible for reading the input neuron node data;
a weight parameter data read module with a DMA and a FIFO buffer, responsible for reading the weight parameter data;
floating-point multipliers equal in number to the tile size;
a binary adder tree whose number of inputs equals the tile size;
a cyclic accumulator, whose accumulated intermediate values are stored in on-chip BRAM;
an activation function computation module that implements the activation function by piecewise linear approximation, with the coefficients cached in on-chip BRAM;
and a weight update module, used for vector computation.
8. The FPGA structure for accelerating deep learning algorithms according to claim 7, wherein the parameter configuration module configures the processing units by transferring configuration parameter data over DMA, including: the working mode configuration and data scale configuration of the forward computation module, where the data scale configuration includes the node data scale configuration, the input neuron scale configuration and the output neuron scale configuration; and the data scale configuration, working mode configuration and computation parameter configuration of the weight update module.
9. The FPGA structure for accelerating deep learning algorithms according to claim 7, wherein the weight update module is used for the weight update computation and the computation of the output layer error values, pipelined and optimised for maximum throughput, and comprises: a vector A data read module and a vector B data read module, each with a DMA and a FIFO buffer, which read the two groups of vector values used in the computation; a computation module, which performs the corresponding vector computation according to the configuration information; and a result write-back module with a DMA and a FIFO buffer, which writes the computation results back to host memory.
CN201610596159.3A 2016-07-27 2016-07-27 Method and system for accelerating deep learning algorithms on a field programmable gate array platform Active CN106228238B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610596159.3A CN106228238B (en) 2016-07-27 2016-07-27 Method and system for accelerating deep learning algorithms on a field programmable gate array platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610596159.3A CN106228238B (en) 2016-07-27 2016-07-27 Method and system for accelerating deep learning algorithms on a field programmable gate array platform

Publications (2)

Publication Number Publication Date
CN106228238A CN106228238A (en) 2016-12-14
CN106228238B true CN106228238B (en) 2019-03-22

Family

ID=57534278

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610596159.3A Active CN106228238B (en) Method and system for accelerating deep learning algorithms on a field programmable gate array platform

Country Status (1)

Country Link
CN (1) CN106228238B (en)

Families Citing this family (75)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268931B (en) * 2016-12-30 2022-10-25 华为技术有限公司 Data processing method, device and system
US10565492B2 (en) * 2016-12-31 2020-02-18 Via Alliance Semiconductor Co., Ltd. Neural network unit with segmentable array width rotator
US10140252B2 (en) 2017-02-28 2018-11-27 Microsoft Technology Licensing, Llc Hardware node with matrix-vector multiply tiles for neural network processing
US11086967B2 (en) 2017-03-01 2021-08-10 Texas Instruments Incorporated Implementing fundamental computational primitives using a matrix multiplication accelerator (MMA)
CN107633297B (en) * 2017-03-10 2021-04-06 南京风兴科技有限公司 Convolutional neural network hardware accelerator based on parallel fast FIR filter algorithm
CN108629405B (en) * 2017-03-22 2020-09-18 杭州海康威视数字技术股份有限公司 Method and device for improving calculation efficiency of convolutional neural network
CN107145944B (en) * 2017-03-29 2020-10-16 浙江大学 Genetic algorithm and system based on FPGA efficient training
EP3627437B1 (en) * 2017-04-06 2022-11-09 Cambricon (Xi'an) Semiconductor Co., Ltd. Data screening device and method
CN108734288B (en) * 2017-04-21 2021-01-29 上海寒武纪信息科技有限公司 Operation method and device
CN108804974B (en) * 2017-04-27 2021-07-02 深圳鲲云信息科技有限公司 Method and system for estimating and configuring resources of hardware architecture of target detection algorithm
CN107392308B (en) * 2017-06-20 2020-04-03 中国科学院计算技术研究所 Convolutional neural network acceleration method and system based on programmable device
CN107423030A (en) * 2017-07-28 2017-12-01 郑州云海信息技术有限公司 Markov Monte carlo algorithm accelerated method based on FPGA heterogeneous platforms
CN107480782B (en) * 2017-08-14 2020-11-10 电子科技大学 On-chip learning neural network processor
CN107506173A (en) * 2017-08-30 2017-12-22 郑州云海信息技术有限公司 A kind of accelerated method, the apparatus and system of singular value decomposition computing
CN107392309A (en) * 2017-09-11 2017-11-24 东南大学—无锡集成电路技术研究所 A kind of general fixed-point number neutral net convolution accelerator hardware structure based on FPGA
CN107657581B (en) * 2017-09-28 2020-12-22 中国人民解放军国防科技大学 Convolutional neural network CNN hardware accelerator and acceleration method
CN109726809B (en) * 2017-10-30 2020-12-08 赛灵思公司 Hardware implementation circuit of deep learning softmax classifier and control method thereof
CN107862650B (en) * 2017-11-29 2021-07-06 中科亿海微电子科技(苏州)有限公司 Method for accelerating calculation of CNN convolution of two-dimensional image
CN108090496A (en) * 2017-12-22 2018-05-29 银河水滴科技(北京)有限公司 The method and apparatus of image procossing based on convolutional neural networks
CN108231086A (en) * 2017-12-24 2018-06-29 航天恒星科技有限公司 A kind of deep learning voice enhancer and method based on FPGA
CN109993287B (en) * 2017-12-29 2019-12-06 北京中科寒武纪科技有限公司 neural network processing method, computer system, and storage medium
CN108416422B (en) * 2017-12-29 2024-03-01 国民技术股份有限公司 FPGA-based convolutional neural network implementation method and device
CN108280514B (en) * 2018-01-05 2020-10-16 中国科学技术大学 FPGA-based sparse neural network acceleration system and design method
CN108090560A (en) * 2018-01-05 2018-05-29 中国科学技术大学苏州研究院 The design method of LSTM recurrent neural network hardware accelerators based on FPGA
CN108229670B (en) * 2018-01-05 2021-10-08 中国科学技术大学苏州研究院 Deep neural network acceleration platform based on FPGA
CN110018979A (en) * 2018-01-09 2019-07-16 幻视互动(北京)科技有限公司 It is a kind of based on restructing algorithm collection and accelerate handle mixed reality data flow MR intelligent glasses and method
WO2019136755A1 (en) * 2018-01-15 2019-07-18 深圳鲲云信息科技有限公司 Method and system for optimizing design model of artificial intelligence processing device, storage medium, and terminal
WO2019136751A1 (en) * 2018-01-15 2019-07-18 深圳鲲云信息科技有限公司 Artificial intelligence parallel processing method and apparatus, computer readable storage medium, and terminal
CN109496319A (en) * 2018-01-15 2019-03-19 深圳鲲云信息科技有限公司 Artificial intelligence process device hardware optimization method, system, storage medium, terminal
US11874898B2 (en) 2018-01-15 2024-01-16 Shenzhen Corerain Technologies Co., Ltd. Streaming-based artificial intelligence convolution processing method and apparatus, readable storage medium and terminal
CN108229671B (en) * 2018-01-16 2022-03-04 华南理工大学 System and method for reducing storage bandwidth requirement of external data of accelerator
CN108320022A (en) * 2018-01-23 2018-07-24 深圳市易成自动驾驶技术有限公司 Deep learning system constituting method, device, deep learning system and storage medium
US11568232B2 (en) * 2018-02-08 2023-01-31 Quanta Computer Inc. Deep learning FPGA converter
CN110222833B (en) * 2018-03-01 2023-12-19 华为技术有限公司 Data processing circuit for neural network
CN108764466B (en) * 2018-03-07 2022-02-11 东南大学 Convolution neural network hardware based on field programmable gate array and acceleration method thereof
CN110363291B (en) * 2018-03-26 2022-02-08 上海寒武纪信息科技有限公司 Operation method and device of neural network, computer equipment and storage medium
CN110321998B (en) * 2018-03-31 2022-06-14 赛灵思公司 Convolutional neural network implementation method and device, acceleration equipment and storage medium
CN108520297B (en) * 2018-04-02 2020-09-04 周军 Programmable deep neural network processor
CN108710941A (en) * 2018-04-11 2018-10-26 杭州菲数科技有限公司 The hard acceleration method and device of neural network model for electronic equipment
US10657442B2 (en) * 2018-04-19 2020-05-19 International Business Machines Corporation Deep learning accelerator architecture with chunking GEMM
CN108629408A (en) * 2018-04-28 2018-10-09 济南浪潮高新科技投资发展有限公司 A kind of deep learning dynamic model based on FPGA cuts out inference system and method
US11875251B2 (en) * 2018-05-03 2024-01-16 Samsung Electronics Co., Ltd. Neural network method and apparatus
CN108665059A (en) * 2018-05-22 2018-10-16 中国科学技术大学苏州研究院 Convolutional neural networks acceleration system based on field programmable gate array
CN108763159A (en) * 2018-05-22 2018-11-06 中国科学技术大学苏州研究院 To arithmetic accelerator before a kind of LSTM based on FPGA
TWI672643B (en) * 2018-05-23 2019-09-21 倍加科技股份有限公司 Full index operation method for deep neural networks, computer devices, and computer readable recording media
CN110633226A (en) * 2018-06-22 2019-12-31 武汉海康存储技术有限公司 Fusion memory, storage system and deep learning calculation method
CN108920413B (en) * 2018-06-28 2019-08-09 中国人民解放军国防科技大学 Convolutional neural network multi-core parallel computing method facing GPDSP
CN108805277A (en) * 2018-06-29 2018-11-13 中国科学技术大学苏州研究院 Depth belief network based on more FPGA accelerates platform and its design method
CN110738316B (en) * 2018-07-20 2024-05-14 北京三星通信技术研究有限公司 Operation method and device based on neural network and electronic equipment
CN110826707B (en) * 2018-08-10 2023-10-31 北京百度网讯科技有限公司 Acceleration method and hardware accelerator applied to convolutional neural network
CN109359732B (en) * 2018-09-30 2020-06-09 阿里巴巴集团控股有限公司 Chip and data processing method based on chip
CN109344109B (en) * 2018-10-23 2022-07-26 江苏华存电子科技有限公司 System and method for accelerating artificial intelligence calculation in big data based on solid state disk
CN111090503B (en) * 2018-10-24 2023-07-21 上海雪湖信息科技有限公司 High-cost-performance cloud computing service system based on FPGA chip
CN109376332A (en) * 2018-10-30 2019-02-22 南京大学 A kind of arbitrary order Kalman filtering system
TWI696961B (en) 2018-12-12 2020-06-21 財團法人工業技術研究院 Deep neural networks (dnn) hardware accelerator and operation method thereof
CN109523019B (en) * 2018-12-29 2024-05-21 百度在线网络技术(北京)有限公司 Accelerator, accelerating system based on FPGA, control method and CNN network system
CN109740748B (en) * 2019-01-08 2021-01-08 西安邮电大学 Convolutional neural network accelerator based on FPGA
CN109933370B (en) * 2019-02-01 2021-10-15 京微齐力(北京)科技有限公司 System chip for connecting FPGA and artificial intelligence module
CN109816108A (en) * 2019-02-15 2019-05-28 领目科技(上海)有限公司 Deep learning accelerator, device and method
CN110032374B (en) * 2019-03-21 2023-04-07 深兰科技(上海)有限公司 Parameter extraction method, device, equipment and medium
CN110084363B (en) * 2019-05-15 2023-04-25 电科瑞达(成都)科技有限公司 Deep learning model acceleration method based on FPGA platform
CN110135572B (en) * 2019-05-17 2023-05-26 南京航空航天大学 SOC-based trainable flexible CNN system design method
CN112036557B (en) * 2019-06-04 2023-06-27 北京邮电大学 Deep learning system based on multiple FPGA development boards
CN110399979B (en) * 2019-06-17 2022-05-13 深圳大学 Click rate pre-estimation system and method based on field programmable gate array
CN112149047A (en) * 2019-06-27 2020-12-29 深圳市中兴微电子技术有限公司 Data processing method and device, storage medium and electronic device
CN110647983B (en) * 2019-09-30 2023-03-24 南京大学 Self-supervision learning acceleration system and method based on storage and calculation integrated device array
CN110928605B (en) * 2019-11-14 2023-05-02 天津大学 Beam adjustment method hardware accelerator based on Zynq FPGA
CN111176962B (en) * 2019-12-02 2021-09-10 深圳先进技术研究院 FPGA platform, performance evaluation and design optimization method thereof and storage medium
CN111061513B (en) * 2019-12-20 2022-02-01 支付宝(杭州)信息技术有限公司 Method for accelerating modeling of computing device, electronic device and readable storage medium
CN111884952B (en) * 2020-07-06 2021-05-25 华东师范大学 Multichannel calculation accelerating equipment based on FPGA
CN113485762A (en) * 2020-09-19 2021-10-08 广东高云半导体科技股份有限公司 Method and apparatus for offloading computational tasks with configurable devices to improve system performance
CN112433981A (en) * 2020-11-22 2021-03-02 中国人民解放军战略支援部队信息工程大学 Miniaturized software radio platform for high-speed intelligent signal processing
CN113673690B (en) * 2021-07-20 2024-05-28 天津津航计算技术研究所 Underwater noise classification convolutional neural network accelerator
CN115658323A (en) * 2022-11-15 2023-01-31 国网上海能源互联网研究院有限公司 FPGA load flow calculation acceleration architecture and method based on software and hardware cooperation
CN116630709B (en) * 2023-05-25 2024-01-09 中国科学院空天信息创新研究院 Hyperspectral image classification device and method capable of configuring mixed convolutional neural network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4366652B2 (en) * 2004-04-23 2009-11-18 横河電機株式会社 Transmitter and duplexing method thereof
US20140289445A1 (en) * 2013-03-22 2014-09-25 Antony Savich Hardware accelerator system and method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104112053A (en) * 2014-07-29 2014-10-22 中国航天科工集团第三研究院第八三五七研究所 Design method of reconfigurable architecture platform oriented image processing
CN104915322A (en) * 2015-06-09 2015-09-16 中国人民解放军国防科学技术大学 Method for accelerating convolution neutral network hardware and AXI bus IP core thereof
CN105162475A (en) * 2015-08-19 2015-12-16 中国人民解放军海军工程大学 FPGA (Field Programmable Gate Array) based parameterized multi-standard decoder with high throughput rate
CN105447285A (en) * 2016-01-20 2016-03-30 杭州菲数科技有限公司 Method for improving OpenCL hardware execution efficiency

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Deep Learning prediction process accelerator based FPGA; Qi Yu et al.; IEEE; 2015-12-31; pp. 1159-1162, Sections III-V
DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning; Tianshi Chen et al.; ACM; 2014-03-05; pp. 269-283, Abstract, Sections 2-3

Also Published As

Publication number Publication date
CN106228238A (en) 2016-12-14

Similar Documents

Publication Publication Date Title
CN106228238B (en) Accelerate the method and system of deep learning algorithm on field programmable gate array platform
JP7329533B2 (en) Method and accelerator apparatus for accelerating operations
KR102175044B1 (en) Apparatus and method for running artificial neural network reverse training
JP7358382B2 (en) Accelerators and systems for accelerating calculations
US10902315B2 (en) Device for implementing artificial neural network with separate computation units
US10282659B2 (en) Device for implementing artificial neural network with multiple instruction units
KR101959376B1 (en) Systems and methods for a multi-core optimized recurrent neural network
EP3298547B1 (en) Batch processing in a neural network processor
US20190065958A1 (en) Apparatus and Methods for Training in Fully Connected Layers of Convolutional Networks
KR102203746B1 (en) Apparatus and method for executing forward computation of artificial neural network
JP7078758B2 (en) Improving machine learning models to improve locality
Kästner et al. Hardware/software codesign for convolutional neural networks exploiting dynamic partial reconfiguration on PYNQ
AU2016203619A1 (en) Layer-based operations scheduling to optimise memory for CNN applications
CN112840356A (en) Operation accelerator, processing method and related equipment
CN103870335B (en) System and method for efficient resource management of signal flow programmed digital signal processor code
Stevens et al. Manna: An accelerator for memory-augmented neural networks
CN110414672B (en) Convolution operation method, device and system
CN110377874B (en) Convolution operation method and system
CN113655986B9 (en) FFT convolution algorithm parallel implementation method and system based on NUMA affinity
CN111506520B (en) Address generation method, related device and storage medium
Diamantopoulos et al. A system-level transprecision FPGA accelerator for BLSTM using on-chip memory reshaping
Abdelrazek et al. A novel architecture using NVIDIA CUDA to speed up simulation of multi-path fast fading channels
CN114298329A (en) Model training method, device, equipment and storage medium
Que Reconfigurable acceleration of recurrent neural networks
JP2023006509A (en) Software generation device and software generation method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant