CN106228238A - Method and system for accelerating a deep learning algorithm on a field programmable gate array platform - Google Patents

Method and system for accelerating a deep learning algorithm on a field programmable gate array platform

Info

Publication number
CN106228238A
Authority
CN
China
Prior art keywords: data, programmable gate, gate array, hardware, calculation
Prior art date
Application number
CN201610596159.3A
Other languages
Chinese (zh)
Other versions
CN106228238B (en)
Inventor
周学海
王超
余奇
周徐达
赵洋洋
李曦
陈香兰
Original Assignee
中国科学技术大学苏州研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学技术大学苏州研究院
Priority to CN201610596159.3A
Publication of CN106228238A
Application granted
Publication of CN106228238B


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06N - COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computer systems based on biological models
    • G06N3/02 - Computer systems based on biological models using neural network models
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06N - COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computer systems based on biological models
    • G06N3/02 - Computer systems based on biological models using neural network models
    • G06N3/08 - Learning methods

Abstract

The invention discloses a method for accelerating a deep learning algorithm on a field programmable gate array (FPGA) platform. The FPGA platform comprises a general-purpose processor, a field programmable gate array and a memory module. The method comprises the following steps: according to the deep learning prediction process and training process, and in combination with deep neural networks and convolutional neural networks, determining the general-purpose computation parts suitable for running on the FPGA platform; according to the identified general-purpose computation parts, determining a software-hardware co-design scheme; and according to the computational logic resources and bandwidth of the FPGA, determining the number and types of IP cores to be instantiated, and performing the acceleration on the FPGA platform by means of hardware computation units. Hardware processing units for accelerating deep learning algorithms can be designed quickly according to the available hardware resources, and the processing units offer high performance and low power consumption relative to a general-purpose processor.

Description

Method and system for accelerating a deep learning algorithm on a field programmable gate array platform

Technical field

The present invention relates to the field of computer hardware acceleration, and more particularly to a method and system for accelerating a deep learning algorithm on a field programmable gate array platform.

Background technology

Deep learning has achieved remarkable results in solving high-level abstract cognitive problems and has taken machine learning to a new stage. It not only has high scientific research value but also strong practicality, and has therefore attracted great interest from both academia and industry. However, in order to solve more abstract and more complex learning problems, the network scale of deep learning keeps growing, and the complexity of the computation and the data increases sharply with it; for example, the Google "cat" network has about one billion neurons. Accelerating deep learning algorithms with high performance and low energy consumption has thus become a research hotspot for scientific and commercial institutions.

Computing tasks can generally be divided into two kinds by the way they are realized: on a general-purpose processor a task is usually presented in the form of software code and is called a software task; on a dedicated hardware circuit the intrinsic speed of hardware is fully exploited to replace the software task, and this is called a hardware task. Common hardware acceleration technologies include application-specific integrated circuits (ASIC), field programmable gate arrays (FPGA) and graphics processing units (GPU). An ASIC is an integrated circuit chip designed and developed for a special purpose; it features high performance, low power consumption and small area. Compared with an FPGA, an ASIC usually runs faster, consumes less power, and is cheaper in volume production. Although for the same given function an FPGA uses more transistors than an ASIC, the FPGA simplifies the design of logic tasks and its design cycle is much shorter than that of an ASIC. In addition, the mask cost of producing an ASIC is very high and grows exponentially as the feature size shrinks. An FPGA, as a programmable standard component adaptable to different functions, has no such huge development cost and offers a certain degree of flexibility. A GPU is suitable for parallel computation on massive data and features high bandwidth, high clock frequency and high concurrency; with the introduction of the CUDA (Compute Unified Device Architecture) general-purpose parallel computing framework, developers can design high-performance solutions more conveniently and quickly. However, the power consumption of a GPU is high: a single GPU often consumes more power than a contemporary mainstream CPU, and its energy consumption is typically tens or even hundreds of times that of an FPGA.

Summary of the invention

In view of this, the object of the present invention is to provide a method and system for accelerating a deep learning algorithm on a field programmable gate array platform, with which hardware processing units for accelerating deep learning algorithms can be designed quickly according to the available hardware resources, the processing units offering high performance and low power consumption relative to a general-purpose processor.

The technical scheme of the present invention is as follows:

A method for accelerating a deep learning algorithm on a field programmable gate array platform, characterized in that the field programmable gate array platform comprises a general-purpose processor, a field programmable gate array and a memory module, and the method comprises the following steps:

S01: according to the deep learning prediction process and training process, and in combination with deep neural networks and convolutional neural networks, determining the general-purpose computation parts suitable for running on the field programmable gate array platform;

S02: according to the identified general-purpose computation parts, determining a software-hardware co-design scheme;

S03: according to the computational logic resources and bandwidth of the FPGA, determining the number and types of IP cores to be instantiated, and performing the acceleration on the field programmable gate array platform by means of hardware computation units.

In a preferred technical scheme, the general-purpose computation parts comprise a forward computation module, used for matrix multiplication and activation function computation, and a weight update module, used for vector computation.

In a preferred technical scheme, step S02 comprises the following steps:

performing data preparation on the software side;

converting the convolution computation of the convolutional layers in the convolutional neural network into matrix multiplication;

using direct memory access as the data path for the software-hardware co-computation.

In a preferred technical scheme, determining the number and types of IP cores to be instantiated in step S03 comprises: determining the types of computation units to be instantiated on the FPGA according to the hardware tasks to be executed; and determining the number of processing units for the hardware tasks according to the FPGA hardware logic resources and bandwidth.

In a preferred technical scheme, the forward computation module adopts a tiled design: each row of the node matrix is divided internally into tiles of a given tile size, and each column of the weight parameter matrix is divided into tiles of the same size; each tile of a node matrix row is dot-multiplied with the corresponding tile of a weight parameter matrix column, and after a whole row has been processed the intermediate values are accumulated to obtain the final result.

In a preferred technical scheme, the tile size is a power of 2 and is kept consistent with the parallel granularity of the computation unit.

The present invention further discloses an FPGA structure for accelerating a deep learning algorithm, characterized by comprising:

a tile processing structure, which divides the node data matrix and the weight parameter matrix of the forward computation module into tiles and time-multiplexes the hardware logic;

an activation function piecewise linear approximation structure, used to realize an arbitrary activation function;

a parameter configuration module, used to configure the parameters of the processing units;

a forward computation module, comprising a forward computation hardware structure with a single DMA caching the weights and a forward computation hardware structure with two DMAs reading in parallel; it is used for the forward computation of deep neural networks, the forward computation of the convolutional layers and classification layers of convolutional neural networks, and matrix multiplication, and is pipelined for maximum throughput;

a weight update module, used for vector computation.

In a preferred technical scheme, the parameter configuration module configures the processing units by transferring configuration parameter data via DMA, including: the working mode configuration and data scale configuration of the forward computation module, the data scale configuration comprising the node data scale configuration, the input neuron scale configuration and the output neuron scale configuration; and the data scale configuration, working mode configuration and computation parameter configuration of the weight update module.

In a preferred technical scheme, the forward computation hardware structure with a single DMA caching the weights comprises:

a single DMA, responsible for reading data and writing data back;

a double register buffer, whose two halves alternately read data and take part in the parallel computation; a BRAM group, which caches data and guarantees parallel data reads;

floating-point multipliers equal in number to the tile size;

a binary adder tree whose number of inputs equals the tile size;

a cyclic accumulator, which accumulates intermediate values and stores them in on-chip BRAM;

an activation function computation module, which realizes the activation function by piecewise linear approximation, the coefficients being cached in on-chip BRAM.

The forward computation hardware structure with two DMAs reading in parallel comprises:

a neuron data read module, equipped with a DMA and a FIFO buffer, responsible for reading the input neuron node data;

a weight parameter data read module, equipped with a DMA and a FIFO buffer, responsible for reading the weight parameter data;

floating-point multipliers equal in number to the tile size;

a binary adder tree whose number of inputs equals the tile size;

a cyclic accumulator, which accumulates intermediate values and stores them in on-chip BRAM;

an activation function computation module, which realizes the activation function by piecewise linear approximation, the coefficients being cached in on-chip BRAM.

In a preferred technical scheme, the weight update module is used for weight update computation and output-layer error computation and is pipelined for maximum throughput. It comprises: a vector A data read module and a vector B data read module, each equipped with a DMA and a FIFO buffer, which read the two groups of vector values used in the computation; a computation module, which performs the vector computation specified by the configuration information; and a result write-back module, equipped with a DMA and a FIFO buffer, which writes the computation results back to the host memory.

Compared with the prior art, the present invention has the following advantages:

The present invention can effectively accelerate deep learning algorithms, including the prediction process and the training process, and can quickly design hardware processing units for accelerating deep learning algorithms according to the available hardware resources; the processing units offer high performance and low power consumption relative to a general-purpose processor.

Brief description of the drawings

The invention is further described below with reference to the accompanying drawings and embodiments:

Fig. 1 is a flowchart of the method for accelerating deep learning on the field programmable gate array platform according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the computation of a convolutional layer in a convolutional neural network;

Fig. 3 is a schematic diagram of how the forward computation hardware processing unit on the field programmable gate array platform of the embodiment converts the convolutional layer computation;

Fig. 4 is a schematic diagram of how the weight update processing unit on the field programmable gate array platform of the embodiment converts a data matrix into vectors;

Fig. 5 is a schematic structural diagram of the software-hardware co-computation on the field programmable gate array platform of the embodiment;

Fig. 6 is a schematic diagram of the resource usage of the hardware processing units of the embodiment and of the number and types of units instantiated according to the resources and application requirements of the field programmable gate array platform;

Fig. 7 is a schematic diagram of the data tiling process of the forward computation processing unit of the embodiment;

Fig. 8 is a schematic diagram of the piecewise linear approximation of the activation function in the embodiment;

Fig. 9 is a schematic structural diagram of the forward computation hardware processing unit with a single DMA and a pre-stored weight matrix on the heterogeneous multi-core reconfigurable computing platform of the embodiment;

Fig. 10 is a schematic structural diagram of the accumulation process inside the forward computation hardware processing unit on the heterogeneous multi-core reconfigurable computing platform of the embodiment;

Fig. 11 is a schematic structural diagram of the piecewise approximation of the sigmoid function inside the forward computation hardware processing unit on the heterogeneous multi-core reconfigurable computing platform of the embodiment;

Fig. 12 is a data processing flowchart of the forward computation hardware processing unit with a single DMA and a pre-stored weight matrix on the heterogeneous multi-core reconfigurable computing platform of the embodiment;

Fig. 13 is a schematic structural diagram of the forward computation hardware processing unit with two DMAs reading data in parallel on the heterogeneous multi-core reconfigurable computing platform of the embodiment;

Fig. 14 is a data processing flowchart of the forward computation hardware processing unit with two DMAs reading data in parallel on the heterogeneous multi-core reconfigurable computing platform of the embodiment;

Fig. 15 is a schematic structural diagram of the weight update hardware processing unit on the heterogeneous multi-core reconfigurable computing platform of the embodiment;

Fig. 16 is a data processing flowchart of the weight update hardware processing unit on the heterogeneous multi-core reconfigurable computing platform of the embodiment;

Fig. 17 is a schematic diagram of a possible application scenario and framework of the deep learning accelerator on the heterogeneous multi-core reconfigurable computing platform of the embodiment.

Detailed description of the invention

The above scheme is further described below with reference to specific embodiments. It should be understood that these embodiments are intended to illustrate the present invention and not to limit its scope. The implementation conditions used in the embodiments may be further adjusted according to the conditions of a specific manufacturer; implementation conditions that are not specified are usually those of routine experiments.

Embodiment:

The field programmable gate array platform in the embodiment of the present invention refers to a computing system that integrates both a general-purpose processor (General Purpose Processor, "GPP" for short) and a field programmable gate array (Field Programmable Gate Array, "FPGA" for short) chip, where the data path between the FPGA and the GPP may use the PCI-E bus protocol, the AXI bus protocol, and so on. In the accompanying drawings of the embodiments the data path is illustrated with the AXI bus protocol as an example, but the present invention is not limited to this.

Fig. 1 is a flowchart of the method 100 for accelerating a deep learning algorithm on the field programmable gate array platform according to an embodiment of the present invention. The method 100 comprises:

S110, according to the deep learning prediction process and training process, where the training process comprises a local pre-training process and a global training process, and in combination with deep neural networks and convolutional neural networks, determining the general-purpose computation parts suitable for running on the field programmable gate array platform;

S120, according to the identified general-purpose hardware computation modules, determining the software-hardware co-design scheme;

S130, according to the computational logic resources and bandwidth of the field programmable gate array, determining the number and types of IP cores to be instantiated.

The method by which the embodiment of the present invention accelerates the general-purpose computation parts of deep learning is described in detail below with reference to Fig. 2 to Fig. 4.

Fig. 2 is a schematic diagram of the computation of a convolutional layer. Assume that the number of input feature maps is 4 and the convolution kernel size is 3x3; after the results of the 4 convolution kernels are accumulated and the sum is processed by the activation function, the value of the output feature map is obtained. Viewed from the overall computation structure, the basic computation pattern of a convolutional layer is similar to that of a hidden layer of a deep neural network: simply by adjusting the ordering of the convolution kernel parameters, the convolution computation used here can be turned into a dot-product computation. The concrete adjustment is as follows: 1) the input feature maps are filled, row by row from top to bottom, into one row, as shown on the left of Fig. 3; 2) each convolution kernel is rotated counter-clockwise by 180 degrees and then written, row by row from top to bottom, into one column of the weight matrix, as shown in the column in Fig. 3: the original convolution kernels a to d, after being rotated counter-clockwise by 180 degrees, become a9..a1, b9..b1, ..., d9..d1 and are filled in order into one column. In this way, for the prediction process of a convolutional layer, the basic computation can be converted into the same form as a hidden layer of a deep neural network, namely matrix multiplication followed by activation function processing, at the extra cost of the data conversion.
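
The following is a minimal software sketch of the Fig. 2/Fig. 3 conversion, not the patented hardware: each node-matrix row holds the receptive field of one output pixel read row by row from every input map, and each weight-matrix column holds the kernels of one output map, each kernel rotated by 180 degrees. Stride 1 and "valid" convolution, and the helper name conv_as_matmul, are illustrative assumptions.

```cpp
#include <vector>
#include <cstddef>
using std::size_t; using std::vector;

// in[c][y][x]: C input maps of size H x W; w[o][c][i]: kernel of output map o for
// input map c, flattened row-major, i in [0, K*K).
vector<vector<float>> conv_as_matmul(const vector<vector<vector<float>>>& in,
                                     const vector<vector<vector<float>>>& w,
                                     size_t K) {
    size_t C = in.size(), H = in[0].size(), W = in[0][0].size();
    size_t O = w.size(), oH = H - K + 1, oW = W - K + 1;

    // Node matrix: one row per output pixel, C*K*K values per row (Fig. 3, left).
    vector<vector<float>> node(oH * oW, vector<float>(C * K * K));
    for (size_t oy = 0; oy < oH; ++oy)
        for (size_t ox = 0; ox < oW; ++ox)
            for (size_t c = 0, i = 0; c < C; ++c)
                for (size_t ky = 0; ky < K; ++ky)
                    for (size_t kx = 0; kx < K; ++kx, ++i)
                        node[oy * oW + ox][i] = in[c][oy + ky][ox + kx];

    // Weight matrix: one column per output map, kernels rotated 180 degrees
    // (a9..a1, b9..b1, ... in Fig. 3); for row-major storage this is a full reversal.
    vector<vector<float>> wm(C * K * K, vector<float>(O));
    for (size_t o = 0; o < O; ++o)
        for (size_t c = 0; c < C; ++c)
            for (size_t i = 0; i < K * K; ++i)
                wm[c * K * K + i][o] = w[o][c][K * K - 1 - i];

    // A plain matrix multiplication now reproduces the convolutional layer
    // (the activation function is applied elsewhere).
    vector<vector<float>> out(oH * oW, vector<float>(O, 0.0f));
    for (size_t r = 0; r < oH * oW; ++r)
        for (size_t k = 0; k < C * K * K; ++k)
            for (size_t o = 0; o < O; ++o)
                out[r][o] += node[r][k] * wm[k][o];
    return out;   // out[row][o] = value of output map o at that pixel
}
```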

During deep learning training, besides a large amount of matrix multiplication, a large amount of vector computation is also needed; to perform the vector computation the matrix data must be converted into vector data. As shown in Fig. 4, each row of the data matrix is taken in order to form a vector on which the vector computation is performed.

Therefore, with reference to Fig. 2 to Fig. 4, this embodiment reduces the general-purpose computation parts of the deep learning prediction process and training process to matrix multiplication, activation function computation and a large amount of vector computation.

Fig. 5 shows the structural framework 200 of the software-hardware co-computation used in this embodiment. The structure includes:

a Processing System (PS for short) 210, which serves as the control end of the whole system and comprises a CPU and memory. The CPU acts as the host, runs the software-side code and offloads the acceleration tasks to the PL side. In addition, the CPU controls the working state and the data reading of each IP core on the PL side (here an intellectual property core denotes a hardware computation unit);

FPGA Programmable Logic (PL for short) 220, the hardware acceleration part of the whole system, namely the FPGA chip. IP cores can be instantiated on the FPGA chip according to the different acceleration tasks so as to accelerate the algorithm. Driven by the PS side, the system selects different IP cores for parallel computation according to the specific algorithm schedule, and host-side software tasks and FPGA-side hardware tasks can also be computed in parallel;

a data bus (Data Bus) 230, responsible for the data transmission between the PS side and the PL side of the whole system;

a control signal bus (Control Bus) 240, responsible for the control signal transmission between the PS side and the PL side of the whole system.

Fig. 6 shows the overall accelerator structure 2000 based on the FPGA design. The structure includes:

a system controller 2100, responsible for controlling the execution state of each hardware computation unit, the data transmission and the program scheduling, and also responsible for running the non-general-purpose computation parts of deep learning, the data initialization and the initialization tasks of the hardware computation units (also called IP cores);

a memory 2200, responsible for storing the deep learning network parameters and the original input data; the physical addresses of the data stored here are required to be contiguous so that DMA data transfers are convenient;

a data bus protocol 2300: the AXI-Stream protocol allows unrestricted data burst transfers and is a high-performance data transmission protocol;

a control bus protocol 2400: AXI-Lite is a lightweight, address-mapped single-transfer protocol suitable for the control signal transmission of the hardware computation units;

a data interconnect 2500, interconnecting the data paths;

a control interconnect 2600, interconnecting the control signal lines;

direct memory access (DMA) 2700, responsible for the data transmission between the accelerator and the memory; every hardware processing unit is equipped with its own DMA so that data can be read in parallel;

a PE (Processing Element) 2800, the computation unit of each accelerator; inside it, one forward computation unit, one weight update unit, or both can be instantiated. Since the FPGA is programmable and reconfigurable, the number of PEs here can be configured dynamically according to the resources and bandwidth of the specific FPGA chip, so that the hardware computation resources are fully utilized without changing the hardware design of the computation units and the hardware achieves its peak performance.

The method by which the embodiment of the present invention accelerates deep learning algorithms has been described in detail above with reference to Fig. 1 to Fig. 6; the hardware structures of the embodiment are introduced below.

Fig. 7 shows the forward computation unit designed with the tiled computation pattern. Assume the tile size is 16: every row of the node matrix is divided internally into tiles of 16, and every column of the weight parameter matrix is divided into tiles of 16 elements. Each tile of 16 values from a node matrix row is dot-multiplied with the corresponding 16 values from a weight parameter matrix column, and once a whole row has been processed the intermediate values are accumulated to obtain the final result. This method not only makes full use of data locality, but also reduces the resources needed to instantiate the parallel execution units and lowers the data bandwidth required by the hardware, allowing a single computation unit to perform matrix multiplication of arbitrary scale.

To maintain high throughput, the tile size should match the internal design of the computation unit and be kept consistent with the parallel granularity. For matrix multiplication the tile size can be set to a power of 2 so as to exploit the accumulation performance of the binary adder tree. Since the tile size is tied to the parallel granularity, in theory the larger the tile the higher the parallelism and the better the performance of the computation unit; therefore, where the hardware resources and bandwidth allow, the largest power-of-2 tile size is selected for the computation unit. A software sketch of this tiling scheme is given below.
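
A minimal software analogue of the Fig. 7 tiling scheme, under stated assumptions rather than as the RTL: one node-matrix row and one weight-matrix column are consumed in tiles of TILE values, each tile is reduced by a dot product, and the per-tile partial results are accumulated into the final output. TILE = 16 mirrors the example in the text; the short final tile is padded with zeros as described later.

```cpp
#include <vector>
#include <cstddef>

constexpr std::size_t TILE = 16;   // power of 2, equal to the parallel granularity

float tiled_dot(const std::vector<float>& node_row,
                const std::vector<float>& weight_col) {
    float acc = 0.0f;                               // cyclic accumulator
    for (std::size_t base = 0; base < node_row.size(); base += TILE) {
        float partial = 0.0f;                       // one tile = one parallel dot product
        for (std::size_t i = 0; i < TILE; ++i) {
            std::size_t k = base + i;
            float a = (k < node_row.size())   ? node_row[k]   : 0.0f;  // zero padding
            float b = (k < weight_col.size()) ? weight_col[k] : 0.0f;
            partial += a * b;                       // in hardware: 16 multipliers + adder tree
        }
        acc += partial;                             // accumulate the tile's intermediate value
    }
    return acc;                                     // activation function applied elsewhere
}
```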

Fig. 8 is a schematic diagram of the hardware implementation of the activation function in this embodiment. The embodiment uses piecewise linear approximation to realize an S-shaped activation function: the function is divided along the X axis into a number of equal intervals, and within each interval it is approximated linearly as Y = a_i*X + b_i for X in [x_i, x_(i+1)), where x_(i+1) - x_i is the interval width of the approximation. When the activation function needs to be computed, the interval containing the X value is first located and the offsets of the corresponding a_i and b_i relative to the base address are computed; after a multiply-add operation the approximate Y value is obtained. This implementation has two benefits: 1) any S-shaped activation function or linear function can be realized without changing any hardware design; only the stored values of the coefficients a and b need to be changed; 2) the error is very small: as the approximation interval shrinks the error becomes negligible, and the only cost is a larger BRAM for storing the coefficients a and b. Moreover, deep learning computation itself does not demand extremely high data precision, and a certain degree of precision loss does not affect the result.
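
A small sketch of the Fig. 8 scheme under illustrative assumptions (the struct and function names are not from the patent): one (a_i, b_i) pair is precomputed per equal-width interval, and Y = a_i*X + b_i is then evaluated by indexing the interval that contains X; the table plays the role of the on-chip BRAM holding the coefficients.

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

struct PwlTable {
    float x_min, step;            // equal-width intervals starting at x_min
    std::vector<float> a, b;      // one coefficient pair per interval
};

PwlTable build_table(float (*f)(float), float x_min, float x_max, std::size_t n) {
    PwlTable t{x_min, (x_max - x_min) / n, {}, {}};
    for (std::size_t i = 0; i < n; ++i) {
        float x0 = x_min + i * t.step, x1 = x0 + t.step;
        float a = (f(x1) - f(x0)) / t.step;     // slope over the interval
        t.a.push_back(a);
        t.b.push_back(f(x0) - a * x0);          // intercept so the segment passes through f(x0)
    }
    return t;
}

float pwl_eval(const PwlTable& t, float x) {
    long idx = static_cast<long>((x - t.x_min) / t.step);   // locate the interval
    if (idx < 0) idx = 0;                                    // clamp outside the table
    if (idx >= static_cast<long>(t.a.size())) idx = t.a.size() - 1;
    return t.a[idx] * x + t.b[idx];                          // one multiply-add
}

// Usage example: approximate the sigmoid on [-8, 8) with 256 segments.
// float y = pwl_eval(build_table([](float v){ return 1.0f / (1.0f + std::exp(-v)); },
//                                -8.0f, 8.0f, 256), 0.3f);
```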

Fig. 9 is a schematic block diagram 3000 of the hardware structure with a single DMA and a pre-stored weight matrix on the field programmable gate array platform of the embodiment. This hardware structure is intended for the case where the on-chip BRAM resources of the FPGA are relatively abundant: the weight matrix data are cached in on-chip BRAM in advance and the forward computation is then carried out. The structure includes:

The data read module 3100 is equipped with a DMA and a FIFO buffer with a data width of 32 bits, and is responsible for reading the weight parameters into the on-chip BRAM and for reading the neuron node data.

The on-chip BRAM 3200 caches the weight parameter data. Taking a tile size of 16 as an example, the weight matrix is distributed column by column over 16 different BRAMs in a round-robin fashion, i.e. the addressing scheme is i % 16 plus the BRAM base address, which guarantees that the 16 parallel multiplications read their data from 16 different BRAMs at the same time.
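
A tiny addressing sketch, assumed from the i % 16 scheme just described for the BRAM group: element i of a weight column lands in BRAM (i % 16) at word (i / 16), so the 16 operands of one tile always come from 16 different BRAMs.

```cpp
#include <cstddef>

struct BramAddress {
    std::size_t bram;   // which of the 16 BRAMs
    std::size_t word;   // word offset inside that BRAM (added to its base address)
};

inline BramAddress weight_address(std::size_t i) {
    return BramAddress{ i % 16, i / 16 };
}
```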

The double register buffer 3300 contains two register groups of 16 registers each for storing the input neuron data; the two groups are used alternately, one being filled with data while the other takes part in the parallel computation. It should be noted that the time needed to fill a buffer must be less than the time needed to compute on that data, so that the buffer read time is hidden by the computation time and the correctness of the result is guaranteed.

The parallel floating-point multiplier 3400 multiplies the weight parameter data and the neuron data in parallel. The floating-point computation is implemented with DSPs, and after pipeline optimization 16 floating-point multiplications can be processed in parallel per clock cycle, the tile size being 16 in this example. Since the number of input neurons is not necessarily divisible by 16, when the dot product is computed tile by tile the last tile may contain fewer than 16 values; in that case the computation unit fills the missing part of the 16 with zeros before performing the parallel multiplication.

The binary floating-point adder tree 3500 sums the floating-point results produced by the parallel floating-point multiplier 3400. Computing with a binary adder tree in parallel removes the read-write dependency of the accumulation and reduces the time complexity of the accumulation from O(n) to O(log n).
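
A log-depth reduction sketch, a software analogue of the binary adder tree rather than the RTL: 16 products are summed pairwise in log2(16) = 4 levels, which is what removes the serial read-write dependency of a running accumulation.

```cpp
#include <array>

float adder_tree_16(std::array<float, 16> v) {
    for (int width = 16; width > 1; width /= 2)    // 4 levels: 8 + 4 + 2 + 1 adders
        for (int i = 0; i < width / 2; ++i)
            v[i] = v[2 * i] + v[2 * i + 1];        // each level adds disjoint pairs
    return v[0];
}
```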

The accumulation computation 3600 is needed because the forward computation processing unit computes tile by tile, so the results produced by the binary floating-point adder tree 3500 must be accumulated; the accumulation is performed cyclically with a period equal to the number of output neurons.

The activation function computation 3700 realizes the activation function by piecewise linear approximation, the coefficients being cached in on-chip BRAM.

The data write-back module 3800 is equipped with a DMA and a FIFO buffer with a data width of 32 bits, and is responsible for writing the computation results back to the host memory.

This hardware structure supports parameter configuration and can therefore support neural network computation of different scales. The detailed configuration parameters are:

Data_size: the scale of the input neuron data;

Input_size: the number of input neurons; since the weight matrix data are cached in advance, this should not exceed Max_input, the maximum number of input neurons whose weight parameters the on-chip BRAM can cache;

Output_size: the number of output neurons; since the weight matrix data are cached in advance, this should not exceed Max_output, the maximum number of output neurons whose weight parameters the on-chip BRAM can cache;

Work_mode: 0 means only matrix multiplication is performed; 1 means matrix multiplication and activation function computation are performed.
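
A hedged sketch of a configuration record for the single-DMA forward unit: the field names follow the parameters listed above, while the struct layout, the MAX_* constants and the validation helper are illustrative assumptions rather than the patent's register map.

```cpp
#include <cstdint>

constexpr uint32_t MAX_INPUT  = 4096;   // assumed BRAM limits of a concrete device
constexpr uint32_t MAX_OUTPUT = 4096;

struct ForwardConfig {
    uint32_t data_size;     // scale of the input neuron data
    uint32_t input_size;    // number of input neurons  (<= MAX_INPUT)
    uint32_t output_size;   // number of output neurons (<= MAX_OUTPUT)
    uint32_t work_mode;     // 0: matrix multiply only, 1: multiply + activation
};

inline bool config_is_valid(const ForwardConfig& c) {
    return c.input_size <= MAX_INPUT && c.output_size <= MAX_OUTPUT &&
           (c.work_mode == 0 || c.work_mode == 1);
}
```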

Fig. 10 is a schematic diagram 3600 of the hardware structure that performs the accumulation on the field programmable gate array platform of the embodiment. The structure includes:

The floating-point addition computation 3610 is needed because the tiling approach is used, so the intermediate values produced by the dot-product computation must be accumulated. In the intermediate value data stream, the values belonging to the same output neuron recur with a period of N, the number of output neurons (i.e. the number of columns of the second matrix), and after the accumulation the results are output in order.

The intermediate value storage BRAM 3620 provides N storage elements inside the FPGA for the temporary data; the streaming data are added cyclically into the corresponding BRAM storage elements, and whether the accumulation has finished is judged from the relation between the number of input neurons and the tile size. Since the number of elements used for storing intermediate values cannot be changed dynamically once the FPGA design is fixed, a maximum supported accumulation count MAX is set when the computation unit is designed; the accumulation can proceed normally only when the number of output neurons does not exceed MAX.

This process is likewise pipeline-optimized, with the initiation interval optimized to one clock cycle, which keeps the production of the intermediate values consistent with the speed at which they are consumed.
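
A software analogue of the Fig. 10 accumulation, under illustrative assumptions: partial sums stream in round-robin over the N output neurons, each one is added into slot (i % N) of the intermediate-value store, and accumulation for a slot is complete after ceil(Input_size / tile) contributions, mirroring the "input neurons vs tile size" test in the text.

```cpp
#include <vector>
#include <cassert>
#include <cstddef>

std::vector<float> accumulate_stream(const std::vector<float>& partials,
                                     std::size_t n_output,   // N output neurons (<= MAX)
                                     std::size_t input_size,
                                     std::size_t tile) {
    std::size_t tiles_per_neuron = (input_size + tile - 1) / tile;
    assert(partials.size() == tiles_per_neuron * n_output);  // end-of-accumulation condition
    std::vector<float> store(n_output, 0.0f);                 // the N intermediate-value slots
    for (std::size_t i = 0; i < partials.size(); ++i)
        store[i % n_output] += partials[i];                   // add into slot i % N, round robin
    return store;                                             // results leave in order
}
```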

Fig. 11 shows the hardware structure 3700 that realizes the activation function by piecewise linear approximation on the field programmable gate array platform of the embodiment.

The activation function is realized by piecewise linear approximation; the implementation details are shown in Fig. 11. Unlike Fig. 8, a path is added that passes X directly through to Y, so that the forward computation unit can perform only the matrix multiplication without the activation function processing; this is mainly intended for the matrix multiplication used when error values are computed during training. Since an S-shaped activation function is essentially symmetric about a certain point (taking the sigmoid function as an example, it is symmetric about (0, 0.5)), the hardware logic can be reused: when x is less than 0 the result is computed as 1 - f(-x), which reduces the use of hardware resources. Moreover, when x equals 8, f(x) equals 0.999665, which is already extremely close to 1, so when x is greater than 8 the result is directly assigned the value 1.
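
A hedged sketch of the Fig. 11 evaluation path for the sigmoid, not the RTL itself: the piecewise-linear table is only needed for x in [0, 8); for x < 0 the symmetry f(x) = 1 - f(-x) is applied, for x >= 8 the output saturates to 1, and a bypass mode returns x itself so that the unit can do plain matrix multiplication. The table contents are assumed to be prepared as in the earlier build_table() sketch.

```cpp
#include <vector>
#include <cstddef>

struct SigmoidPwl {
    float step;                   // interval width over [0, 8)
    std::vector<float> a, b;      // coefficients for x in [0, 8)
};

float sigmoid_pwl(const SigmoidPwl& t, float x, bool bypass) {
    if (bypass) return x;                          // matrix-multiplication-only mode
    bool negative = (x < 0.0f);
    float ax = negative ? -x : x;                  // fold onto [0, +inf) via symmetry
    float y;
    if (ax >= 8.0f) {
        y = 1.0f;                                  // saturation: f(8) = 0.999665 ~ 1
    } else {
        std::size_t idx = static_cast<std::size_t>(ax / t.step);
        if (idx >= t.a.size()) idx = t.a.size() - 1;
        y = t.a[idx] * ax + t.b[idx];              // one multiply-add per lookup
    }
    return negative ? 1.0f - y : y;                // f(x) = 1 - f(-x) for x < 0
}
```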

Fig. 12 is the computation flowchart of the forward computation hardware unit with a single DMA and pre-stored weight parameters on the field programmable gate array platform of the embodiment.

First the configuration data are read from the DMA, and the node data are then read according to the configuration information. When node data are read, register group a is filled first and the flag is set to 0; afterwards the value of flag % 2 decides whether the input node data are written into register group a or register group b. Likewise, the data of the register group selected by flag % 2 and the weight data cached in BRAM are multiplied in parallel, summed through the binary adder tree and then accumulated. After the accumulation, depending on the working mode, the result is either processed by the activation function or output directly.
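
A compact sketch, under stated assumptions, of the flag % 2 ping-pong scheme in Fig. 12: while one register group is being consumed by the compute pipeline, the other is being refilled from the DMA stream, and the roles swap for every tile.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t TILE = 16;
using RegGroup = std::array<float, TILE>;

struct PingPong {
    RegGroup group[2];
    unsigned flag = 0;                       // flag % 2 selects the group in use

    RegGroup& fill_target()    { return group[(flag + 1) % 2]; }  // being refilled by DMA
    RegGroup& compute_source() { return group[flag % 2]; }        // being computed on
    void swap()                { ++flag; }                        // after each tile
};
```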

Fig. 13 is a schematic structural diagram 4000 of the forward computation hardware unit with two DMAs reading in parallel on the field programmable gate array platform of the embodiment. This hardware structure is a forward computation module designed for FPGA chips with high bandwidth, and two DMAs reading in parallel are used to guarantee high throughput. Taking a tile size of 16 as an example, the structure includes:

The neuron data read module 4100 is equipped with a DMA and a FIFO buffer with a data width of 512 bits and is responsible for reading the input neuron node data; 16 32-bit single-precision floating-point values are obtained from each word by shift operations. Because the transfer width of the data is 512 bits, the data must be address-aligned in the host memory. Furthermore, when the number of input neurons is not divisible by 16, the neuron node data matrix must be zero-padded on the host side: 16 - Input_size % 16 zeros are appended at the end of every row, where Input_size is the number of input neurons; no padding is needed when Input_size % 16 equals 0. Each datum here is reused Output_size times, where Output_size is the number of output neurons.

The weight parameter data read module 4200 is equipped with a DMA and a FIFO buffer with a data width of 512 bits and is responsible for reading the weight parameter data; 16 32-bit single-precision floating-point values are obtained from each word by shift operations. Again, because the transfer width of the data is 512 bits, the data must be address-aligned in the host memory. When the number of input neurons is not divisible by 16, the weight parameter data matrix must be zero-padded on the host side, 16 - Input_size % 16 zeros being appended at the end of every column; likewise no padding is needed when Input_size % 16 equals 0. After the padding, since a DMA transfer needs contiguous physical addresses, the storage locations of the weight parameter matrix data need to be rearranged to suit the DMA transfer.
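
A short host-side helper sketched from the padding rule just described (the names are illustrative): the number of zeros appended to each node-matrix row or weight-matrix column is (16 - Input_size % 16) % 16, so a padded row length is always a whole number of tiles, i.e. a whole number of 512-bit words.

```cpp
#include <cstddef>

constexpr std::size_t TILE = 16;

inline std::size_t padding_for(std::size_t input_size) {
    return (TILE - input_size % TILE) % TILE;      // 0 when already divisible by 16
}

inline std::size_t padded_length(std::size_t input_size) {
    return input_size + padding_for(input_size);   // multiple of 16 floats = one or more 512-bit words
}
```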

The parallel floating-point multiplier 4300 multiplies the weight parameter data and the neuron data in parallel; the floating-point computation is implemented with DSPs, and after pipeline optimization 16 floating-point multiplications can be processed in parallel per clock cycle.

The binary floating-point adder tree 4400 sums the floating-point results produced by the parallel floating-point multiplier 4300. Computing with a binary adder tree in parallel removes the read-write dependency of the accumulation and reduces the time complexity of the accumulation from O(n) to O(log n).

The accumulation computation 4500 is needed because the forward computation processing unit computes tile by tile, so the results produced by the binary floating-point adder tree 4400 must be accumulated, the accumulation being performed cyclically with a period equal to the number of output neurons. This structure is identical to structure 3600 and is therefore not described further.

The activation function computation 4600 realizes the activation function by piecewise linear approximation, the coefficients being cached in on-chip BRAM. This structure is identical to structure 3700 and is therefore not described further.

The data write-back module 4700 is equipped with a DMA and a FIFO buffer with a data width of 32 bits, and is responsible for writing the computation results back to the host memory.

This hardware structure supports parameter configuration and can therefore support neural network computation of different scales. The detailed configuration parameters are:

Data_size: the scale of the input neuron data;

Input_size: the number of input neurons;

Output_size: the number of output neurons;

Work_mode: 0 means only matrix multiplication is performed; 1 means matrix multiplication and activation function computation are performed.

Fig. 14 is the computation flowchart of the forward computation hardware unit with two DMAs reading in parallel on the field programmable gate array platform of the embodiment.

First the configuration information is read from the node DMA, and the computation unit is configured with the scale of the node data and the weight data and with the working mode. Then 512 bits of data are read from the node DMA and the weight DMA respectively, and 16 neuron node values and 16 weight parameter values are obtained by parallel shifting; since the accelerator reuses the node data, node data are read once every Output_size clock cycles, while weight parameter data are read every clock cycle. After the data have been read, the 16 parallel multiplications and the 16-input binary adder tree summation are carried out in turn. The summation results are added cyclically into the designated BRAM storage locations, and it is judged whether the accumulation has ended. After the accumulation ends, depending on the working mode, the result is either output directly or processed by the piecewise approximation of the activation function.

Fig. 15 is a schematic diagram 5000 of the hardware structure of the weight update hardware computation unit on the field programmable gate array platform of the embodiment. Two DMAs read in parallel to guarantee the high throughput of the vector computation. The structure includes:

The vector A data read module 5100 is equipped with a DMA and a FIFO buffer with a width of 32 bits; it is also responsible for reading the configuration parameters.

The vector B data read module 5200 is equipped with a DMA and a FIFO buffer with a width of 32 bits.

The computation module 5300 performs the vector computation selected by the configuration information: when the working mode is 0 it computes a*A + b*B; when the working mode is 1 it computes (a*A + b*B)*B*(1 - B), where a and b are configuration parameters and A and B are the two vectors that have been read in.

The result write-back module 5400 is equipped with a DMA and a FIFO buffer with a width of 32 bits, and writes the computation results back to the host memory.

This hardware structure supports parameter configuration and can therefore support vector computation of different scales. The detailed configuration parameters are:

Data_size: the scale of the input vector data;

a: a coefficient value needed for the computation;

b: a coefficient value needed for the computation;

Work_mode: 0 means a*A + b*B is computed; 1 means (a*A + b*B)*B*(1 - B) is computed.
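
A minimal software sketch of the two working modes of the weight update unit, under stated assumptions rather than as the RTL: mode 0 computes a*A + b*B element-wise, and mode 1 computes (a*A + b*B)*B*(1 - B); the extra factor B*(1 - B) equals the derivative of the sigmoid, consistent with the use of this mode for output-layer error computation described in the text.

```cpp
#include <vector>
#include <cstddef>

std::vector<float> weight_update(const std::vector<float>& A,
                                 const std::vector<float>& B,
                                 float a, float b, int work_mode) {
    std::vector<float> out(A.size());
    for (std::size_t i = 0; i < A.size(); ++i) {
        float v = a * A[i] + b * B[i];                         // common multiply-add part
        out[i] = (work_mode == 0) ? v
                                  : v * B[i] * (1.0f - B[i]);  // mode 1 adds the B*(1-B) factor
    }
    return out;                                                // written back to host memory by DMA
}
```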

Fig. 16 is the computation flowchart of the weight update hardware computation unit on the field programmable gate array platform of the embodiment.

First the configuration information is read from DMA A; then, according to the configuration information Data_size, the vector values are read from DMA A and DMA B respectively, multiplied in parallel by the configuration parameters a and b and summed; finally, depending on the working mode, the result is optionally multiplied by B*(1 - B), and the result is written back to the host memory through DMA A.

Fig. 17 is a schematic diagram of a possible application scenario and framework of the deep learning accelerator on the heterogeneous multi-core reconfigurable computing platform of the embodiment.

The composition of the application system here is given only as an illustration, and the invention is not limited to it. When a user sends an application request to the system, the control node of the application system assigns the request through a scheduler to the corresponding computing node. The computing node then offloads the acceleration task to the FPGA according to the concrete application request.

The overall framework of each computing node consists of a hardware layer, a driver layer, a library layer, a service layer and an application layer. The hardware layer consists of the FPGA, the memory and the host CPU; the CPU, as the controller of the system, controls the running state and the data reading of each hardware processing unit inside the FPGA (called a DL Module in the figure), including the forward computation unit and the weight update unit. The weight parameter data and neuron data required by the system computation are all stored in the memory and are transferred between the memory and the hardware processing units by DMA. The driver layer is the hardware driver written for the hardware platform and the operating system; the library layer is the application programming interface (API) encapsulated on top of the driver; the service layer provides the deep learning computation acceleration services requested by users; and the application layer refers to the concrete applications of the deep learning prediction and training algorithms, for example using a convolutional neural network prediction algorithm for picture classification.

Those of ordinary skill in the art will appreciate that the methods and hardware structures described in connection with the embodiments disclosed herein can be implemented with a combination of an FPGA and a CPU. The concrete number and types of IP cores instantiated inside the FPGA depend on the concrete application and on the resource limits of the FPGA chip. Skilled persons may use different approaches or different degrees of parallelism to realize the described functions for each particular application or particular FPGA chip, but such implementations should not be considered to go beyond the scope of the present invention.

In the several embodiments provided in this application, it should be understood that the disclosed methods and hardware structures may also be realized in other ways. For example, the deep learning applications described above, deep neural networks and convolutional neural networks, are only illustrative. Likewise, the tile size and the parallel granularity in the forward computation unit are illustrative and can be adjusted according to the concrete situation. The use of the AXI bus protocol for the data transfer between the field programmable gate array and the general-purpose processor is also only illustrative.

The above embodiments are only intended to illustrate the technical concept and features of the present invention, and their purpose is to enable persons familiar with the art to understand the content of the present invention and to implement it accordingly; they cannot limit the protection scope of the present invention. All equivalent transformations or modifications made according to the spirit of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for accelerating a deep learning algorithm on a field programmable gate array platform, characterized in that the field programmable gate array platform comprises a general-purpose processor, a field programmable gate array and a memory module, and the method comprises the following steps:
S01: according to the deep learning prediction process and training process, and in combination with deep neural networks and convolutional neural networks, determining the general-purpose computation parts suitable for running on the field programmable gate array platform;
S02: according to the identified general-purpose computation parts, determining a software-hardware co-design scheme;
S03: according to the computational logic resources and bandwidth of the FPGA, determining the number and types of IP cores to be instantiated, and performing the acceleration on the field programmable gate array platform by means of hardware computation units.
2. The method for accelerating a deep learning algorithm on a field programmable gate array platform according to claim 1, characterized in that the general-purpose computation parts comprise a forward computation module, used for matrix multiplication and activation function computation, and a weight update module, used for vector computation.
3. The method for accelerating a deep learning algorithm on a field programmable gate array platform according to claim 1, characterized in that step S02 comprises the following steps:
performing data preparation on the software side;
converting the convolution computation of the convolutional layers in the convolutional neural network into matrix multiplication;
using direct memory access as the data path for the software-hardware co-computation.
4. The method for accelerating a deep learning algorithm on a field programmable gate array platform according to claim 1, characterized in that determining the number and types of IP cores to be instantiated in step S03 comprises: determining the types of computation units to be instantiated on the FPGA according to the hardware tasks to be executed; and determining the number of processing units for the hardware tasks according to the FPGA hardware logic resources and bandwidth.
5. The method for accelerating a deep learning algorithm on a field programmable gate array platform according to claim 2, characterized in that the forward computation module adopts a tiled design: each row of the node matrix is divided internally into tiles of a given tile size, and each column of the weight parameter matrix is divided into tiles of the same size; each tile of a node matrix row is dot-multiplied with the corresponding tile of a weight parameter matrix column, and after a whole row has been processed the intermediate values are accumulated to obtain the final result.
6. The method for accelerating a deep learning algorithm on a field programmable gate array platform according to claim 5, characterized in that the tile size is a power of 2 and is kept consistent with the parallel granularity of the computation unit.
7. An FPGA structure for accelerating a deep learning algorithm, characterized by comprising:
a tile processing structure, which divides the node data matrix and the weight parameter matrix of the forward computation module into tiles and time-multiplexes the hardware logic;
an activation function piecewise linear approximation structure, used to realize an arbitrary activation function;
a parameter configuration module, used to configure the parameters of the processing units;
a forward computation module, comprising a forward computation hardware structure with a single DMA caching the weights and a forward computation hardware structure with two DMAs reading in parallel, used for the forward computation of deep neural networks, the forward computation of the convolutional layers and classification layers of convolutional neural networks, and matrix multiplication, and pipelined for maximum throughput;
a weight update module, used for vector computation.
8. The FPGA structure for accelerating a deep learning algorithm according to claim 7, characterized in that the parameter configuration module configures the processing units by transferring configuration parameter data via DMA, including: the working mode configuration and data scale configuration of the forward computation module, the data scale configuration comprising the node data scale configuration, the input neuron scale configuration and the output neuron scale configuration; and the data scale configuration, working mode configuration and computation parameter configuration of the weight update module.
9. The FPGA structure for accelerating a deep learning algorithm according to claim 7, characterized in that the forward computation hardware structure with a single DMA caching the weights comprises:
a single DMA, responsible for reading data and writing data back;
a double register buffer, whose two halves alternately read data and take part in the parallel computation; a BRAM group, which caches data and guarantees parallel data reads;
floating-point multipliers equal in number to the tile size;
a binary adder tree whose number of inputs equals the tile size;
a cyclic accumulator, which accumulates intermediate values and stores them in on-chip BRAM;
an activation function computation module, which realizes the activation function by piecewise linear approximation, the coefficients being cached in on-chip BRAM;
and in that the forward computation hardware structure with two DMAs reading in parallel comprises:
a neuron data read module, equipped with a DMA and a FIFO buffer, responsible for reading the input neuron node data;
a weight parameter data read module, equipped with a DMA and a FIFO buffer, responsible for reading the weight parameter data;
floating-point multipliers equal in number to the tile size;
a binary adder tree whose number of inputs equals the tile size;
a cyclic accumulator, which accumulates intermediate values and stores them in on-chip BRAM;
an activation function computation module, which realizes the activation function by piecewise linear approximation, the coefficients being cached in on-chip BRAM.
10. The FPGA structure for accelerating a deep learning algorithm according to claim 7, characterized in that the weight update module is used for weight update computation and output-layer error computation and is pipelined for maximum throughput, comprising: a vector A data read module and a vector B data read module, each equipped with a DMA and a FIFO buffer, which read the two groups of vector values used in the computation; a computation module, which performs the vector computation specified by the configuration information; and a result write-back module, equipped with a DMA and a FIFO buffer, which writes the computation results back to the host memory.
CN201610596159.3A 2016-07-27 2016-07-27 Method and system for accelerating a deep learning algorithm on a field programmable gate array platform CN106228238B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610596159.3A CN106228238B (en) 2016-07-27 2016-07-27 Method and system for accelerating a deep learning algorithm on a field programmable gate array platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610596159.3A CN106228238B (en) 2016-07-27 2016-07-27 Method and system for accelerating a deep learning algorithm on a field programmable gate array platform

Publications (2)

Publication Number Publication Date
CN106228238A true CN106228238A (en) 2016-12-14
CN106228238B CN106228238B (en) 2019-03-22

Family

ID=57534278

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610596159.3A CN106228238B (en) 2016-07-27 2016-07-27 Method and system for accelerating a deep learning algorithm on a field programmable gate array platform

Country Status (1)

Country Link
CN (1) CN106228238B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050237232A1 (en) * 2004-04-23 2005-10-27 Yokogawa Electric Corporation Transmitter and a method for duplicating same
US20140289445A1 (en) * 2013-03-22 2014-09-25 Antony Savich Hardware accelerator system and method
CN104112053A (en) * 2014-07-29 2014-10-22 中国航天科工集团第三研究院第八三五七研究所 Design method of a reconfigurable architecture platform oriented to image processing
CN104915322A (en) * 2015-06-09 2015-09-16 中国人民解放军国防科学技术大学 Method for accelerating convolutional neural network hardware and AXI bus IP core thereof
CN105162475A (en) * 2015-08-19 2015-12-16 中国人民解放军海军工程大学 FPGA (Field Programmable Gate Array) based parameterized multi-standard decoder with high throughput rate
CN105447285A (en) * 2016-01-20 2016-03-30 杭州菲数科技有限公司 Method for improving OpenCL hardware execution efficiency

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
QI YU et al.: "A Deep Learning prediction process accelerator based FPGA", 《IEEE》 *
TIANSHI CHEN et al.: "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning", 《ACM》 *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268945A (en) * 2016-12-31 2018-07-10 上海兆芯集成电路有限公司 Neural network unit with a rotator segmentable by array width
CN108268945B (en) * 2016-12-31 2020-09-11 上海兆芯集成电路有限公司 Neural network unit and operation method thereof
CN108629405A (en) * 2017-03-22 2018-10-09 杭州海康威视数字技术股份有限公司 Method and apparatus for improving the computational efficiency of convolutional neural networks
CN107145944A (en) * 2017-03-29 2017-09-08 浙江大学 Genetic algorithm and system based on efficient FPGA training
CN109359736A (en) * 2017-04-06 2019-02-19 上海寒武纪信息科技有限公司 Network processing unit and network operation method
CN108734288A (en) * 2017-04-21 2018-11-02 上海寒武纪信息科技有限公司 Operation method and device
CN107392308B (en) * 2017-06-20 2020-04-03 中国科学院计算技术研究所 Convolutional neural network acceleration method and system based on a programmable device
CN107392308A (en) * 2017-06-20 2017-11-24 中国科学院计算技术研究所 Convolutional neural network acceleration method and system based on a programmable device
CN107423030A (en) * 2017-07-28 2017-12-01 郑州云海信息技术有限公司 Markov Monte Carlo algorithm acceleration method based on FPGA heterogeneous platforms
CN107480782A (en) * 2017-08-14 2017-12-15 电子科技大学 On-chip learning neural network processor
CN107506173A (en) * 2017-08-30 2017-12-22 郑州云海信息技术有限公司 Acceleration method, apparatus and system for singular value decomposition operations
CN107392309A (en) * 2017-09-11 2017-11-24 东南大学—无锡集成电路技术研究所 FPGA-based general fixed-point neural network convolution accelerator hardware architecture
CN107657581A (en) * 2017-09-28 2018-02-02 中国人民解放军国防科技大学 Convolutional neural network (CNN) hardware accelerator and acceleration method
CN109726809A (en) * 2017-10-30 2019-05-07 北京深鉴智能科技有限公司 Hardware circuit implementation of a deep learning softmax classifier and control method thereof
CN108090496A (en) * 2017-12-22 2018-05-29 银河水滴科技(北京)有限公司 Image processing method and apparatus based on convolutional neural networks
CN108231086A (en) * 2017-12-24 2018-06-29 航天恒星科技有限公司 FPGA-based deep learning speech enhancer and method
CN109993287A (en) * 2017-12-29 2019-07-09 北京中科寒武纪科技有限公司 Neural network processing method, computer system and storage medium
CN108280514A (en) * 2018-01-05 2018-07-13 中国科学技术大学 FPGA-based sparse neural network acceleration system and design method
WO2019136755A1 (en) * 2018-01-15 2019-07-18 深圳鲲云信息科技有限公司 Method and system for optimizing design model of artificial intelligence processing device, storage medium, and terminal
WO2019165989A1 (en) * 2018-03-01 2019-09-06 华为技术有限公司 Data processing circuit for use in neural network
CN108520297B (en) * 2018-04-02 2020-09-04 周军 Programmable deep neural network processor
CN108520297A (en) * 2018-04-02 2018-09-11 周军 Programmable deep neural network processor
CN108920413A (en) * 2018-06-28 2018-11-30 中国人民解放军国防科技大学 GPDSP-oriented convolutional neural network core parallel computation method
CN109359732B (en) * 2018-09-30 2020-06-09 阿里巴巴集团控股有限公司 Chip and data processing method based on chip
CN109359732A (en) * 2018-09-30 2019-02-19 阿里巴巴集团控股有限公司 Chip and chip-based data processing method
TWI696961B (en) * 2018-12-12 2020-06-21 財團法人工業技術研究院 Deep neural networks (DNN) hardware accelerator and operation method thereof

Also Published As

Publication number Publication date
CN106228238B (en) 2019-03-22

Similar Documents

Publication Publication Date Title
EP3129870B1 (en) Data parallel processing method and apparatus based on multiple graphics processing units
Alwani et al. Fused-layer CNN accelerators
US20190332945A1 (en) Apparatus and method for compression coding for artificial neural network
US9529590B2 (en) Processor for large graph algorithm computations and matrix operations
US10083395B2 (en) Batch processing in a neural network processor
WO2017185389A1 (en) Device and method for use in executing matrix multiplication operations
US20200110983A1 (en) Apparatus and methods for forward propagation in convolutional neural networks
WO2018171717A1 (en) Automated design method and system for neural network processor
KR20190022627A (en) Convolutional neural network on programmable two-dimensional image processor
US8468109B2 (en) Architecture, system and method for artificial neural network implementation
Agullo et al. A hybridization methodology for high-performance linear algebra software for GPUs
Ma et al. Optimizing the convolution operation to accelerate deep neural networks on FPGA
KR20190062481A (en) Efficient data layouts for convolutional neural networks
KR20190010642A (en) Accelerator for deep layer neural network
CN1947156B (en) Graphics processing architecture employing a unified shader
CN107239824A (en) Apparatus and method for implementing a sparse convolutional neural network accelerator
US7574466B2 (en) Method for finding global extrema of a set of shorts distributed across an array of parallel processing elements
CN106529668A (en) Operation device and method of an acceleration chip for accelerating a deep neural network algorithm
EP3451241A1 (en) Device and method for performing training of convolutional neural network
CN107679620B (en) Artificial neural network processing device
JP6348561B2 (en) System and method for multi-core optimized recurrent neural networks
US20190179674A1 (en) Systems and methods for data management
Yu et al. A deep learning prediction process accelerator based FPGA
CN109284825B (en) Apparatus and method for performing LSTM operations
CN109726806A (en) Information processing method and terminal device

Legal Events

Date Code Title Description
PB01 Publication
C06 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant