CN109063825A - Convolutional neural networks accelerator - Google Patents
- Publication number
- Publication number: CN109063825A; Application number: CN201810865157.9A
- Authority
- CN
- China
- Prior art keywords
- convolutional layer
- group
- convolution
- point
- floating point
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Complex Calculations (AREA)
Abstract
This disclosure relates to a convolutional neural networks accelerator. The input feature map group and the convolution kernels are each converted to block floating-point format, so that block floating-point arithmetic replaces conventional floating-point arithmetic and both the inputs and the outputs of the convolution computation are in fixed-point format, which solves the problem that floating-point operations are costly on an FPGA. The float-to-block-float and block-float-to-float conversions require only shift operations, and rounding is applied during conversion to avoid error accumulation. Data transmission and convolution computation need only the mantissa part of the block floating-point data, and the data bit width can be expanded during computation, so no bits are truncated. The disclosure therefore preserves the accuracy of the convolutional neural network model and effectively prevents the model parameters from drifting during forward propagation; as a result, no retraining of the model is required during forward inference.
Description
Technical field
This disclosure relates to the field of neural network technology, and in particular to a convolutional neural networks accelerator.
Background technique
Convolutional neural networks have achieved outstanding performance in artificial-intelligence applications, especially image recognition, natural language processing, and strategy deployment. Their success stems mainly from improvements in computing-device performance. However, as the number of network layers increases, the weight data of a convolutional neural network can reach hundreds of megabytes or even more than a gigabyte, and the computing resources consumed by forward feature extraction, classification, and error propagation are enormous. Accelerating convolutional neural networks is therefore the key to improving the computational efficiency of convolutional neural network models.
In the related art, an FPGA (Field-Programmable Gate Array) is a programmable logic gate array with outstanding parallel computing capability. A specially designed FPGA has low power consumption, high speed, and reconfigurability; moreover, an FPGA runs without an operating system and can be dedicated to a single deterministic task, which reduces the possibility of failures. Convolutional neural network acceleration schemes based on FPGA platforms have therefore become a popular direction of development.
However, deploying convolutional neural networks on an FPGA platform faces two major obstacles: the off-chip transmission bottleneck and the huge cost of floating-point operations. Off-chip transmission arises mainly from frequent accesses to network parameters and feature maps, and the growing number of network layers increases the bandwidth demand on the FPGA. FPGAs also lack floating-point units, so performing floating-point arithmetic on an FPGA degrades both throughput and power efficiency.
Many methods, such as data reuse, compression, and pruning, have been proposed to meet the bandwidth demand of FPGAs. However, these methods require retraining or entropy coding of the network, which can consume more time than the original network and hinders real-time processing of convolutional neural networks.
Fixed-point arithmetic is commonly used in place of floating-point arithmetic to improve the computational performance of FPGAs. However, a common drawback of these methods is that retraining is needed to update the parameters. Retraining is a very resource-intensive process and, when applied to deep network models, may require even more hardware resources.
Summary of the invention
In view of this, to solve the above problems, the present disclosure proposes a convolutional neural networks accelerator that improves throughput and power efficiency.
According to one aspect of the disclosure, a convolutional neural networks accelerator is provided, comprising:
a float-to-block-float converter, which converts the first input feature map group and the first convolution kernel group of a convolutional layer to generate a second input feature map group and a second convolution kernel group, wherein the data in the second input feature map group and the second convolution kernel group are block floating-point data;
a shifter, which converts the first bias set of the convolutional layer into a second bias set according to the block exponents of the data in the second input feature map group and the second convolution kernel group, wherein the data in the second bias set are fixed-point data;
a convolutional-layer accelerator, which performs convolution multiply-add operations according to the second input feature map group, the second convolution kernel group, and the second bias set to obtain the block floating-point output result of the convolutional layer; and
a block-float-to-float converter, which converts the block floating-point output result of the convolutional layer to obtain the floating-point output result of the convolutional layer as the output feature map of the convolutional layer.
In one possible implementation, the convolutional-layer accelerator includes multiple processing engines, and performing the convolution multiply-add operations according to the second input feature map group, the second convolution kernel group, and the second bias set to obtain the block floating-point output result of the convolutional layer comprises:
each processing engine obtains its corresponding convolution kernels from the second convolution kernel group;
each processing engine obtains its corresponding second input feature map from the second input feature map group;
the processing engines simultaneously perform convolution operations according to their corresponding second input feature maps and convolution kernels, obtaining multiple convolution results; and
the convolutional-layer accelerator accumulates the multiple convolution results and applies an activation operation to obtain the block floating-point output result of the convolutional layer.
In one possible implementation, each processing engine includes multiple processing units, and each processing engine obtaining its corresponding convolution kernels from the second convolution kernel group comprises: each processing unit in the processing engine obtains its own corresponding convolution kernel.
In one possible implementation, the convolution operation performed by a processing engine includes multiple convolution operations performed by the processing units within the engine. Each time the processing units in an engine perform convolution operations simultaneously, they share the pixels that the engine obtains through a convolution window from its corresponding second input feature map; the position at which the convolution window obtains pixels from the second input feature map differs for each convolution operation.
In one possible implementation, the processing units in a processing engine repeatedly and simultaneously perform the following convolution operation to obtain each unit's convolution result: the processing unit performs a convolution operation on the pixels obtained that time and the convolution kernel corresponding to the unit, obtaining the convolution result of that unit.
In one possible implementation, the pixels used in one convolution operation include a first pixel group and a second pixel group obtained by the convolution window in two fetches, and the processing unit includes a multiplier, a first accumulator, a second accumulator, a first register connected to the first accumulator, and a second register connected to the second accumulator. The processing unit performing a convolution operation on the pixels and its corresponding convolution kernel comprises:
each time, the multiplier takes a first pixel from the first pixel group and a second pixel from the second pixel group, composes them into a third pixel group, and multiplies the third pixel group by the weight of the convolution kernel corresponding to the first and second pixels, obtaining a product; the first pixel and the second pixel are each M bits wide, M being a positive integer, and the third pixel group is formed by the first pixel, M vacant bits, and the second pixel in sequence;
the first accumulator accumulates the upper 2M bits of the product, obtaining the first accumulation result corresponding to the first pixel group, the first register storing each first accumulation result obtained by the first accumulator;
the second accumulator accumulates the lower 2M bits of the product, obtaining the second accumulation result corresponding to the second pixel group, the second register storing each second accumulation result obtained by the second accumulator; and
the first accumulation result and the second accumulation result form the convolution result of the processing unit.
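The packed-multiplier idea in this implementation — one physical multiplier producing two products by separating two M-bit pixels with M guard bits — can be sketched in software as follows. This is a simplified illustration with unsigned operands; handling signed mantissas requires correction logic not shown here, and M = 8 is an assumed width, not taken from the disclosure.

```python
M = 8  # assumed pixel/weight bit width for illustration

def packed_multiply(p1, p2, w):
    # "Third pixel group": first pixel, M vacant (zero) bits, second pixel
    packed = (p1 << (2 * M)) | p2
    product = packed * w                    # a single multiplication
    low = product & ((1 << (2 * M)) - 1)    # lower 2M bits  -> p2 * w
    high = product >> (2 * M)               # remaining bits -> p1 * w
    return high, low
```

The M guard bits ensure p2 * w (at most 2M bits wide) cannot carry into the bits holding p1 * w, so the two products separate cleanly.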
In one possible implementation, the convolutional-layer accelerator further includes multiple third accumulators and an activation module corresponding to each third accumulator, each third accumulator being connected to one processing unit in each of the multiple processing engines. Accumulating the multiple convolution results and applying the activation operation to obtain the block floating-point output result of the convolutional layer comprises:
each third accumulator accumulates the convolution results that different processing units obtained with the convolution kernels of the same output channel, obtains a third accumulation result, and outputs it to the corresponding activation module; and
each activation module applies the activation operation to the third accumulation result obtained by its corresponding third accumulator, obtaining the block floating-point output result of the convolutional layer.
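A behavioral sketch of this accumulate-then-activate stage. ReLU is assumed as the activation for illustration; the disclosure does not specify one.

```python
def third_accumulate(per_engine_results):
    # Sum the partial results that processing units in different engines
    # produced for the SAME output channel (one partial per input-channel group).
    return sum(per_engine_results)

def activate(x):
    # Activation function; ReLU is an assumption, not fixed by the disclosure.
    return max(x, 0)
```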
In one possible implementation, the convolutional neural networks accelerator further includes a memory module. The memory module includes a first memory comprising a first partition, a second partition, a third partition, and a fourth partition: the first partition stores the first input feature map group of the first convolutional layer; the second partition stores the first convolution kernel groups and first bias sets of the odd-numbered convolutional layers; the third partition stores the output feature maps of the even-numbered convolutional layers other than the last layer; and the fourth partition stores the output vector of the fully connected layer.
In one possible implementation, the memory module includes a second memory comprising a fifth partition and a sixth partition: the fifth partition stores the first convolution kernel groups and first bias sets of the even-numbered convolutional layers, and the sixth partition stores the output feature maps of the odd-numbered convolutional layers other than the last layer.
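The odd/even ping-pong between the two memories described above can be summarized with a small sketch. Partition names are illustrative labels, layers are 1-indexed, and "M1"/"M0" follow the first/second memory names used later in the description.

```python
def kernel_partition(layer):
    # Odd layers' kernels and biases: first memory, partition 2;
    # even layers': second memory, partition 5.
    return "M1.partition2" if layer % 2 == 1 else "M0.partition5"

def output_partition(layer, num_layers):
    if layer == num_layers:
        return None  # the last layer's output goes on to the fully connected stage
    # Even layers' outputs: first memory, partition 3; odd layers': second
    # memory, partition 6.
    return "M1.partition3" if layer % 2 == 0 else "M0.partition6"
```

Alternating partitions means a layer's parameter reads and its feature-map writes ping-pong between the two memories, so consecutive layers can stream without contending for the same region.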
In one possible implementation, the convolutional neural networks accelerator further includes:
a convolutional-layer input buffer, connected to the float-to-block-float converter, the block-float-to-float converter, and the convolutional-layer accelerator, which stores the second input feature map group, second convolution kernel group, and second bias set of a convolutional layer and sends them to the convolutional-layer accelerator;
a convolutional-layer output buffer, connected to the block-float-to-float converter, the convolutional-layer accelerator, and the fully-connected-layer input buffer, which stores the block floating-point output results, sends the block floating-point output results of all but the last convolutional layer to the block-float-to-float converter, and sends the block floating-point output result of the last convolutional layer to the fully-connected-layer input buffer;
a fully-connected-layer input buffer, connected to the fully-connected-layer accelerator, which receives and stores the block floating-point output result of the last convolutional layer and sends it to the fully-connected-layer accelerator;
a fully-connected-layer accelerator, connected to the fully-connected-layer output buffer, which performs the fully connected operation on the block floating-point output result of the last convolutional layer, obtains the final block floating-point result, and sends it to the fully-connected-layer output buffer; and
a fully-connected-layer output buffer, connected to the block-float-to-float converter, which sends the final block floating-point result to the block-float-to-float converter so that it is converted into the final floating-point result.
Advantageous effects:
The disclosure converts the input feature map group and the convolution kernels into block floating-point format via float-to-block-float conversion, replacing conventional floating-point arithmetic with block floating-point arithmetic. Both the inputs and the outputs of the convolution computation are in fixed-point format, which cleverly avoids the FPGA's lack of floating-point units, solves the problem of costly floating-point operations on FPGAs, significantly reduces the power consumption of deploying a convolutional neural networks accelerator on an FPGA platform, and improves throughput.
The float-to-block-float and block-float-to-float conversions require only shift operations, and rounding is applied during conversion to avoid error accumulation. Data transmission and convolution computation need only the mantissa part of the block floating-point data, and the data bit width can be expanded during computation, so no bits are truncated. The disclosure therefore preserves the accuracy of the convolutional neural network model and effectively prevents the model parameters from drifting during forward propagation; as a result, no retraining of the model is required during forward inference, and different convolutional neural network models can be deployed on the accelerator of the disclosure by adjusting the parameter configuration.
Other features and aspects of the disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Detailed description of the invention
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the specification, serve to explain the principles of the disclosure.
Fig. 1 is a block diagram of a convolutional neural networks accelerator according to an exemplary embodiment.
Fig. 2 shows a schematic diagram of a convolutional-layer accelerator according to an example of the disclosure.
Fig. 3 shows a schematic diagram of a processing unit according to an example of the disclosure.
Fig. 4 shows a schematic diagram of a data format according to an example of the disclosure.
Fig. 5 shows a flowchart of a convolutional neural network acceleration method based on block floating-point arithmetic according to an embodiment of the disclosure.
Fig. 6 shows the data flow of a single output channel of the convolutional neural networks accelerator according to an embodiment of the disclosure.
Specific embodiment
Various exemplary embodiments, features, and aspects of the disclosure are described in detail below with reference to the accompanying drawings. Identical reference numerals in the drawings denote elements with identical or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless specifically noted.
The word "exemplary" here means "serving as an example, embodiment, or illustration". Any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are given in the following detailed description to better illustrate the disclosure. Those skilled in the art will appreciate that the disclosure can be practiced without certain of these details. In some instances, devices, means, elements, and circuits well known to those skilled in the art are not described in detail, so as to highlight the gist of the disclosure.
Fig. 1 is a block diagram of a convolutional neural networks accelerator according to an exemplary embodiment. The convolutional neural networks accelerator of the disclosure can be applied to all kinds of FPGAs or ASICs (Application-Specific Integrated Circuits), without limitation here. As shown in Fig. 1, the convolutional neural networks accelerator may include: a float-to-block-float converter, a shifter, a convolutional-layer accelerator, and a block-float-to-float converter.
The float-to-block-float converter converts the first input feature map group and the first convolution kernel group of a convolutional layer to generate a second input feature map group and a second convolution kernel group; the data in the second input feature map group and the second convolution kernel group are block floating-point data.
The shifter converts the first bias set of the convolutional layer into a second bias set according to the block exponents of the data in the second input feature map group and the second convolution kernel group; the data in the second bias set are fixed-point data.
The data contained in the first input feature map group and the first convolution kernel group may be floating-point numbers.
A floating-point number is a computer approximation of an arbitrary real number; the representation resembles base-10 scientific notation. A real number is represented by an integer or fixed-point number (the mantissa) multiplied by an integral power of some radix (usually 2 in computers). A normalized floating-point number has the form

±m × β^e

where m is the mantissa (if the precision of m is p, then m is a p-digit number of the form ±d.ddd...ddd, with 0 ≤ d < β), β is the radix, and e is the exponent.
Fixed-point data is another number representation used in computers, in which the position of the radix point of the number taking part in the operation is fixed. For example, in the Q format Qm.n, m bits represent the integer part and n bits represent the fractional part, so m + n + 1 bits are needed in total, the extra bit being the sign bit.
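The Qm.n convention can be sketched as follows (an illustrative encoding helper, not part of the disclosed hardware):

```python
def to_q(x, m, n):
    # Encode x in Qm.n: m integer bits, n fractional bits, one sign bit
    # (m + n + 1 bits in total), storing round(x * 2**n) as an integer.
    q = int(round(x * (1 << n)))
    lo, hi = -(1 << (m + n)), (1 << (m + n)) - 1
    if not lo <= q <= hi:
        raise OverflowError(f"{x} does not fit in Q{m}.{n}")
    return q

def from_q(q, n):
    # Decode a Qm.n integer back to a real value.
    return q / (1 << n)
```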
A block floating-point algorithm imitates floating-point arithmetic in software on a fixed-point DSP, in order to obtain higher computational accuracy and a larger dynamic range.
In block floating-point format, a whole data block shares a single exponent. For example, assume X is a block containing N floating-point numbers; X can be written as

X = {x_1, x_2, ..., x_N}, with x_i = m_i × 2^(e_i)

where x_i is the i-th element of X, and m_i and e_i are the mantissa and exponent of x_i. The largest exponent in the block is defined as the block exponent:

ε_X = max_i (e_i)

After deriving the block exponent, the mantissa of each x_i is right-shifted by d_i bits, where d_i = ε_X − e_i. The block floating-point format X′ of X is therefore represented as

X′ = {m′_1, m′_2, ..., m′_N} × 2^(ε_X)

where m′_i = m_i >> d_i is the mantissa after conversion.
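The conversion above can be sketched in software as follows. This is a simplified illustration: the actual converter works on IEEE-754 bit fields with pure shifts, and the 8-bit mantissa width is an assumption, not taken from the disclosure.

```python
import math

def float_to_bfp(block, mantissa_bits=8):
    # Per-element exponent e_i such that x = f * 2**e_i with |f| in [0.5, 1)
    exps = [math.frexp(x)[1] for x in block]
    block_exp = max(exps)                       # shared block exponent
    mantissas = []
    for x in block:
        f = x / 2.0 ** block_exp                # align to the block exponent
        # round-to-nearest when quantizing, to limit accumulated error
        mantissas.append(int(round(f * (1 << (mantissa_bits - 1)))))
    return block_exp, mantissas

def bfp_to_float(block_exp, mantissas, mantissa_bits=8):
    scale = 2.0 ** block_exp / (1 << (mantissa_bits - 1))
    return [m * scale for m in mantissas]
```

Elements much smaller than the block maximum lose low-order bits when their mantissas are right-shifted; that is the price of sharing one exponent across the block.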
The first input feature map group may include the first input feature maps of one or more input channels, where a first input feature map is an input feature map obtained by performing feature extraction on the data to be processed. For example, if the data to be processed is a color image, red-channel, green-channel, and blue-channel feature extraction can be performed on the color image, yielding the input feature maps of the red, green, and blue channels, respectively.
Generally, in image processing, for a given input image, each pixel value of the output image is a weighted average of the pixel values in a small region of the input image, where the weights are defined by a function; this function may be called a convolution kernel. The first convolution kernel group may include one or more first convolution kernels.
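As a concrete illustration of this definition, a plain software sketch of the operation (unrelated to the hardware implementation):

```python
def conv2d(image, kernel):
    # Valid-mode 2D convolution as used in CNNs (i.e. cross-correlation):
    # each output pixel is the weighted sum of the kernel-sized input region.
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + u][j + v] * kernel[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(ow)]
            for i in range(oh)]
```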
The float-to-block-float converter can obtain the first input feature map group and the first convolution kernel group through a memory interface, and convert the data in them into block floating-point data, obtaining the second input feature map group and the second convolution kernel group.
The convolutional-layer accelerator may include multiple processing engines. For example, Fig. 2 shows a schematic diagram of a convolutional-layer accelerator according to an example of the disclosure; taking 16 channels processed at a time as an example, the convolutional-layer accelerator may include 16 processing engines PE1, PE2, ..., PE16.
Each processing engine obtains its corresponding convolution kernels from the second convolution kernel group. Each processing engine may include multiple processing units, one processing unit corresponding to one output channel, and each processing unit in a processing engine obtains its own corresponding convolution kernel.
The pixels of the input feature maps of all input channels form one block and share the same block exponent. All weights of each output channel (the convolution kernels for all input channels) form one block and share the same block exponent. This partitioning method aligns all input data before the convolution computation, so that data transmission and convolution computation need only the mantissa part of the block floating-point data.
As shown in Fig. 2, a processing engine may include 64 processing units; the 64 convolution kernels in the second convolution kernel group correspond to the 64 processing units of each processing engine, that is, each of the 64 processing units in a processing engine performs convolution multiply-add operations with its own corresponding convolution kernel. The exponent of a processing unit equals the sum of the block exponent of the corresponding input feature map and the block exponent of the weights (the convolution kernel). For example, the exponent of PU1_1 equals the block exponent of PE1's input feature map plus the block exponent of the convolution kernel corresponding to PU1_1.
The first bias set may include multiple first biases, one first bias corresponding to each output channel of the convolutional layer. The shifter can shift each first bias according to the difference between the first bias of each output channel and the exponent of the corresponding processing unit, obtaining the second biases in the corresponding fixed-point format, b1, b2, ..., b64 as shown in Fig. 2, and thus the second bias set.
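The bias alignment can be sketched as follows. This is a simplification: the real shifter also accounts for mantissa fraction widths, while here the processing-unit exponent is taken to be just the sum of the two block exponents.

```python
def shift_bias(bias, fm_block_exp, kernel_block_exp):
    # Products inside a processing unit carry the implicit scale
    # 2**(fm_block_exp + kernel_block_exp); rescale the floating-point
    # bias to that fixed-point scale so it can be added directly.
    unit_exp = fm_block_exp + kernel_block_exp
    return int(round(bias / 2.0 ** unit_exp))
```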
The float-to-block-float converter and the shifter can also convert the parameters (convolution kernels and biases) of the fully connected layer through the above process.
The convolutional neural networks accelerator may also include a memory module. The memory module may include a first memory DDR3 M1 and a second memory DDR3 M0, each connected to the memory interface; the capacity of each may be 4 GB. The first input feature map group, the first convolution kernel group, and the first bias set can be stored in the first memory DDR3 M1 and/or the second memory DDR3 M0.
The first memory may include the first, second, third, and fourth partitions, and the second memory may include the fifth and sixth partitions. The convolutional neural networks accelerator may also include a PCIe interface connected to the first memory and the second memory; through the PCIe interface, the data to be processed (for example, the first input feature map group) and the parameters (such as the first bias set and the first convolution kernel group of a convolutional layer) can be written into the first memory and the second memory.
The first partition stores the first input feature map group of the first convolutional layer; for example, the first input feature map group can be written into the first partition of the first memory through the PCIe interface.
The second partition stores the parameters of the odd-numbered convolutional and fully connected layers, which may include convolution kernels and biases; for example, the second partition can store the first convolution kernel groups and first bias sets of the odd-numbered convolutional layers.
The fifth partition stores the parameters of the even-numbered convolutional and fully connected layers, which may include convolution kernels and biases; for example, the fifth partition can store the first convolution kernel groups and first bias sets of the even-numbered convolutional layers.
The third partition stores the output feature maps of the even-numbered convolutional layers other than the last layer; the sixth partition stores the output feature maps of the odd-numbered convolutional layers other than the last layer. The fourth partition stores the output vector of the fully connected layer.
The convolutional neural networks accelerator may also include: a convolutional-layer input buffer, a convolutional-layer output buffer, a fully-connected-layer input buffer, a fully-connected-layer accelerator, and a fully-connected-layer output buffer.
The convolutional-layer input buffer is connected to the float-to-block-float converter, the block-float-to-float converter, and the convolutional-layer accelerator; it can store the second input feature map group, second convolution kernel group, and second bias set of a convolutional layer and send them to the convolutional-layer accelerator.
The float-to-block-float converter can read the first input feature map group and first convolution kernel group of a convolutional layer from the first or second memory (according to the layer number) through the memory interface, and convert them to generate the second input feature map group and second convolution kernel group. The shifter can read the first bias set of the convolutional layer from the first or second memory (according to the layer number) through the memory interface, and shift it to obtain the second bias set.
The convolutional-layer input buffer can store the second input feature map group and second convolution kernel group of the convolutional layer after conversion by the float-to-block-float converter. The convolutional-layer input buffer may include a convolution window, two pixel memories, two weight memories, and one block-exponent memory, so that it can store the data read from the float-to-block-float converter.
The convolutional-layer accelerator performs the convolution multiply-add operations according to the second input feature map group, the second convolution kernel group, and the second bias set, obtaining the block floating-point output result of the convolutional layer. It can perform these operations using the second input feature map group and second convolution kernel group stored in the convolutional-layer input buffer and the second bias set output by the shifter.
Specifically, each processing engine obtains its corresponding convolution kernels from the second convolution kernel group; within an engine, each processing unit obtains its own kernel. As described above, each processing engine contains 64 processing units, and each unit fetches its corresponding kernel from the second convolution kernel group.
Each processing engine also obtains its corresponding second input feature map from the second input feature map group. All processing engines then perform convolution operations simultaneously, each on its own second input feature map and kernels, producing multiple convolution results.
In one possible implementation, the processing units within an engine perform the convolution operation multiple times to obtain multiple convolution results. Each time the units of an engine operate in parallel, they share the pixels that the convolution window fetches from the engine's second input feature map. The number of shared pixels is determined by the number of weights in the kernel; for example, if the kernel is a 2 × 2 matrix, 2 × 2 pixels are fetched from the second input feature map, themselves arranged as a matrix within that map.
For each convolution operation the convolution window fetches pixels from a different position in the engine's second input feature map; in one example, the window moves over the map with a stride of 1.
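The convolution window's pixel fetch can be illustrated with a short sketch, using the 2 × 2 window and stride of 1 from the example above (names are illustrative, not from the patent):

```python
def conv_window(feature_map, k=2, stride=1):
    """Yield every k x k pixel patch of a 2-D feature map, moving the
    window by `stride`, as the patches would be shared by all
    processing units of one engine."""
    rows, cols = len(feature_map), len(feature_map[0])
    for r in range(0, rows - k + 1, stride):
        for c in range(0, cols - k + 1, stride):
            yield [[feature_map[r + i][c + j] for j in range(k)]
                   for i in range(k)]
```

On a 3 × 3 map this yields four 2 × 2 patches, one per window position.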
Taking one such convolution operation as an example, the processing units proceed simultaneously: after fetching the shared pixels, each processing unit convolves them with its own kernel and obtains its convolution result.
In this convolution operation, the shared pixels may comprise a first pixel group and a second pixel group fetched by the convolution window in two passes. For example, the window fetches 2 × 2 pixels from the second input feature map twice, with a stride of 1 between the two fetches, yielding the first and second pixel groups. As shown in Fig. 2, i_x(m, n) denotes the pixel of the x-th input channel at position (m, n).
Fig. 3 shows a schematic diagram of a processing unit according to an example of the disclosure. As shown in Fig. 3, the processing unit may include a multiplier, a first accumulator, a second accumulator, a first register connected to the first accumulator, and a second register connected to the second accumulator.
Here k_xy denotes the convolution kernel of the y-th output channel for the x-th input channel.
During the convolution operation, the multiplier combines a first pixel taken from the first pixel group and a second pixel taken from the second pixel group into a third pixel group, and multiplies the third pixel group by the kernel weight corresponding to both pixels, producing a product. The multiplier fetches pixels from the first and second pixel groups in the same manner and quantity.
For example, if the first and second pixels each have M bits (M a positive integer), the third pixel group is formed by the first pixel, M vacant bits, and the second pixel in sequence. Fig. 4 shows a schematic diagram of this data format: A represents the third pixel group, with the first pixel in bit positions 0–7, vacant bits in positions 8–15, and the second pixel in positions 16–23; B is the weight of the processing unit's kernel that corresponds to both pixels.
Multiplying the third pixel group A by the weight B effectively multiplies the first pixel and the second pixel by the weight in a single operation, producing a 4M-bit product. These two multiplications can be carried out by a single DSP48E1 slice, as shown in Fig. 3. In the resulting product P, the first 2M bits are the product of the first pixel and the weight, and the last 2M bits are the product of the second pixel and the weight.
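The packed multiplication can be modeled in software as below. This sketch assumes unsigned M-bit operands and places the first pixel in the low bits, so the low 2M bits of the product belong to the first pixel and the high 2M bits to the second; signed operands would need a correction term, omitted here.

```python
M = 8  # pixel and weight bit width (illustrative choice)

def packed_multiply(p1, p2, w):
    """Compute p1*w and p2*w with one multiplication by packing both
    pixels into a single operand separated by an M-bit zero gap,
    mimicking how one DSP48E1 slice can serve two products."""
    assert all(0 <= v < 2 ** M for v in (p1, p2, w))
    a = (p2 << (2 * M)) | p1        # "third pixel group": p1, gap, p2
    product = a * w                 # one multiply, up to a 4M-bit result
    first = product & ((1 << (2 * M)) - 1)   # low 2M bits: p1 * w
    second = product >> (2 * M)              # high 2M bits: p2 * w
    return first, second
```

The M-bit gap guarantees that p1 * w (at most 2M bits) never carries into the bits holding p2 * w, which is why the two halves can be split cleanly.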
The first accumulator accumulates the first 2M bits of the product, yielding the first accumulation result corresponding to the first pixel group; the second accumulator accumulates the last 2M bits, yielding the second accumulation result corresponding to the second pixel group.
The first register stores the first accumulation result produced by the first accumulator on each operation, and the second register likewise stores each second accumulation result.
The first and second accumulation results together form the processing unit's convolution result. Data transfer and convolution use only the mantissa part of the block floating point data, and the data bit width can be widened during computation, so no bit truncation occurs.
The convolution processing array is designed with three levels of parallelism: input-channel parallelism, output-channel parallelism, and pixel-level parallelism. Combining this three-level parallel convolution array with a ping-pong memory structure improves the system's computational performance.
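The three parallel levels can be pictured as the loops of a plain convolution. In this software sketch the loops run sequentially, but each marked loop corresponds to one hardware-parallel dimension (names are illustrative, not from the patent):

```python
def conv_layer(feature_maps, kernels, k=2):
    """feature_maps[x]: 2-D map of input channel x;
    kernels[x][y]: k x k kernel from input channel x to output channel y.
    Returns out[y]: 2-D map of output channel y (valid convolution)."""
    n_in = len(feature_maps)
    n_out = len(kernels[0])
    h = len(feature_maps[0]) - k + 1
    w = len(feature_maps[0][0]) - k + 1
    out = [[[0] * w for _ in range(h)] for _ in range(n_out)]
    for x in range(n_in):              # level 1: one processing engine per input channel
        for y in range(n_out):         # level 2: one processing unit per output channel
            for r in range(h):
                for c in range(w):     # level 3: pixel-level (pixel pairs share a multiplier)
                    acc = 0
                    for i in range(k):
                        for j in range(k):
                            acc += feature_maps[x][r + i][c + j] * kernels[x][y][i][j]
                    out[y][r][c] += acc  # summed across input channels (third accumulator)
    return out
```

In hardware the x, y, and pixel loops are unrolled into the engine array, the unit array, and the packed multiplies, respectively.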
The convolutional layer accelerator then performs accumulation and activation operations on the multiple convolution results to obtain the block floating point output result of the convolutional layer.
Specifically, the convolutional layer accelerator further includes multiple third accumulators, each with a corresponding activation module; each third accumulator connects to one processing unit in every processing engine. The activation module may implement the ReLU activation function.
For example, as shown in Fig. 2, the convolutional layer accelerator may include 64 third accumulators A1, A2, ..., A64, each connected to one processing unit in every processing engine: accumulator A1 connects to units PU1_1, PU2_1, ..., PU64_1, accumulator A2 to units PU1_2, PU2_2, ..., PU64_2, and so on. The processing units connected to one third accumulator all use convolution kernels of the same output channel.
Each third accumulator sums the convolution results that the different processing units obtained with the kernels of that same output channel, producing a third accumulation result, which it outputs to its corresponding activation module.
The multiple processing engines perform convolution on different input channels simultaneously, and their results are summed in the accumulators.
Each activation module applies the activation operation to the third accumulation result of its corresponding third accumulator, yielding the block floating point output result of the convolutional layer.
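The cross-engine accumulation and activation can be sketched as follows. Here `convolution_results[x][y]` stands for the result of processing unit y in engine x, i.e. input channel x's contribution to output channel y (illustrative naming, not from the patent):

```python
def relu(v):
    """ReLU activation, one possible choice for the activation module."""
    return v if v > 0 else 0

def third_accumulate_and_activate(convolution_results):
    """One third accumulator per output channel sums that channel's
    partial results across all processing engines; its activation
    module then applies ReLU."""
    n_engines = len(convolution_results)
    n_units = len(convolution_results[0])
    outputs = []
    for y in range(n_units):          # one third accumulator per output channel
        acc = sum(convolution_results[x][y] for x in range(n_engines))
        outputs.append(relu(acc))     # activation module
    return outputs
```

With 64 engines of 64 units each, this corresponds to the 64 accumulators A1..A64 of Fig. 2.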
In one possible implementation, the convolutional layer output buffer stores the block floating point output result and, for any layer that is not the last convolutional layer, sends it to the block-floating-point to floating-point converter.
The block-floating-point to floating-point converter converts the block floating point output result into the convolutional layer's floating-point output result, which serves as the layer's output feature map.
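Because each product carries the sum of the input-map and kernel block exponents, the block-FP to float conversion reduces to a power-of-two scaling, i.e. a shift in hardware. A minimal sketch under that assumption (the `frac_bits` fixed-point fraction width and the function name are illustrative):

```python
def bfp_output_to_float(mantissas, in_exp, kernel_exp, frac_bits):
    """Convert a layer's block-FP output mantissas to floats: the
    output block exponent is in_exp + kernel_exp, so only a scale by
    a power of two (a shift) is required."""
    scale = 2.0 ** (in_exp + kernel_exp - frac_bits)
    return [m * scale for m in mantissas]
```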
The output feature map of an even-numbered convolutional layer that is not the last layer may be stored in the third partition as the first input feature map of the next layer; the output feature map of an odd-numbered layer that is not the last layer may be stored in the sixth partition for the same purpose.
The block exponent bits of the output feature map may be stored in the convolutional layer input buffer as the block exponent bits of the next layer's second input feature map group.
For the last convolutional layer, the convolutional layer output buffer sends the block floating point output result to the fully connected layer input buffer, which receives, stores, and forwards it to the fully connected layer accelerator. The fully connected layer accelerator performs the fully connected operation on this result to obtain the block floating point final result, which it sends to the fully connected layer output buffer.
The fully connected layer output buffer sends the block floating point final result to the block-floating-point to floating-point converter, which converts it into the floating-point final result. The floating-point final result can be the output vector of the fully connected layer, which the converter may store in the fourth partition.
Both the floating-point to block-floating-point conversion and the block-floating-point to floating-point conversion require only shift operations, and round-half-up is applied during conversion to avoid error accumulation.
Fig. 5 shows a flowchart of the block-floating-point-based acceleration method for convolutional neural networks according to an embodiment of the disclosure. Fig. 6 shows the data flow of a single output channel of the convolutional neural network accelerator according to an embodiment of the disclosure. As shown in Fig. 5, the method comprises:
Step S10: the convolutional layer input buffer reads the first input feature map group, the first bias set, and the first convolution kernel group of a convolutional layer.
Before step S10, the first input feature map group can be written through the PCIe interface into the first partition of the first memory, and the first convolution kernels and first biases written into the corresponding partitions of the first or second memory according to the layer they belong to.
In one possible implementation, the parameters of the odd-numbered convolutional and fully connected layers can be written into the second partition of the first memory, and the parameters of the even-numbered layers into the fifth partition of the second memory. The parameters of the convolutional and fully connected layers are the convolution kernels, biases, and block exponent bits described above.
Following this storage scheme, the convolutional layer input buffer can read the first input feature map group, first bias set, and first convolution kernel group of the convolutional layer to be processed from the corresponding locations.
Reading all parameters of a convolutional layer directly into on-chip memory at the initial stage can improve throughput.
Step S11: floating-point to block-floating-point conversion is applied to the first input feature map group and the first convolution kernel group of the convolutional layer, yielding the second input feature map group and the second convolution kernel group in block floating point format; the first bias set is shifted to obtain the second bias set.
The second bias set remains fixed-point data. The floating-point to block-floating-point conversion is as described above and is not repeated here.
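Shifting the first bias set into the second bias set amounts to aligning each fixed-point bias with the exponent of the input × kernel products so it can be added straight into the accumulator. A sketch under that assumption (variable names are illustrative, not from the patent):

```python
def shift_bias(bias_fixed, bias_exp, in_exp, kernel_exp):
    """Align a fixed-point bias to the product exponent
    (in_exp + kernel_exp). Only a shift is needed; right shifts
    round half up to avoid accumulating truncation error."""
    shift = bias_exp - (in_exp + kernel_exp)
    if shift >= 0:
        return bias_fixed << shift
    # arithmetic right shift with round half up
    return (bias_fixed + (1 << (-shift - 1))) >> -shift
```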
Step S12: the second input feature map group, the second convolution kernel group, and the second bias set of the convolutional layer are sent to the convolutional layer accelerator.
Step S13: the convolutional layer accelerator performs the convolution multiply-add operations on the second input feature map group, the second convolution kernel group, and the second bias set to obtain the block floating point output result of the convolutional layer.
As shown in Fig. 6, the convolution multiply-add operations are carried out on the second input feature map group, the second convolution kernel group, and the second bias set.
Step S14: if the convolutional layer is not the last convolutional layer, the convolutional layer output buffer sends the block floating point output result to the block-floating-point to floating-point converter (see Fig. 6); if it is the last convolutional layer, the output buffer sends the result to the fully connected layer input buffer.
Step S15: the block-floating-point to floating-point converter converts the block floating point output result into the convolutional layer's floating-point output result, which serves as the layer's output feature map.
As shown in Fig. 6, the output feature map of an even-numbered convolutional layer that is not the last layer may be stored in the third partition (in the first memory of the external memory) as the first input feature map of the next layer, and the output feature map of an odd-numbered layer that is not the last layer may be stored in the sixth partition (in the second memory of the external memory) for the same purpose.
Step S16: the fully connected layer input buffer receives the block floating point output result of the last convolutional layer and sends it to the fully connected layer accelerator.
Step S17: the fully connected layer accelerator performs the fully connected operation on the block floating point output result of the last convolutional layer, obtains the block floating point final result, and sends it to the fully connected layer output buffer.
Step S18: the fully connected layer output buffer sends the block floating point final result to the block-floating-point to floating-point converter, which converts it into the floating-point final result.
By converting the input feature map group and convolution kernels into block floating point, the disclosure replaces traditional floating-point arithmetic with block floating point arithmetic; both the inputs and outputs of the convolution computation are in fixed-point format. This sidesteps the absence of dedicated floating-point units on FPGAs and the high cost of floating-point arithmetic there, significantly reduces the power consumption of deploying a convolutional neural network accelerator on an FPGA platform, and improves throughput.
Both conversions require only shift operations, with round-half-up applied to avoid error accumulation; data transfer and convolution use only the block floating point mantissas, and the data bit width can be widened during computation, so no bit truncation occurs. The disclosure therefore preserves the accuracy of the convolutional neural network model and prevents drift of the model parameters during forward propagation, so no retraining of the model is needed for forward inference. Different convolutional neural network models can run on the disclosed accelerator simply by adjusting its parameter configuration.
The embodiments of the present disclosure have been described above. The description is exemplary rather than exhaustive, and the disclosure is not limited to the embodiments shown. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the illustrated embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or improvements over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (10)
1. A convolutional neural network accelerator, characterized by comprising:
a floating-point to block-floating-point converter that converts the first input feature map group and the first convolution kernel group of a convolutional layer into a second input feature map group and a second convolution kernel group, respectively, wherein the data in the second input feature map group and the second convolution kernel group are block floating point data;
a shifter that converts the first bias set of the convolutional layer into a second bias set according to the block exponents of the data in the second input feature map group and the second convolution kernel group, wherein the data in the second bias set are fixed-point data;
a convolutional layer accelerator that performs convolution multiply-add operations on the second input feature map group, the second convolution kernel group, and the second bias set to obtain the block floating point output result of the convolutional layer; and
a block-floating-point to floating-point converter that converts the block floating point output result of the convolutional layer into the floating-point output result of the convolutional layer as the layer's output feature map.
2. The device according to claim 1, characterized in that the convolutional layer accelerator comprises multiple processing engines, and obtaining the block floating point output result comprises:
each processing engine obtaining its corresponding convolution kernels from the second convolution kernel group;
each processing engine obtaining its corresponding second input feature map from the second input feature map group;
all processing engines simultaneously performing convolution operations on their corresponding second input feature maps and convolution kernels to obtain multiple convolution results; and
the convolutional layer accelerator performing accumulation and activation operations on the multiple convolution results to obtain the block floating point output result of the convolutional layer.
3. The device according to claim 2, characterized in that each processing engine comprises multiple processing units, and each processing engine obtaining its corresponding convolution kernels comprises each processing unit in the engine obtaining its own corresponding convolution kernel.
4. The device according to claim 3, characterized in that the convolution operation performed by a processing engine comprises multiple convolution operations performed by its processing units; each time the processing units of an engine simultaneously perform a convolution operation, they share the pixels that the convolution window fetches from the engine's corresponding second input feature map, the window fetching from a different position for each convolution operation.
5. The device according to claim 4, characterized in that the processing units of a processing engine simultaneously perform the following convolution operation multiple times to obtain each unit's convolution result: each processing unit convolving the pixels fetched for that operation with its corresponding convolution kernel to obtain its convolution result.
6. The device according to claim 5, characterized in that the pixels of one convolution operation comprise a first pixel group and a second pixel group fetched by the convolution window in two passes, and the processing unit comprises a multiplier, a first accumulator, a second accumulator, a first register connected to the first accumulator, and a second register connected to the second accumulator;
the processing unit obtaining its convolution result comprises:
the multiplier, each time, combining a first pixel from the first pixel group and a second pixel from the second pixel group into a third pixel group, and multiplying the third pixel group by the weight of the convolution kernel corresponding to the first and second pixels to obtain a product, wherein the first and second pixels each have M bits, M being a positive integer, and the third pixel group is formed by the first pixel, M vacant bits, and the second pixel in sequence;
the first accumulator accumulating the first 2M bits of the product to obtain the first accumulation result corresponding to the first pixel group, the first register storing the first accumulation result obtained each time;
the second accumulator accumulating the last 2M bits of the product to obtain the second accumulation result corresponding to the second pixel group, the second register storing the second accumulation result obtained each time; and
the first and second accumulation results forming the convolution result of the processing unit.
7. The device according to claim 1, characterized in that the convolutional layer accelerator further comprises multiple third accumulators, each with a corresponding activation module, each third accumulator connecting to one processing unit in each of multiple processing engines;
the accumulation and activation operations comprise:
each third accumulator summing the convolution results obtained by different processing units using convolution kernels of the same output channel to obtain a third accumulation result and outputting it to the corresponding activation module; and
each activation module applying the activation operation to the third accumulation result of its corresponding third accumulator to obtain the block floating point output result of the convolutional layer.
8. The device according to claim 1, characterized in that the convolutional neural network accelerator further comprises a storage module, the storage module comprising a first memory with a first partition, a second partition, a third partition, and a fourth partition, wherein:
the first partition stores the first input feature map group of the first convolutional layer;
the second partition stores the first convolution kernel groups and first bias sets of the odd-numbered convolutional layers;
the third partition stores the output feature maps of the even-numbered convolutional layers that are not the last layer; and
the fourth partition stores the output vector of the fully connected layer.
9. The device according to claim 1 or 8, characterized in that the storage module comprises a second memory with a fifth partition and a sixth partition, wherein the fifth partition stores the first convolution kernel groups and first bias sets of the even-numbered convolutional layers, and the sixth partition stores the output feature maps of the odd-numbered convolutional layers that are not the last layer.
10. The device according to claim 8 or 9, characterized in that the convolutional neural network accelerator further comprises:
a convolutional layer input buffer, connected to the floating-point to block-floating-point converter, the block-floating-point to floating-point converter, and the convolutional layer accelerator, for storing the second input feature map group, second convolution kernel group, and second bias set of a convolutional layer and sending them to the convolutional layer accelerator;
a convolutional layer output buffer, connected to the block-floating-point to floating-point converter, the convolutional layer accelerator, and the fully connected layer input buffer, for storing the block floating point output result, sending the block floating point output result of any layer that is not the last convolutional layer to the block-floating-point to floating-point converter, and sending the block floating point output result of the last convolutional layer to the fully connected layer input buffer;
a fully connected layer input buffer, connected to the fully connected layer accelerator, for receiving and storing the block floating point output result of the last convolutional layer and sending it to the fully connected layer accelerator;
a fully connected layer accelerator, connected to the fully connected layer output buffer, for performing the fully connected operation on the block floating point output result of the last convolutional layer to obtain the block floating point final result and sending it to the fully connected layer output buffer; and
a fully connected layer output buffer, connected to the block-floating-point to floating-point converter, for sending the block floating point final result to the block-floating-point to floating-point converter so that it is converted into the floating-point final result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810865157.9A CN109063825B (en) | 2018-08-01 | 2018-08-01 | Convolutional neural network accelerator |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109063825A true CN109063825A (en) | 2018-12-21 |
CN109063825B CN109063825B (en) | 2020-12-29 |
Family
ID=64832421
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810865157.9A Active CN109063825B (en) | 2018-08-01 | 2018-08-01 | Convolutional neural network accelerator |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109063825B (en) |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109409509A (en) * | 2018-12-24 | 2019-03-01 | 济南浪潮高新科技投资发展有限公司 | A kind of data structure and accelerated method for the convolutional neural networks accelerator based on FPGA |
CN109697083A (en) * | 2018-12-27 | 2019-04-30 | 深圳云天励飞技术有限公司 | Fixed point accelerated method, device, electronic equipment and the storage medium of data |
CN109740733A (en) * | 2018-12-27 | 2019-05-10 | 深圳云天励飞技术有限公司 | Deep learning network model optimization method, device and relevant device |
CN109901814A (en) * | 2019-02-14 | 2019-06-18 | 上海交通大学 | Customized floating number and its calculation method and hardware configuration |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107239829A (en) * | 2016-08-12 | 2017-10-10 | Beijing Deephi Technology Co., Ltd. | Method for optimizing an artificial neural network |
CN108133270A (en) * | 2018-01-12 | 2018-06-08 | Tsinghua University | Convolutional neural network acceleration method and device |
CN108229670A (en) * | 2018-01-05 | 2018-06-29 | Suzhou Institute for Advanced Study, University of Science and Technology of China | FPGA-based deep neural network acceleration platform |
- 2018-08-01: Application CN201810865157.9A filed in China; granted as patent CN109063825B (status: Active)
Non-Patent Citations (3)
Title |
---|
CHUNSHENG MEI et al.: "A 200MHZ 202.4GFLOPS@10.8W VGG16 accelerator in Xilinx VX690T", 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP) * |
MARIO DRUMOND et al.: "End-to-End DNN Training with Block Floating Point Arithmetic", arXiv:1804.01526v2 * |
ZHOURUI SONG et al.: "Computation Error Analysis of Block Floating Point Arithmetic Oriented Convolution Neural Network Accelerator Design", arXiv:1709.07776v2 * |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109409509A (en) * | 2018-12-24 | 2019-03-01 | Jinan Inspur Hi-Tech Investment and Development Co., Ltd. | Data structure and acceleration method for an FPGA-based convolutional neural network accelerator |
CN109697083B (en) * | 2018-12-27 | 2021-07-06 | Shenzhen Intellifusion Technologies Co., Ltd. | Fixed-point acceleration method and device for data, electronic device and storage medium |
CN109697083A (en) * | 2018-12-27 | 2019-04-30 | Shenzhen Intellifusion Technologies Co., Ltd. | Fixed-point acceleration method and device for data, electronic device and storage medium |
CN109740733A (en) * | 2018-12-27 | 2019-05-10 | Shenzhen Intellifusion Technologies Co., Ltd. | Deep learning network model optimization method, apparatus and related device |
CN113273082A (en) * | 2018-12-31 | 2021-08-17 | Microsoft Technology Licensing, LLC | Neural network activation compression with outlier block floating-point |
CN109901814A (en) * | 2019-02-14 | 2019-06-18 | Shanghai Jiao Tong University | Customized floating-point number, its computation method and hardware architecture |
CN110059817A (en) * | 2019-04-17 | 2019-07-26 | Sun Yat-sen University | Method for implementing a convolver with low resource consumption |
CN110147252A (en) * | 2019-04-28 | 2019-08-20 | DeepBlue Technology (Shanghai) Co., Ltd. | Parallel computing method and device for convolutional neural networks |
CN110059823A (en) * | 2019-04-28 | 2019-07-26 | University of Science and Technology of China | Deep neural network model compression method and device |
WO2020238472A1 (en) * | 2019-05-30 | 2020-12-03 | ZTE Corporation | Machine learning engine implementation method and apparatus, terminal device, and storage medium |
CN110442323B (en) * | 2019-08-09 | 2023-06-23 | Fudan University | Device and method for performing floating-point or fixed-point multiply-add operations |
CN110442323A (en) * | 2019-08-09 | 2019-11-12 | Fudan University | Architecture and method for performing floating-point or fixed-point multiply-add operations |
CN110930290A (en) * | 2019-11-13 | 2020-03-27 | Neusoft Reach Automotive Technology (Shenyang) Co., Ltd. | Data processing method and device |
CN111047010A (en) * | 2019-11-25 | 2020-04-21 | Tianjin University | Method and device for reducing first-layer convolution computation delay of a CNN accelerator |
CN111091183A (en) * | 2019-12-17 | 2020-05-01 | Shenzhen Corerain Technologies Co., Ltd. | Neural network acceleration system and method |
CN111091183B (en) * | 2019-12-17 | 2023-06-13 | Shenzhen Corerain Technologies Co., Ltd. | Neural network acceleration system and method |
CN111178508A (en) * | 2019-12-27 | 2020-05-19 | Zhuhai Eeasy Technology Co., Ltd. | Operation device and method for executing a fully connected layer in a convolutional neural network |
CN111178508B (en) * | 2019-12-27 | 2024-04-05 | Zhuhai Eeasy Technology Co., Ltd. | Computing device and method for executing a fully connected layer in a convolutional neural network |
CN111738427B (en) * | 2020-08-14 | 2020-12-29 | University of Electronic Science and Technology of China | Neural network operation circuit |
CN111738427A (en) * | 2020-08-14 | 2020-10-02 | University of Electronic Science and Technology of China | Neural network operation circuit |
WO2022041188A1 (en) * | 2020-08-31 | 2022-03-03 | SZ DJI Technology Co., Ltd. | Accelerator for neural network, acceleration method and device, and computer storage medium |
CN112232499A (en) * | 2020-10-13 | 2021-01-15 | Huazhong Institute of Electro-Optics (No. 717 Research Institute of China Shipbuilding Industry Corporation) | Convolutional neural network accelerator |
CN112734020A (en) * | 2020-12-28 | 2021-04-30 | The 15th Research Institute of China Electronics Technology Group Corporation | Convolution multiply-accumulate hardware acceleration device, system and method for a convolutional neural network |
CN113554163B (en) * | 2021-07-27 | 2024-03-29 | Shenzhen SmartMore Technology Co., Ltd. | Convolutional neural network accelerator |
CN113554163A (en) * | 2021-07-27 | 2021-10-26 | Shenzhen SmartMore Technology Co., Ltd. | Convolutional neural network accelerator |
CN113780523A (en) * | 2021-08-27 | 2021-12-10 | Shenzhen Intellifusion Technologies Co., Ltd. | Image processing method and device, terminal device and storage medium |
CN113780523B (en) * | 2021-08-27 | 2024-03-29 | Shenzhen Intellifusion Technologies Co., Ltd. | Image processing method and device, terminal device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109063825B (en) | 2020-12-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109063825A (en) | Convolutional neural networks accelerator | |
WO2021004366A1 (en) | Neural network accelerator based on structured pruning and low-bit quantization, and method | |
CN108805266B (en) | Reconfigurable CNN high-concurrency convolution accelerator | |
CN108108809B (en) | Hardware architecture for inference acceleration of a convolutional neural network and working method thereof | |
CN109447241B (en) | Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things | |
CN108564168A (en) | Design method for a multi-precision convolutional neural network processor | |
CN107832082A (en) | Apparatus and method for performing artificial neural network forward operations | |
CN108090560A (en) | Design method for an FPGA-based LSTM recurrent neural network hardware accelerator | |
CN110070178A (en) | Convolutional neural network computing device and method | |
CN109635944A (en) | Sparse convolutional neural network accelerator and implementation method | |
CN108154229B (en) | Image processing method based on FPGA (field programmable Gate array) accelerated convolutional neural network framework | |
CN110516801A (en) | High-throughput dynamically reconfigurable convolutional neural network accelerator architecture | |
CN107066239A (en) | Hardware architecture for implementing convolutional neural network forward computation | |
CN109934336A (en) | Neural network dynamic acceleration platform design method based on optimal structure search, and neural network dynamic acceleration platform | |
CN108764466A (en) | FPGA-based convolutional neural network hardware and its acceleration method | |
CN111242277A (en) | FPGA-based convolutional neural network accelerator supporting sparse pruning | |
CN110163354A (en) | Computing device and method | |
CN108629411A (en) | Convolution operation hardware implementation apparatus and method | |
CN109086879B (en) | Method for realizing dense connection neural network based on FPGA | |
CN109615071A (en) | High-energy-efficiency neural network processor, acceleration system and method | |
CN110321997A (en) | Highly parallel computing platform, system and computation implementation method | |
CN108596331A (en) | Optimization method for cellular neural network hardware architecture | |
CN108491924B (en) | Neural network data serial flow processing device for artificial intelligence calculation | |
CN113298237A (en) | Convolutional neural network on-chip training accelerator based on FPGA | |
CN113222129B (en) | Convolution operation processing unit and system based on multi-level cache reuse |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||