CN109063825A - Convolutional neural networks accelerator - Google Patents
- Publication number
- Publication number: CN109063825A; Application number: CN201810865157.9A
- Authority
- CN
- China
- Prior art keywords
- convolutional layer
- group
- convolution
- point
- floating point
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Complex Calculations (AREA)
Abstract
This disclosure relates to a convolutional neural networks accelerator. The input feature map group and the convolution kernels are each converted to block floating-point format, so that block floating-point arithmetic replaces conventional floating-point arithmetic and both the inputs and the outputs of the convolution computation are in fixed-point format, which solves the problem that floating-point operations are costly on an FPGA. The float-to-block-float and block-float-to-float conversions require only shift operations, and rounding is applied during conversion to avoid error accumulation. Data transmission and convolution computation need only the mantissa part of the block floating-point data, and the data bit width can be expanded during computation, so no bits are truncated. The disclosure therefore preserves the accuracy of the convolutional neural network model and effectively prevents the model parameters from drifting during forward propagation; as a result, no retraining of the model is required during forward inference.
Description
Technical field
This disclosure relates to the field of neural network technology, and in particular to a convolutional neural networks accelerator.
Background technique
Convolutional neural networks have achieved outstanding performance in artificial-intelligence applications, especially image recognition, natural language processing, and strategy deployment. Their success stems mainly from improvements in computing-device performance. However, as the number of network layers increases, the weight data of a convolutional neural network can reach hundreds of megabytes or even more than a gigabyte, and the computing resources consumed by forward feature extraction, classification, and error propagation are enormous. Accelerating convolutional neural networks is therefore the key to improving the computational efficiency of convolutional neural network models.
In the related art, an FPGA (Field-Programmable Gate Array) is a programmable logic gate array with outstanding parallel computing capability. A specially designed FPGA has low power consumption, high speed, and reconfigurability; moreover, an FPGA runs without an operating system and can be dedicated to a single deterministic task, which reduces the possibility of failures. Convolutional neural network acceleration schemes based on FPGA platforms have therefore become a popular direction of development.
However, deploying convolutional neural networks on an FPGA platform faces two major obstacles: the off-chip transmission bottleneck and the huge cost of floating-point operations. Off-chip transmission arises mainly from frequent accesses to network parameters and feature maps, and the growing number of network layers increases the bandwidth demand on the FPGA. FPGAs also lack floating-point units, so performing floating-point arithmetic on an FPGA degrades both throughput and power efficiency.
Many methods, such as data reuse, compression, and pruning, have been proposed to meet the bandwidth demand of FPGAs. However, these methods require retraining or entropy coding of the network, which can consume more time than the original network and hinders real-time processing of convolutional neural networks.
Fixed-point arithmetic is commonly used in place of floating-point arithmetic to improve the computational performance of FPGAs. However, a common drawback of these methods is that retraining is needed to update the parameters. Retraining is a very resource-intensive process and, when applied to deep network models, may require even more hardware resources.
Summary of the invention
In view of this, to solve the above problems, the present disclosure proposes a convolutional neural networks accelerator that improves throughput and power efficiency.
According to one aspect of the disclosure, a convolutional neural networks accelerator is provided, comprising:
a float-to-block-float converter, which converts the first input feature map group and the first convolution kernel group of a convolutional layer to generate a second input feature map group and a second convolution kernel group, wherein the data in the second input feature map group and the second convolution kernel group are block floating-point data;
a shifter, which converts the first bias set of the convolutional layer into a second bias set according to the block exponents of the data in the second input feature map group and the second convolution kernel group, wherein the data in the second bias set are fixed-point data;
a convolutional-layer accelerator, which performs convolution multiply-add operations according to the second input feature map group, the second convolution kernel group, and the second bias set to obtain the block floating-point output result of the convolutional layer; and
a block-float-to-float converter, which converts the block floating-point output result of the convolutional layer to obtain the floating-point output result of the convolutional layer as the output feature map of the convolutional layer.
In one possible implementation, the convolutional-layer accelerator includes multiple processing engines, and performing the convolution multiply-add operations according to the second input feature map group, the second convolution kernel group, and the second bias set to obtain the block floating-point output result of the convolutional layer comprises:
each processing engine obtains its corresponding convolution kernels from the second convolution kernel group;
each processing engine obtains its corresponding second input feature map from the second input feature map group;
the processing engines simultaneously perform convolution operations according to their corresponding second input feature maps and convolution kernels, obtaining multiple convolution results; and
the convolutional-layer accelerator accumulates the multiple convolution results and applies an activation operation to obtain the block floating-point output result of the convolutional layer.
In one possible implementation, each processing engine includes multiple processing units, and each processing engine obtaining its corresponding convolution kernels from the second convolution kernel group comprises: each processing unit in the processing engine obtains its own corresponding convolution kernel.
In one possible implementation, the convolution operation performed by a processing engine includes multiple convolution operations performed by the processing units within the engine. Each time the processing units in an engine perform convolution operations simultaneously, they share the pixels that the engine obtains through a convolution window from its corresponding second input feature map; the position at which the convolution window obtains pixels from the second input feature map differs for each convolution operation.
In one possible implementation, the processing units in a processing engine repeatedly and simultaneously perform the following convolution operation to obtain each unit's convolution result: the processing unit performs a convolution operation on the pixels obtained that time and the convolution kernel corresponding to the unit, obtaining the convolution result of that unit.
In one possible implementation, the pixels used in one convolution operation include a first pixel group and a second pixel group obtained by the convolution window in two fetches, and the processing unit includes a multiplier, a first accumulator, a second accumulator, a first register connected to the first accumulator, and a second register connected to the second accumulator. The processing unit performing a convolution operation on the pixels and its corresponding convolution kernel comprises:
each time, the multiplier takes a first pixel from the first pixel group and a second pixel from the second pixel group, composes them into a third pixel group, and multiplies the third pixel group by the weight of the convolution kernel corresponding to the first and second pixels, obtaining a product; the first pixel and the second pixel are each M bits wide, M being a positive integer, and the third pixel group is formed by the first pixel, M vacant bits, and the second pixel in sequence;
the first accumulator accumulates the upper 2M bits of the product, obtaining the first accumulation result corresponding to the first pixel group, the first register storing each first accumulation result obtained by the first accumulator;
the second accumulator accumulates the lower 2M bits of the product, obtaining the second accumulation result corresponding to the second pixel group, the second register storing each second accumulation result obtained by the second accumulator; and
the first accumulation result and the second accumulation result form the convolution result of the processing unit.
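The packed-multiplier idea in this implementation — one physical multiplier producing two products by separating two M-bit pixels with M guard bits — can be sketched in software as follows. This is a simplified illustration with unsigned operands; handling signed mantissas requires correction logic not shown here, and M = 8 is an assumed width, not taken from the disclosure.

```python
M = 8  # assumed pixel/weight bit width for illustration

def packed_multiply(p1, p2, w):
    # "Third pixel group": first pixel, M vacant (zero) bits, second pixel
    packed = (p1 << (2 * M)) | p2
    product = packed * w                    # a single multiplication
    low = product & ((1 << (2 * M)) - 1)    # lower 2M bits  -> p2 * w
    high = product >> (2 * M)               # remaining bits -> p1 * w
    return high, low
```

The M guard bits ensure p2 * w (at most 2M bits wide) cannot carry into the bits holding p1 * w, so the two products separate cleanly.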
In one possible implementation, the convolutional-layer accelerator further includes multiple third accumulators and an activation module corresponding to each third accumulator, each third accumulator being connected to one processing unit in each of the multiple processing engines. Accumulating the multiple convolution results and applying the activation operation to obtain the block floating-point output result of the convolutional layer comprises:
each third accumulator accumulates the convolution results that different processing units obtained with the convolution kernels of the same output channel, obtains a third accumulation result, and outputs it to the corresponding activation module; and
each activation module applies the activation operation to the third accumulation result obtained by its corresponding third accumulator, obtaining the block floating-point output result of the convolutional layer.
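A behavioral sketch of this accumulate-then-activate stage. ReLU is assumed as the activation for illustration; the disclosure does not specify one.

```python
def third_accumulate(per_engine_results):
    # Sum the partial results that processing units in different engines
    # produced for the SAME output channel (one partial per input-channel group).
    return sum(per_engine_results)

def activate(x):
    # Activation function; ReLU is an assumption, not fixed by the disclosure.
    return max(x, 0)
```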
In one possible implementation, the convolutional neural networks accelerator further includes a memory module. The memory module includes a first memory comprising a first partition, a second partition, a third partition, and a fourth partition: the first partition stores the first input feature map group of the first convolutional layer; the second partition stores the first convolution kernel groups and first bias sets of the odd-numbered convolutional layers; the third partition stores the output feature maps of the even-numbered convolutional layers other than the last layer; and the fourth partition stores the output vector of the fully connected layer.
In one possible implementation, the memory module includes a second memory comprising a fifth partition and a sixth partition: the fifth partition stores the first convolution kernel groups and first bias sets of the even-numbered convolutional layers, and the sixth partition stores the output feature maps of the odd-numbered convolutional layers other than the last layer.
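The odd/even ping-pong between the two memories described above can be summarized with a small sketch. Partition names are illustrative labels, layers are 1-indexed, and "M1"/"M0" follow the first/second memory names used later in the description.

```python
def kernel_partition(layer):
    # Odd layers' kernels and biases: first memory, partition 2;
    # even layers': second memory, partition 5.
    return "M1.partition2" if layer % 2 == 1 else "M0.partition5"

def output_partition(layer, num_layers):
    if layer == num_layers:
        return None  # the last layer's output goes on to the fully connected stage
    # Even layers' outputs: first memory, partition 3; odd layers': second
    # memory, partition 6.
    return "M1.partition3" if layer % 2 == 0 else "M0.partition6"
```

Alternating partitions means a layer's parameter reads and its feature-map writes ping-pong between the two memories, so consecutive layers can stream without contending for the same region.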
In one possible implementation, the convolutional neural networks accelerator further includes:
a convolutional-layer input buffer, connected to the float-to-block-float converter, the block-float-to-float converter, and the convolutional-layer accelerator, which stores the second input feature map group, second convolution kernel group, and second bias set of a convolutional layer and sends them to the convolutional-layer accelerator;
a convolutional-layer output buffer, connected to the block-float-to-float converter, the convolutional-layer accelerator, and the fully-connected-layer input buffer, which stores the block floating-point output results, sends the block floating-point output results of all but the last convolutional layer to the block-float-to-float converter, and sends the block floating-point output result of the last convolutional layer to the fully-connected-layer input buffer;
a fully-connected-layer input buffer, connected to the fully-connected-layer accelerator, which receives and stores the block floating-point output result of the last convolutional layer and sends it to the fully-connected-layer accelerator;
a fully-connected-layer accelerator, connected to the fully-connected-layer output buffer, which performs the fully connected operation on the block floating-point output result of the last convolutional layer, obtains the final block floating-point result, and sends it to the fully-connected-layer output buffer; and
a fully-connected-layer output buffer, connected to the block-float-to-float converter, which sends the final block floating-point result to the block-float-to-float converter so that it is converted into the final floating-point result.
Advantageous effects:
The disclosure converts the input feature map group and the convolution kernels into block floating-point format via float-to-block-float conversion, replacing conventional floating-point arithmetic with block floating-point arithmetic. Both the inputs and the outputs of the convolution computation are in fixed-point format, which cleverly avoids the FPGA's lack of floating-point units, solves the problem of costly floating-point operations on FPGAs, significantly reduces the power consumption of deploying a convolutional neural networks accelerator on an FPGA platform, and improves throughput.
The float-to-block-float and block-float-to-float conversions require only shift operations, and rounding is applied during conversion to avoid error accumulation. Data transmission and convolution computation need only the mantissa part of the block floating-point data, and the data bit width can be expanded during computation, so no bits are truncated. The disclosure therefore preserves the accuracy of the convolutional neural network model and effectively prevents the model parameters from drifting during forward propagation; as a result, no retraining of the model is required during forward inference, and different convolutional neural network models can be deployed on the accelerator of the disclosure by adjusting the parameter configuration.
Other features and aspects of the disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Detailed description of the invention
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the specification, serve to explain the principles of the disclosure.
Fig. 1 is a block diagram of a convolutional neural networks accelerator according to an exemplary embodiment.
Fig. 2 shows a schematic diagram of a convolutional-layer accelerator according to an example of the disclosure.
Fig. 3 shows a schematic diagram of a processing unit according to an example of the disclosure.
Fig. 4 shows a schematic diagram of a data format according to an example of the disclosure.
Fig. 5 shows a flowchart of a convolutional neural network acceleration method based on block floating-point arithmetic according to an embodiment of the disclosure.
Fig. 6 shows the data flow of a single output channel of the convolutional neural networks accelerator according to an embodiment of the disclosure.
Specific embodiment
Various exemplary embodiments, features, and aspects of the disclosure are described in detail below with reference to the accompanying drawings. Identical reference numerals in the drawings denote elements with identical or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless specifically noted.
The word "exemplary" here means "serving as an example, embodiment, or illustration". Any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are given in the following detailed description to better illustrate the disclosure. Those skilled in the art will appreciate that the disclosure can be practiced without certain of these details. In some instances, devices, means, elements, and circuits well known to those skilled in the art are not described in detail, so as to highlight the gist of the disclosure.
Fig. 1 is a block diagram of a convolutional neural networks accelerator according to an exemplary embodiment. The convolutional neural networks accelerator of the disclosure can be applied to all kinds of FPGAs or ASICs (Application-Specific Integrated Circuits), without limitation here. As shown in Fig. 1, the convolutional neural networks accelerator may include: a float-to-block-float converter, a shifter, a convolutional-layer accelerator, and a block-float-to-float converter.
The float-to-block-float converter converts the first input feature map group and the first convolution kernel group of a convolutional layer to generate a second input feature map group and a second convolution kernel group; the data in the second input feature map group and the second convolution kernel group are block floating-point data.
The shifter converts the first bias set of the convolutional layer into a second bias set according to the block exponents of the data in the second input feature map group and the second convolution kernel group; the data in the second bias set are fixed-point data.
The data contained in the first input feature map group and the first convolution kernel group may be floating-point numbers.
A floating-point number is a computer approximation of an arbitrary real number; the representation resembles base-10 scientific notation. A real number is represented by an integer or fixed-point number (the mantissa) multiplied by an integral power of some radix (usually 2 in computers). A normalized floating-point number has the form

±m × β^e

where m is the mantissa (if the precision of m is p, then m is a p-digit number of the form ±d.ddd...ddd, with 0 ≤ d < β), β is the radix, and e is the exponent.
Fixed-point data is another number representation used in computers, in which the position of the radix point of the number taking part in the operation is fixed. For example, in the Q format Qm.n, m bits represent the integer part and n bits represent the fractional part, so m + n + 1 bits are needed in total, the extra bit being the sign bit.
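The Qm.n convention can be sketched as follows (an illustrative encoding helper, not part of the disclosed hardware):

```python
def to_q(x, m, n):
    # Encode x in Qm.n: m integer bits, n fractional bits, one sign bit
    # (m + n + 1 bits in total), storing round(x * 2**n) as an integer.
    q = int(round(x * (1 << n)))
    lo, hi = -(1 << (m + n)), (1 << (m + n)) - 1
    if not lo <= q <= hi:
        raise OverflowError(f"{x} does not fit in Q{m}.{n}")
    return q

def from_q(q, n):
    # Decode a Qm.n integer back to a real value.
    return q / (1 << n)
```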
A block floating-point algorithm imitates floating-point arithmetic in software on a fixed-point DSP, in order to obtain higher computational accuracy and a larger dynamic range.
In block floating-point format, a whole data block shares a single exponent. For example, assume X is a block containing N floating-point numbers; X can be written as

X = {x_1, x_2, ..., x_N}, with x_i = m_i × 2^(e_i)

where x_i is the i-th element of X, and m_i and e_i are the mantissa and exponent of x_i. The largest exponent in the block is defined as the block exponent:

ε_X = max_i (e_i)

After deriving the block exponent, the mantissa of each x_i is right-shifted by d_i bits, where d_i = ε_X − e_i. The block floating-point format X′ of X is therefore represented as

X′ = {m′_1, m′_2, ..., m′_N} × 2^(ε_X)

where m′_i = m_i >> d_i is the mantissa after conversion.
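The conversion above can be sketched in software as follows. This is a simplified illustration: the actual converter works on IEEE-754 bit fields with pure shifts, and the 8-bit mantissa width is an assumption, not taken from the disclosure.

```python
import math

def float_to_bfp(block, mantissa_bits=8):
    # Per-element exponent e_i such that x = f * 2**e_i with |f| in [0.5, 1)
    exps = [math.frexp(x)[1] for x in block]
    block_exp = max(exps)                       # shared block exponent
    mantissas = []
    for x in block:
        f = x / 2.0 ** block_exp                # align to the block exponent
        # round-to-nearest when quantizing, to limit accumulated error
        mantissas.append(int(round(f * (1 << (mantissa_bits - 1)))))
    return block_exp, mantissas

def bfp_to_float(block_exp, mantissas, mantissa_bits=8):
    scale = 2.0 ** block_exp / (1 << (mantissa_bits - 1))
    return [m * scale for m in mantissas]
```

Elements much smaller than the block maximum lose low-order bits when their mantissas are right-shifted; that is the price of sharing one exponent across the block.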
The first input feature map group may include the first input feature maps of one or more input channels, where a first input feature map is an input feature map obtained by performing feature extraction on the data to be processed. For example, if the data to be processed is a color image, red-channel, green-channel, and blue-channel feature extraction can be performed on the color image, yielding the input feature maps of the red, green, and blue channels, respectively.
Generally, in image processing, for a given input image, each pixel value of the output image is a weighted average of the pixel values in a small region of the input image, where the weights are defined by a function; this function may be called a convolution kernel. The first convolution kernel group may include one or more first convolution kernels.
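As a concrete illustration of this definition, a plain software sketch of the operation (unrelated to the hardware implementation):

```python
def conv2d(image, kernel):
    # Valid-mode 2D convolution as used in CNNs (i.e. cross-correlation):
    # each output pixel is the weighted sum of the kernel-sized input region.
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + u][j + v] * kernel[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(ow)]
            for i in range(oh)]
```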
The float-to-block-float converter can obtain the first input feature map group and the first convolution kernel group through a memory interface, and convert the data in them into block floating-point data, obtaining the second input feature map group and the second convolution kernel group.
The convolutional-layer accelerator may include multiple processing engines. For example, Fig. 2 shows a schematic diagram of a convolutional-layer accelerator according to an example of the disclosure; taking 16 channels processed at a time as an example, the convolutional-layer accelerator may include 16 processing engines PE1, PE2, ..., PE16.
Each processing engine obtains its corresponding convolution kernels from the second convolution kernel group. Each processing engine may include multiple processing units, one processing unit corresponding to one output channel, and each processing unit in a processing engine obtains its own corresponding convolution kernel.
The pixels of the input feature maps of all input channels form one block and share the same block exponent. All weights of each output channel (the convolution kernels for all input channels) form one block and share the same block exponent. This partitioning method aligns all input data before the convolution computation, so that data transmission and convolution computation need only the mantissa part of the block floating-point data.
As shown in Fig. 2, a processing engine may include 64 processing units; the 64 convolution kernels in the second convolution kernel group correspond to the 64 processing units of each processing engine, that is, each of the 64 processing units in a processing engine performs convolution multiply-add operations with its own corresponding convolution kernel. The exponent of a processing unit equals the sum of the block exponent of the corresponding input feature map and the block exponent of the weights (the convolution kernel). For example, the exponent of PU1_1 equals the block exponent of PE1's input feature map plus the block exponent of the convolution kernel corresponding to PU1_1.
The first bias set may include multiple first biases, one first bias corresponding to each output channel of the convolutional layer. The shifter can shift each first bias according to the difference between the first bias of each output channel and the exponent of the corresponding processing unit, obtaining the second biases in the corresponding fixed-point format, b1, b2, ..., b64 as shown in Fig. 2, and thus the second bias set.
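The bias alignment can be sketched as follows. This is a simplification: the real shifter also accounts for mantissa fraction widths, while here the processing-unit exponent is taken to be just the sum of the two block exponents.

```python
def shift_bias(bias, fm_block_exp, kernel_block_exp):
    # Products inside a processing unit carry the implicit scale
    # 2**(fm_block_exp + kernel_block_exp); rescale the floating-point
    # bias to that fixed-point scale so it can be added directly.
    unit_exp = fm_block_exp + kernel_block_exp
    return int(round(bias / 2.0 ** unit_exp))
```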
The float-to-block-float converter and the shifter can also convert the parameters (convolution kernels and biases) of the fully connected layer through the above process.
The convolutional neural networks accelerator may also include a memory module. The memory module may include a first memory DDR3 M1 and a second memory DDR3 M0, each connected to the memory interface; the capacity of each may be 4 GB. The first input feature map group, the first convolution kernel group, and the first bias set can be stored in the first memory DDR3 M1 and/or the second memory DDR3 M0.
The first memory may include the first, second, third, and fourth partitions, and the second memory may include the fifth and sixth partitions. The convolutional neural networks accelerator may also include a PCIe interface connected to the first memory and the second memory; through the PCIe interface, the data to be processed (for example, the first input feature map group) and the parameters (such as the first bias set and the first convolution kernel group of a convolutional layer) can be written into the first memory and the second memory.
The first partition stores the first input feature map group of the first convolutional layer; for example, the first input feature map group can be written into the first partition of the first memory through the PCIe interface.
The second partition stores the parameters of the odd-numbered convolutional and fully connected layers, which may include convolution kernels and biases; for example, the second partition can store the first convolution kernel groups and first bias sets of the odd-numbered convolutional layers.
The fifth partition stores the parameters of the even-numbered convolutional and fully connected layers, which may include convolution kernels and biases; for example, the fifth partition can store the first convolution kernel groups and first bias sets of the even-numbered convolutional layers.
The third partition stores the output feature maps of the even-numbered convolutional layers other than the last layer; the sixth partition stores the output feature maps of the odd-numbered convolutional layers other than the last layer. The fourth partition stores the output vector of the fully connected layer.
The convolutional neural networks accelerator may also include: a convolutional-layer input buffer, a convolutional-layer output buffer, a fully-connected-layer input buffer, a fully-connected-layer accelerator, and a fully-connected-layer output buffer.
The convolutional-layer input buffer is connected to the float-to-block-float converter, the block-float-to-float converter, and the convolutional-layer accelerator; it can store the second input feature map group, second convolution kernel group, and second bias set of a convolutional layer and send them to the convolutional-layer accelerator.
The float-to-block-float converter can read the first input feature map group and first convolution kernel group of a convolutional layer from the first or second memory (according to the layer number) through the memory interface, and convert them to generate the second input feature map group and second convolution kernel group. The shifter can read the first bias set of the convolutional layer from the first or second memory (according to the layer number) through the memory interface, and shift it to obtain the second bias set.
The convolutional-layer input buffer can store the second input feature map group and second convolution kernel group of the convolutional layer after conversion by the float-to-block-float converter. The convolutional-layer input buffer may include a convolution window, two pixel memories, two weight memories, and one block-exponent memory, so that it can store the data read from the float-to-block-float converter.
The convolutional-layer accelerator performs the convolution multiply-add operations according to the second input feature map group, the second convolution kernel group, and the second bias set, obtaining the block floating-point output result of the convolutional layer. It can perform these operations using the second input feature map group and second convolution kernel group stored in the convolutional-layer input buffer and the second bias set output by the shifter.
Specifically, each processing engine obtains its corresponding convolution kernels from the second convolution kernel group; within an engine, each processing unit obtains its own kernel. As described above, each processing engine contains 64 processing units, and each unit fetches its corresponding kernel from the second convolution kernel group.
Each processing engine also obtains its corresponding second input feature map from the second input feature map group. All processing engines then perform convolution operations simultaneously, each on its own second input feature map and kernels, producing multiple convolution results.
In one possible implementation, the processing units within an engine perform the convolution operation multiple times to obtain multiple convolution results. Each time the units of an engine operate in parallel, they share the pixels that the convolution window fetches from the engine's second input feature map. The number of shared pixels is determined by the number of weights in the kernel; for example, if the kernel is a 2 × 2 matrix, 2 × 2 pixels are fetched from the second input feature map, themselves arranged as a matrix within that map.
For each convolution operation the convolution window fetches pixels from a different position in the engine's second input feature map; in one example, the window moves over the map with a stride of 1.
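The convolution window's pixel fetch can be illustrated with a short sketch, using the 2 × 2 window and stride of 1 from the example above (names are illustrative, not from the patent):

```python
def conv_window(feature_map, k=2, stride=1):
    """Yield every k x k pixel patch of a 2-D feature map, moving the
    window by `stride`, as the patches would be shared by all
    processing units of one engine."""
    rows, cols = len(feature_map), len(feature_map[0])
    for r in range(0, rows - k + 1, stride):
        for c in range(0, cols - k + 1, stride):
            yield [[feature_map[r + i][c + j] for j in range(k)]
                   for i in range(k)]
```

On a 3 × 3 map this yields four 2 × 2 patches, one per window position.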
Taking one such convolution operation as an example, the processing units proceed simultaneously: after fetching the shared pixels, each processing unit convolves them with its own kernel and obtains its convolution result.
In this convolution operation, the shared pixels may comprise a first pixel group and a second pixel group fetched by the convolution window in two passes. For example, the window fetches 2 × 2 pixels from the second input feature map twice, with a stride of 1 between the two fetches, yielding the first and second pixel groups. As shown in Fig. 2, i_x(m, n) denotes the pixel of the x-th input channel at position (m, n).
Fig. 3 shows a schematic diagram of a processing unit according to an example of the disclosure. As shown in Fig. 3, the processing unit may include a multiplier, a first accumulator, a second accumulator, a first register connected to the first accumulator, and a second register connected to the second accumulator.
Here k_xy denotes the convolution kernel of the y-th output channel for the x-th input channel.
During the convolution operation, the multiplier combines a first pixel taken from the first pixel group and a second pixel taken from the second pixel group into a third pixel group, and multiplies the third pixel group by the kernel weight corresponding to both pixels, producing a product. The multiplier fetches pixels from the first and second pixel groups in the same manner and quantity.
For example, if the first and second pixels each have M bits (M a positive integer), the third pixel group is formed by the first pixel, M vacant bits, and the second pixel in sequence. Fig. 4 shows a schematic diagram of this data format: A represents the third pixel group, with the first pixel in bit positions 0–7, vacant bits in positions 8–15, and the second pixel in positions 16–23; B is the weight of the processing unit's kernel that corresponds to both pixels.
Multiplying the third pixel group A by the weight B effectively multiplies the first pixel and the second pixel by the weight in a single operation, producing a 4M-bit product. These two multiplications can be carried out by a single DSP48E1 slice, as shown in Fig. 3. In the resulting product P, the first 2M bits are the product of the first pixel and the weight, and the last 2M bits are the product of the second pixel and the weight.
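The packed multiplication can be modeled in software as below. This sketch assumes unsigned M-bit operands and places the first pixel in the low bits, so the low 2M bits of the product belong to the first pixel and the high 2M bits to the second; signed operands would need a correction term, omitted here.

```python
M = 8  # pixel and weight bit width (illustrative choice)

def packed_multiply(p1, p2, w):
    """Compute p1*w and p2*w with one multiplication by packing both
    pixels into a single operand separated by an M-bit zero gap,
    mimicking how one DSP48E1 slice can serve two products."""
    assert all(0 <= v < 2 ** M for v in (p1, p2, w))
    a = (p2 << (2 * M)) | p1        # "third pixel group": p1, gap, p2
    product = a * w                 # one multiply, up to a 4M-bit result
    first = product & ((1 << (2 * M)) - 1)   # low 2M bits: p1 * w
    second = product >> (2 * M)              # high 2M bits: p2 * w
    return first, second
```

The M-bit gap guarantees that p1 * w (at most 2M bits) never carries into the bits holding p2 * w, which is why the two halves can be split cleanly.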
The first accumulator accumulates the first 2M bits of the product, yielding the first accumulation result corresponding to the first pixel group; the second accumulator accumulates the last 2M bits, yielding the second accumulation result corresponding to the second pixel group.
The first register stores the first accumulation result produced by the first accumulator on each operation, and the second register likewise stores each second accumulation result.
The first and second accumulation results together form the processing unit's convolution result. Data transfer and convolution use only the mantissa part of the block floating point data, and the data bit width can be widened during computation, so no bit truncation occurs.
The convolution processing array is designed with three levels of parallelism: input-channel parallelism, output-channel parallelism, and pixel-level parallelism. Combining this three-level parallel convolution array with a ping-pong memory structure improves the system's computational performance.
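The three parallel levels can be pictured as the loops of a plain convolution. In this software sketch the loops run sequentially, but each marked loop corresponds to one hardware-parallel dimension (names are illustrative, not from the patent):

```python
def conv_layer(feature_maps, kernels, k=2):
    """feature_maps[x]: 2-D map of input channel x;
    kernels[x][y]: k x k kernel from input channel x to output channel y.
    Returns out[y]: 2-D map of output channel y (valid convolution)."""
    n_in = len(feature_maps)
    n_out = len(kernels[0])
    h = len(feature_maps[0]) - k + 1
    w = len(feature_maps[0][0]) - k + 1
    out = [[[0] * w for _ in range(h)] for _ in range(n_out)]
    for x in range(n_in):              # level 1: one processing engine per input channel
        for y in range(n_out):         # level 2: one processing unit per output channel
            for r in range(h):
                for c in range(w):     # level 3: pixel-level (pixel pairs share a multiplier)
                    acc = 0
                    for i in range(k):
                        for j in range(k):
                            acc += feature_maps[x][r + i][c + j] * kernels[x][y][i][j]
                    out[y][r][c] += acc  # summed across input channels (third accumulator)
    return out
```

In hardware the x, y, and pixel loops are unrolled into the engine array, the unit array, and the packed multiplies, respectively.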
The convolutional layer accelerator then performs accumulation and activation operations on the multiple convolution results to obtain the block floating point output result of the convolutional layer.
Specifically, the convolutional layer accelerator further includes multiple third accumulators, each with a corresponding activation module; each third accumulator connects to one processing unit in every processing engine. The activation module may implement the ReLU activation function.
For example, as shown in Fig. 2, the convolutional layer accelerator may include 64 third accumulators A1, A2, ..., A64, each connected to one processing unit in every processing engine: accumulator A1 connects to units PU1_1, PU2_1, ..., PU64_1, accumulator A2 to units PU1_2, PU2_2, ..., PU64_2, and so on. The processing units connected to one third accumulator all use convolution kernels of the same output channel.
Each third accumulator sums the convolution results that the different processing units obtained with the kernels of that same output channel, producing a third accumulation result, which it outputs to its corresponding activation module.
The multiple processing engines perform convolution on different input channels simultaneously, and their results are summed in the accumulators.
Each activation module applies the activation operation to the third accumulation result of its corresponding third accumulator, yielding the block floating point output result of the convolutional layer.
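The cross-engine accumulation and activation can be sketched as follows. Here `convolution_results[x][y]` stands for the result of processing unit y in engine x, i.e. input channel x's contribution to output channel y (illustrative naming, not from the patent):

```python
def relu(v):
    """ReLU activation, one possible choice for the activation module."""
    return v if v > 0 else 0

def third_accumulate_and_activate(convolution_results):
    """One third accumulator per output channel sums that channel's
    partial results across all processing engines; its activation
    module then applies ReLU."""
    n_engines = len(convolution_results)
    n_units = len(convolution_results[0])
    outputs = []
    for y in range(n_units):          # one third accumulator per output channel
        acc = sum(convolution_results[x][y] for x in range(n_engines))
        outputs.append(relu(acc))     # activation module
    return outputs
```

With 64 engines of 64 units each, this corresponds to the 64 accumulators A1..A64 of Fig. 2.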
In one possible implementation, the convolutional layer output buffer stores the block floating point output result and, for any layer that is not the last convolutional layer, sends it to the block-floating-point to floating-point converter.
The block-floating-point to floating-point converter converts the block floating point output result into the convolutional layer's floating-point output result, which serves as the layer's output feature map.
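Because each product carries the sum of the input-map and kernel block exponents, the block-FP to float conversion reduces to a power-of-two scaling, i.e. a shift in hardware. A minimal sketch under that assumption (the `frac_bits` fixed-point fraction width and the function name are illustrative):

```python
def bfp_output_to_float(mantissas, in_exp, kernel_exp, frac_bits):
    """Convert a layer's block-FP output mantissas to floats: the
    output block exponent is in_exp + kernel_exp, so only a scale by
    a power of two (a shift) is required."""
    scale = 2.0 ** (in_exp + kernel_exp - frac_bits)
    return [m * scale for m in mantissas]
```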
The output feature map of an even-numbered convolutional layer that is not the last layer may be stored in the third partition as the first input feature map of the next layer; the output feature map of an odd-numbered layer that is not the last layer may be stored in the sixth partition for the same purpose.
The block exponent bits of the output feature map may be stored in the convolutional layer input buffer as the block exponent bits of the next layer's second input feature map group.
For the last convolutional layer, the convolutional layer output buffer sends the block floating point output result to the fully connected layer input buffer, which receives, stores, and forwards it to the fully connected layer accelerator. The fully connected layer accelerator performs the fully connected operation on this result to obtain the block floating point final result, which it sends to the fully connected layer output buffer.
The fully connected layer output buffer sends the block floating point final result to the block-floating-point to floating-point converter, which converts it into the floating-point final result. The floating-point final result can be the output vector of the fully connected layer, which the converter may store in the fourth partition.
Both the floating-point to block-floating-point conversion and the block-floating-point to floating-point conversion require only shift operations, and round-half-up is applied during conversion to avoid error accumulation.
Fig. 5 shows a flowchart of the block-floating-point-based acceleration method for convolutional neural networks according to an embodiment of the disclosure. Fig. 6 shows the data flow of a single output channel of the convolutional neural network accelerator according to an embodiment of the disclosure. As shown in Fig. 5, the method comprises:
Step S10: the convolutional layer input buffer reads the first input feature map group, the first bias set, and the first convolution kernel group of a convolutional layer.
Before step S10, the first input feature map group can be written through the PCIe interface into the first partition of the first memory, and the first convolution kernels and first biases written into the corresponding partitions of the first or second memory according to the layer they belong to.
In one possible implementation, the parameters of the odd-numbered convolutional and fully connected layers can be written into the second partition of the first memory, and the parameters of the even-numbered layers into the fifth partition of the second memory. The parameters of the convolutional and fully connected layers are the convolution kernels, biases, and block exponent bits described above.
Following this storage scheme, the convolutional layer input buffer can read the first input feature map group, first bias set, and first convolution kernel group of the convolutional layer to be processed from the corresponding locations.
Reading all parameters of a convolutional layer directly into on-chip memory at the initial stage can improve throughput.
Step S11: floating-point to block-floating-point conversion is applied to the first input feature map group and the first convolution kernel group of the convolutional layer, yielding the second input feature map group and the second convolution kernel group in block floating point format; the first bias set is shifted to obtain the second bias set.
The second bias set remains fixed-point data. The floating-point to block-floating-point conversion is as described above and is not repeated here.
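Shifting the first bias set into the second bias set amounts to aligning each fixed-point bias with the exponent of the input × kernel products so it can be added straight into the accumulator. A sketch under that assumption (variable names are illustrative, not from the patent):

```python
def shift_bias(bias_fixed, bias_exp, in_exp, kernel_exp):
    """Align a fixed-point bias to the product exponent
    (in_exp + kernel_exp). Only a shift is needed; right shifts
    round half up to avoid accumulating truncation error."""
    shift = bias_exp - (in_exp + kernel_exp)
    if shift >= 0:
        return bias_fixed << shift
    # arithmetic right shift with round half up
    return (bias_fixed + (1 << (-shift - 1))) >> -shift
```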
Step S12: the second input feature map group, the second convolution kernel group, and the second bias set of the convolutional layer are sent to the convolutional layer accelerator.
Step S13: the convolutional layer accelerator performs the convolution multiply-add operations on the second input feature map group, the second convolution kernel group, and the second bias set to obtain the block floating point output result of the convolutional layer.
As shown in Fig. 6, the convolution multiply-add operations are carried out on the second input feature map group, the second convolution kernel group, and the second bias set.
Step S14: if the convolutional layer is not the last convolutional layer, the convolutional layer output buffer sends the block floating point output result to the block-floating-point to floating-point converter (see Fig. 6); if it is the last convolutional layer, the output buffer sends the result to the fully connected layer input buffer.
Step S15: the block-floating-point to floating-point converter converts the block floating point output result into the convolutional layer's floating-point output result, which serves as the layer's output feature map.
As shown in Fig. 6, the output feature map of an even-numbered convolutional layer that is not the last layer may be stored in the third partition (in the first memory of the external memory) as the first input feature map of the next layer, and the output feature map of an odd-numbered layer that is not the last layer may be stored in the sixth partition (in the second memory of the external memory) for the same purpose.
Step S16: the fully connected layer input buffer receives the block floating point output result of the last convolutional layer and sends it to the fully connected layer accelerator.
Step S17: the fully connected layer accelerator performs the fully connected operation on the block floating point output result of the last convolutional layer, obtains the block floating point final result, and sends it to the fully connected layer output buffer.
Step S18: the fully connected layer output buffer sends the block floating point final result to the block-floating-point to floating-point converter, which converts it into the floating-point final result.
By converting the input feature map group and convolution kernels into block floating point, the disclosure replaces traditional floating-point arithmetic with block floating point arithmetic; both the inputs and outputs of the convolution computation are in fixed-point format. This sidesteps the absence of dedicated floating-point units on FPGAs and the high cost of floating-point arithmetic there, significantly reduces the power consumption of deploying a convolutional neural network accelerator on an FPGA platform, and improves throughput.
Both conversions require only shift operations, with round-half-up applied to avoid error accumulation; data transfer and convolution use only the block floating point mantissas, and the data bit width can be widened during computation, so no bit truncation occurs. The disclosure therefore preserves the accuracy of the convolutional neural network model and prevents drift of the model parameters during forward propagation, so no retraining of the model is needed for forward inference. Different convolutional neural network models can run on the disclosed accelerator simply by adjusting its parameter configuration.
The embodiments of the present disclosure have been described above. The description is exemplary rather than exhaustive, and the disclosure is not limited to the embodiments shown. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the illustrated embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or improvements over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (10)
1. A convolutional neural network accelerator, characterized by comprising:
a floating-point to block-floating-point converter that converts the first input feature map group and the first convolution kernel group of a convolutional layer into a second input feature map group and a second convolution kernel group, respectively, wherein the data in the second input feature map group and the second convolution kernel group are block floating point data;
a shifter that converts the first bias set of the convolutional layer into a second bias set according to the block exponents of the data in the second input feature map group and the second convolution kernel group, wherein the data in the second bias set are fixed-point data;
a convolutional layer accelerator that performs convolution multiply-add operations on the second input feature map group, the second convolution kernel group, and the second bias set to obtain the block floating point output result of the convolutional layer; and
a block-floating-point to floating-point converter that converts the block floating point output result of the convolutional layer into the floating-point output result of the convolutional layer as the layer's output feature map.
2. The device according to claim 1, characterized in that the convolutional layer accelerator comprises multiple processing engines, and obtaining the block floating point output result comprises:
each processing engine obtaining its corresponding convolution kernels from the second convolution kernel group;
each processing engine obtaining its corresponding second input feature map from the second input feature map group;
all processing engines simultaneously performing convolution operations on their corresponding second input feature maps and convolution kernels to obtain multiple convolution results; and
the convolutional layer accelerator performing accumulation and activation operations on the multiple convolution results to obtain the block floating point output result of the convolutional layer.
3. The device according to claim 2, characterized in that each processing engine comprises multiple processing units, and each processing engine obtaining its corresponding convolution kernels comprises each processing unit in the engine obtaining its own corresponding convolution kernel.
4. The device according to claim 3, characterized in that the convolution operation performed by a processing engine comprises multiple convolution operations performed by its processing units; each time the processing units of an engine simultaneously perform a convolution operation, they share the pixels that the convolution window fetches from the engine's corresponding second input feature map, the window fetching from a different position for each convolution operation.
5. The device according to claim 4, characterized in that the processing units of a processing engine simultaneously perform the following convolution operation multiple times to obtain each unit's convolution result: each processing unit convolving the pixels fetched for that operation with its corresponding convolution kernel to obtain its convolution result.
6. The device according to claim 5, characterized in that the pixels of one convolution operation comprise a first pixel group and a second pixel group fetched by the convolution window in two passes, and the processing unit comprises a multiplier, a first accumulator, a second accumulator, a first register connected to the first accumulator, and a second register connected to the second accumulator;
the processing unit obtaining its convolution result comprises:
the multiplier, each time, combining a first pixel from the first pixel group and a second pixel from the second pixel group into a third pixel group, and multiplying the third pixel group by the weight of the convolution kernel corresponding to the first and second pixels to obtain a product, wherein the first and second pixels each have M bits, M being a positive integer, and the third pixel group is formed by the first pixel, M vacant bits, and the second pixel in sequence;
the first accumulator accumulating the first 2M bits of the product to obtain the first accumulation result corresponding to the first pixel group, the first register storing the first accumulation result obtained each time;
the second accumulator accumulating the last 2M bits of the product to obtain the second accumulation result corresponding to the second pixel group, the second register storing the second accumulation result obtained each time; and
the first and second accumulation results forming the convolution result of the processing unit.
7. The device according to claim 1, characterized in that the convolutional layer accelerator further comprises multiple third accumulators, each with a corresponding activation module, each third accumulator connecting to one processing unit in each of multiple processing engines;
the accumulation and activation operations comprise:
each third accumulator summing the convolution results obtained by different processing units using convolution kernels of the same output channel to obtain a third accumulation result and outputting it to the corresponding activation module; and
each activation module applying the activation operation to the third accumulation result of its corresponding third accumulator to obtain the block floating point output result of the convolutional layer.
8. The device according to claim 1, characterized in that the convolutional neural network accelerator further comprises a storage module, the storage module comprising a first memory with a first partition, a second partition, a third partition, and a fourth partition, wherein:
the first partition stores the first input feature map group of the first convolutional layer;
the second partition stores the first convolution kernel groups and first bias sets of the odd-numbered convolutional layers;
the third partition stores the output feature maps of the even-numbered convolutional layers that are not the last layer; and
the fourth partition stores the output vector of the fully connected layer.
9. The device according to claim 1 or 8, characterized in that the storage module comprises a second memory with a fifth partition and a sixth partition, wherein the fifth partition stores the first convolution kernel groups and first bias sets of the even-numbered convolutional layers, and the sixth partition stores the output feature maps of the odd-numbered convolutional layers that are not the last layer.
10. The device according to claim 8 or 9, characterized in that the convolutional neural network accelerator further comprises:
a convolutional layer input buffer, connected to the floating-point to block-floating-point converter, the block-floating-point to floating-point converter, and the convolutional layer accelerator, for storing the second input feature map group, second convolution kernel group, and second bias set of a convolutional layer and sending them to the convolutional layer accelerator;
a convolutional layer output buffer, connected to the block-floating-point to floating-point converter, the convolutional layer accelerator, and the fully connected layer input buffer, for storing the block floating point output result, sending the block floating point output result of any layer that is not the last convolutional layer to the block-floating-point to floating-point converter, and sending the block floating point output result of the last convolutional layer to the fully connected layer input buffer;
a fully connected layer input buffer, connected to the fully connected layer accelerator, for receiving and storing the block floating point output result of the last convolutional layer and sending it to the fully connected layer accelerator;
a fully connected layer accelerator, connected to the fully connected layer output buffer, for performing the fully connected operation on the block floating point output result of the last convolutional layer to obtain the block floating point final result and sending it to the fully connected layer output buffer; and
a fully connected layer output buffer, connected to the block-floating-point to floating-point converter, for sending the block floating point final result to the block-floating-point to floating-point converter so that it is converted into the floating-point final result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810865157.9A CN109063825B (en) | 2018-08-01 | 2018-08-01 | Convolutional neural network accelerator |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109063825A true CN109063825A (en) | 2018-12-21 |
CN109063825B CN109063825B (en) | 2020-12-29 |
Family
ID=64832421
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810865157.9A Active CN109063825B (en) | 2018-08-01 | 2018-08-01 | Convolutional neural network accelerator |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109063825B (en) |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109409509A (en) * | 2018-12-24 | 2019-03-01 | 济南浪潮高新科技投资发展有限公司 | A kind of data structure and accelerated method for the convolutional neural networks accelerator based on FPGA |
CN109697083A (en) * | 2018-12-27 | 2019-04-30 | 深圳云天励飞技术有限公司 | Fixed point accelerated method, device, electronic equipment and the storage medium of data |
CN109740733A (en) * | 2018-12-27 | 2019-05-10 | 深圳云天励飞技术有限公司 | Deep learning network model optimization method, device and relevant device |
CN109901814A (en) * | 2019-02-14 | 2019-06-18 | 上海交通大学 | Customized floating number and its calculation method and hardware configuration |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107239829A (en) * | 2016-08-12 | 2017-10-10 | Beijing Deephi Technology Co., Ltd. | Method for optimizing an artificial neural network |
CN108133270A (en) * | 2018-01-12 | 2018-06-08 | Tsinghua University | Convolutional neural network acceleration method and device |
CN108229670A (en) * | 2018-01-05 | 2018-06-29 | Suzhou Institute for Advanced Study, University of Science and Technology of China | FPGA-based deep neural network acceleration platform |
- 2018-08-01: Application CN201810865157.9A filed in China; granted as patent CN109063825B (status: Active)
Non-Patent Citations (3)
Title |
---|
CHUNSHENG MEI et al.: "A 200MHZ 202.4GFLOPS@10.8W VGG16 accelerator in Xilinx VX690T", 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP) * |
MARIO DRUMOND et al.: "End-to-End DNN Training with Block Floating Point Arithmetic", arXiv:1804.01526v2 * |
ZHOURUI SONG et al.: "Computation Error Analysis of Block Floating Point Arithmetic Oriented Convolution Neural Network Accelerator Design", arXiv:1709.07776v2 * |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109409509A (en) * | 2018-12-24 | 2019-03-01 | Jinan Inspur Hi-Tech Investment and Development Co., Ltd. | Data structure and acceleration method for an FPGA-based convolutional neural network accelerator |
CN109697083B (en) * | 2018-12-27 | 2021-07-06 | Shenzhen Intellifusion Technologies Co., Ltd. | Fixed-point acceleration method and device for data, electronic device and storage medium |
CN109697083A (en) * | 2018-12-27 | 2019-04-30 | Shenzhen Intellifusion Technologies Co., Ltd. | Fixed-point acceleration method and device for data, electronic device and storage medium |
CN109740733A (en) * | 2018-12-27 | 2019-05-10 | Shenzhen Intellifusion Technologies Co., Ltd. | Deep learning network model optimization method, apparatus and related device |
CN113273082A (en) * | 2018-12-31 | 2021-08-17 | Microsoft Technology Licensing, LLC | Neural network activation compression with outlier block floating-point |
CN109901814A (en) * | 2019-02-14 | 2019-06-18 | Shanghai Jiao Tong University | Customized floating-point number, its computation method and hardware architecture |
CN110059817A (en) * | 2019-04-17 | 2019-07-26 | Sun Yat-sen University | Method for implementing a convolver with low resource consumption |
CN110147252A (en) * | 2019-04-28 | 2019-08-20 | DeepBlue Technology (Shanghai) Co., Ltd. | Parallel computing method and device for convolutional neural networks |
CN110059823A (en) * | 2019-04-28 | 2019-07-26 | University of Science and Technology of China | Deep neural network model compression method and device |
WO2020238472A1 (en) * | 2019-05-30 | 2020-12-03 | ZTE Corporation | Machine learning engine implementation method and apparatus, terminal device, and storage medium |
CN110442323B (en) * | 2019-08-09 | 2023-06-23 | Fudan University | Device and method for performing floating-point or fixed-point multiply-add operations |
CN110442323A (en) * | 2019-08-09 | 2019-11-12 | Fudan University | Architecture and method for performing floating-point or fixed-point multiply-add operations |
CN110930290A (en) * | 2019-11-13 | 2020-03-27 | Neusoft Reach Automotive Technology (Shenyang) Co., Ltd. | Data processing method and device |
CN111047010A (en) * | 2019-11-25 | 2020-04-21 | Tianjin University | Method and device for reducing first-layer convolution computation delay of a CNN accelerator |
CN111091183A (en) * | 2019-12-17 | 2020-05-01 | Shenzhen Corerain Technologies Co., Ltd. | Neural network acceleration system and method |
CN111091183B (en) * | 2019-12-17 | 2023-06-13 | Shenzhen Corerain Technologies Co., Ltd. | Neural network acceleration system and method |
CN111178508A (en) * | 2019-12-27 | 2020-05-19 | Zhuhai Eeasy Technology Co., Ltd. | Operation device and method for executing a fully connected layer in a convolutional neural network |
CN111178508B (en) * | 2019-12-27 | 2024-04-05 | Zhuhai Eeasy Technology Co., Ltd. | Computing device and method for executing a fully connected layer in a convolutional neural network |
CN111738427B (en) * | 2020-08-14 | 2020-12-29 | University of Electronic Science and Technology of China | Neural network operation circuit |
CN111738427A (en) * | 2020-08-14 | 2020-10-02 | University of Electronic Science and Technology of China | Neural network operation circuit |
WO2022041188A1 (en) * | 2020-08-31 | 2022-03-03 | SZ DJI Technology Co., Ltd. | Accelerator for neural network, acceleration method and device, and computer storage medium |
CN112232499A (en) * | 2020-10-13 | 2021-01-15 | Huazhong Institute of Electro-Optics (No. 717 Research Institute of China Shipbuilding Industry Corporation) | Convolutional neural network accelerator |
CN112734020A (en) * | 2020-12-28 | 2021-04-30 | The 15th Research Institute of China Electronics Technology Group Corporation | Convolution multiply-accumulate hardware acceleration device, system and method for a convolutional neural network |
CN113554163B (en) * | 2021-07-27 | 2024-03-29 | Shenzhen SmartMore Technology Co., Ltd. | Convolutional neural network accelerator |
CN113554163A (en) * | 2021-07-27 | 2021-10-26 | Shenzhen SmartMore Technology Co., Ltd. | Convolutional neural network accelerator |
CN113780523A (en) * | 2021-08-27 | 2021-12-10 | Shenzhen Intellifusion Technologies Co., Ltd. | Image processing method and device, terminal device and storage medium |
CN113780523B (en) * | 2021-08-27 | 2024-03-29 | Shenzhen Intellifusion Technologies Co., Ltd. | Image processing method and device, terminal device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109063825B (en) | 2020-12-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109063825A (en) | Convolutional neural networks accelerator | |
WO2021004366A1 (en) | Neural network accelerator based on structured pruning and low-bit quantization, and method | |
CN108805266B (en) | Reconfigurable CNN high-concurrency convolution accelerator | |
CN108108809B (en) | Hardware architecture for inference acceleration of a convolutional neural network and working method thereof | |
CN109447241B (en) | Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things | |
CN108564168A (en) | Design method for a multi-precision convolutional neural network processor | |
CN107832082A (en) | Apparatus and method for performing artificial neural network forward operations | |
CN108090560A (en) | Design method for an FPGA-based LSTM recurrent neural network hardware accelerator | |
CN110070178A (en) | Convolutional neural network computing device and method | |
CN109635944A (en) | Sparse convolutional neural network accelerator and implementation method | |
CN108154229B (en) | Image processing method based on FPGA (field programmable Gate array) accelerated convolutional neural network framework | |
CN110516801A (en) | High-throughput dynamically reconfigurable convolutional neural network accelerator architecture | |
CN107066239A (en) | Hardware architecture for implementing convolutional neural network forward computation | |
CN109934336A (en) | Neural network dynamic acceleration platform design method based on optimal structure search, and neural network dynamic acceleration platform | |
CN108764466A (en) | FPGA-based convolutional neural network hardware and its acceleration method | |
CN111242277A (en) | FPGA-based convolutional neural network accelerator supporting sparse pruning | |
CN110163354A (en) | Computing device and method | |
CN108629411A (en) | Convolution operation hardware implementation apparatus and method | |
CN109086879B (en) | Method for realizing dense connection neural network based on FPGA | |
CN109615071A (en) | High-energy-efficiency neural network processor, acceleration system and method | |
CN110321997A (en) | Highly parallel computing platform, system and computation implementation method | |
CN108596331A (en) | Optimization method for cellular neural network hardware architecture | |
CN108491924B (en) | Neural network data serial flow processing device for artificial intelligence calculation | |
CN113298237A (en) | Convolutional neural network on-chip training accelerator based on FPGA | |
CN113222129B (en) | Convolution operation processing unit and system based on multi-level cache reuse |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||