CN109063825B - Convolutional neural network accelerator

Info

Publication number
CN109063825B
Authority
CN
China
Prior art keywords: convolution; floating point; layer; group; block floating
Legal status: Active
Application number: CN201810865157.9A
Other languages: Chinese (zh)
Other versions: CN109063825A (en)
Inventors: 季向阳, 连晓聪
Current Assignee: Tsinghua University
Original Assignee: Tsinghua University
Application events:
    • Application filed by Tsinghua University
    • Priority claimed to CN201810865157.9A
    • Publication of CN109063825A
    • Application granted
    • Publication of CN109063825B
    • Status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Abstract

The present disclosure relates to a convolutional neural network acceleration device. By converting the input feature map group and the convolution kernels into block floating point format, traditional floating point operations are replaced with block floating point operations, and the inputs and outputs of the convolution calculation are in fixed point format, which solves the difficult problem of the huge cost of floating point arithmetic on an FPGA. The floating point-block floating point and block floating point-floating point conversion processes require only shift operations, and rounding is applied during conversion to avoid error accumulation; data transmission and convolution calculation need only the mantissa part of the block floating point format, and the data bit width can be expanded during calculation so that no bit truncation occurs. The accuracy of the convolutional neural network model is therefore preserved, and drift of the model parameters during forward propagation is effectively avoided, so the model does not need to be retrained for forward inference.

Description

Convolutional neural network accelerator
Technical Field
The present disclosure relates to the field of neural network technology, and in particular, to a convolutional neural network acceleration apparatus.
Background
Convolutional neural networks perform extremely well in artificial intelligence applications, particularly image recognition, natural language processing and strategic deployment. Their success comes primarily from the increase in performance of computing devices. However, as the number of network layers increases, the weight data of a convolutional neural network can reach hundreds of megabits and even exceed a gigabit, and the network consumes enormous computing resources for forward feature extraction, classification and error propagation. Accelerating the convolutional neural network is therefore key to improving the computational efficiency of the convolutional neural network model.
In the related art, an FPGA (Field-Programmable Gate Array) is a programmable logic gate array with excellent parallel computing capability. A purpose-designed FPGA offers low power consumption, high speed and reconfigurability, and can be dedicated to a specific deterministic task without running an operating system, which reduces the chance of problems. Convolutional neural network acceleration schemes based on FPGA platforms have therefore become a hot development direction.
However, there are two major obstacles to deploying convolutional neural networks on FPGA platforms: the FPGA off-chip transmission bottleneck and the huge cost of floating point arithmetic. Off-chip transmission mainly comes from frequent accesses to network parameters and feature maps, and the bandwidth requirement of the FPGA grows as the number of network layers increases. The lack of floating point arithmetic units on the FPGA reduces both throughput and power efficiency when floating point operations are performed on the FPGA.
Many approaches, such as data reuse, compression and pruning, have been proposed to meet the bandwidth requirements of FPGAs. However, these methods require retraining or entropy encoding of the network, which consumes more time than running the original network and prevents real-time processing of convolutional neural networks.
Fixed point operations are often used to replace floating point operations to improve the computational performance of FPGAs. However, a common drawback of these approaches is the need for retraining to update the parameters. Retraining is a very resource-consuming process and demands even more hardware resources when applied to a deep network model.
Disclosure of Invention
In view of the above, the present disclosure provides a convolutional neural network acceleration apparatus to solve the above problems and improve throughput and power efficiency.
According to an aspect of the present disclosure, there is provided a convolutional neural network acceleration apparatus, including:
a floating point-block floating point converter, used for respectively converting the first input feature map group and the first convolution kernel group of a convolution layer to generate a second input feature map group and a second convolution kernel group;
wherein the data in the second input feature map group and the second convolution kernel group are block floating point data;
a shifter, used for converting the first bias set of the convolution layer into a second bias set according to the block exponents of the data in the second input feature map group and the second convolution kernel group;
wherein the data in the second bias set are fixed point data;
a convolution layer accelerator, used for performing convolution multiply-add operations according to the second input feature map group, the second convolution kernel group and the second bias set to obtain a block floating point output result of the convolution layer;
and a block floating point-floating point converter, used for converting the block floating point output result of the convolution layer to obtain the floating point output result of the convolution layer as the output feature map of the convolution layer.
In one possible implementation, the convolutional layer accelerator comprises a plurality of processing engines;
the convolutional layer accelerator performs convolutional multiply-add operation according to the second input feature map group, the second convolutional kernel group and the second bias set to obtain a block floating point output result of the convolutional layer, and includes:
each processing engine acquires a plurality of convolution kernels corresponding to the processing engine from the second convolution kernel group respectively;
each processing engine acquires a second input feature map corresponding to the processing engine from the second input feature map group;
each processing engine simultaneously performs convolution operation according to the second input feature map and the convolution kernel corresponding to the processing engine to obtain a plurality of convolution results;
and the convolutional layer accelerator performs accumulation operation and activation operation on the plurality of convolution results to obtain a block floating point output result of the convolutional layer.
In one possible implementation, each processing engine includes a plurality of processing units;
each processing engine respectively acquires a plurality of convolution kernels corresponding to the processing engine from the second convolution kernel group, and the method comprises the following steps:
each processing unit in the processing engine respectively acquires a convolution kernel corresponding to each processing unit.
In one possible implementation, the convolution operation performed by the processing engine includes a plurality of convolution operations performed by processing units in the processing engine;
and when the plurality of processing units in the processing engine simultaneously perform convolution operation each time, the plurality of processing units in the processing engine share a plurality of pixels acquired by the processing engine from the second input feature map corresponding to the processing engine through the convolution window, wherein the positions of the pixels acquired by the convolution window from the second input feature map corresponding to the processing engine are different when the processing unit performs convolution operation each time.
In one possible implementation, a plurality of processing units in the processing engine perform the following convolution operations multiple times at the same time, so as to obtain a convolution result corresponding to each processing unit:
and the processing unit simultaneously performs convolution operation according to the obtained multiple pixels and convolution kernels corresponding to the processing unit to obtain a convolution result corresponding to the processing unit.
In one possible implementation, the plurality of pixels in one convolution operation includes a first pixel group and a second pixel group obtained by the convolution window in two times, and the processing unit includes a multiplier, a first accumulator, a second accumulator, a first register connected to the first accumulator, and a second register connected to the second accumulator;
the processing unit performs convolution operation according to the plurality of pixels and convolution kernels corresponding to the processing unit to obtain convolution results corresponding to the processing unit, and the convolution results comprise:
the multiplier acquires a first pixel from the first pixel group and a second pixel from the second pixel group each time to form a third pixel group, and multiplies the third pixel group and weights corresponding to the first pixel and the second pixel in the convolution kernel to obtain a product;
the first pixel, the M-bit vacancy and the second pixel are sequentially formed into a third pixel group;
the first accumulator accumulates the first 2M bit data of the product to obtain a first accumulation result corresponding to a first pixel group;
the first register is used for storing a first accumulation result obtained by the first accumulator each time;
the second accumulator accumulates the post-2M bit data of the product to obtain a second accumulation result corresponding to a second pixel group;
the second register is used for storing a second accumulation result obtained by the second accumulator each time;
and the first accumulation result and the second accumulation result form a convolution result corresponding to the processing unit.
In one possible implementation, the convolutional layer accelerator further includes a plurality of third accumulators, and an activation module corresponding to each third accumulator, where each third accumulator is connected to one processing unit in the plurality of processing engines;
the convolutional layer accelerator performs accumulation operation and activation operation on the plurality of convolution results to obtain a block floating point output result of the convolutional layer, and the method comprises the following steps:
for each third accumulator, accumulating convolution results obtained by different processing units by using convolution kernels of the same output channel by the third accumulator to obtain a third accumulation result, and outputting the third accumulation result to an activation module corresponding to the third accumulator;
and for each activation module, performing activation operation on the third accumulation result obtained by the corresponding third accumulator by the activation module to obtain a block floating point output result of the convolutional layer.
In one possible implementation, the convolutional neural network acceleration device further includes a storage module, where the storage module includes a first memory, and the first memory includes a first partition, a second partition, a third partition, and a fourth partition;
the first partition is used for storing the first input feature map group corresponding to the first convolutional layer;
the second partition is used for storing the first convolution kernel group and the first bias set corresponding to odd-numbered convolutional layers;
the third partition is used for storing the output feature map corresponding to even-numbered convolutional layers that are not the last layer;
the fourth partition is used for storing the output vector of the fully-connected layer.
In one possible implementation, the storage module includes a second memory, and the second memory includes a fifth partition and a sixth partition;
the fifth partition is used for storing the first convolution kernel group and the first bias set corresponding to even-numbered convolutional layers;
and the sixth partition is used for storing the output feature map corresponding to odd-numbered convolutional layers that are not the last layer.
In one possible implementation, the convolutional neural network acceleration apparatus further includes:
a convolution layer input cache, connected to the floating point-block floating point converter, the block floating point-floating point converter and the convolution layer accelerator, used for storing the second input feature map group, the second convolution kernel group and the second bias set of the convolution layer and sending them to the convolution layer accelerator;
a convolution layer output cache, connected to the block floating point-floating point converter, the convolution layer accelerator and the fully-connected layer input cache, used for storing the block floating point output result, sending the block floating point output result of any convolution layer other than the last one to the block floating point-floating point converter, and sending the block floating point output result of the last convolution layer to the fully-connected layer input cache;
a fully-connected layer input cache, connected to the fully-connected layer accelerator, used for receiving and storing the block floating point output result of the last convolution layer and sending it to the fully-connected layer accelerator;
a fully-connected layer accelerator, connected to the fully-connected layer output cache, used for performing the fully-connected operation according to the block floating point output result of the last convolution layer to obtain a block floating point final result and sending it to the fully-connected layer output cache;
and a fully-connected layer output cache, connected to the block floating point-floating point converter, used for sending the block floating point final result to the block floating point-floating point converter so that the block floating point-floating point converter converts it into a floating point final result.
Advantageous effects:
According to the present disclosure, the input feature map group and the convolution kernels are respectively converted into block floating point format through floating point-block floating point conversion, traditional floating point operations are replaced by block floating point operations, and the inputs and outputs of the convolution calculation are in fixed point format. This neatly sidesteps the lack of floating point arithmetic units on an FPGA, solves the problem of the huge cost of floating point arithmetic on an FPGA, greatly reduces the power consumption of a convolutional neural network acceleration device deployed on an FPGA platform, and improves throughput.
The floating point-block floating point and block floating point-floating point conversion processes require only shift operations, and rounding is applied during conversion to avoid error accumulation; data transmission and convolution calculation need only the mantissa part of the block floating point format, and the data bit width can be expanded during calculation so that no bit truncation occurs. The accuracy of the convolutional neural network model is therefore preserved, drift of the model parameters during forward propagation is effectively avoided, the model does not need to be retrained for forward inference, and different convolutional neural network models can be configured on the disclosed acceleration device by adjusting the parameters.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a block diagram illustrating a convolutional neural network acceleration device, according to an exemplary embodiment.
Fig. 2 shows a schematic diagram of a convolutional layer accelerator according to an example of the present disclosure.
Fig. 3 shows a schematic diagram of a processing unit according to an example of the present disclosure.
Fig. 4 shows a schematic diagram of a data format according to an example of the present disclosure.
FIG. 5 illustrates a flow diagram of a block floating point operation based acceleration method of a convolutional neural network, according to an embodiment of the present disclosure.
Fig. 6 illustrates a data flow diagram for a single output channel of a convolutional neural network acceleration device, according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, devices, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
FIG. 1 is a block diagram illustrating a convolutional neural network acceleration device, according to an exemplary embodiment. The convolutional neural network acceleration device of the present disclosure may be applied to various FPGAs or ASICs (Application Specific Integrated Circuits), which are not limited herein. As shown in fig. 1, the convolutional neural network acceleration device may include: a floating point-block floating point converter, a shifter, a convolutional layer accelerator, and a block floating point-floating point converter.
The floating point-block floating point converter is used for respectively converting the first input characteristic graph group and the first convolution kernel group of the convolution layer to generate a second input characteristic graph group and a second convolution kernel group; and the data in the second input feature map group and the second convolution kernel group are block floating point data.
A shifter for converting the first bias set of convolution layers into a second bias set according to the second input feature map set and the block index of the data in the second convolution kernel set; and the data in the second bias set is fixed point data.
The data in the first input feature map group and the first convolution kernel group may both be floating point numbers.
Floating point numbers are used in computers to approximate arbitrary real numbers, similarly to scientific notation in base 10. For example, a real number may be expressed as an integer or fixed point number (i.e., the mantissa) multiplied by an integer power of some base (typically 2 in computers), and the normalized floating point representation has the form:

±m × β^e

where m is the mantissa; if the precision of m is p, then m is a p-digit number of the form ±d.ddd…ddd, with 0 ≤ d < β; β is the base and e is the exponent.

Fixed point data is another representation of numbers used in computers, in which the position of the radix point of the numbers taking part in the operation is fixed. For example, the Q format is written as Qm.n, denoting data with m bits for the integer part and n bits for the fractional part; m + n + 1 bits in total are needed to represent the data, with the extra bit serving as the sign bit.
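As an illustration of the Qm.n format just described, the following is a minimal Python sketch (the function names and the choice of Q4.3 are illustrative assumptions, not part of the disclosure):

```python
def to_q_format(value, m, n):
    """Encode a real value as a signed Qm.n fixed point integer: m integer bits,
    n fractional bits, plus one sign bit, i.e. m + n + 1 bits in total."""
    scaled = int(round(value * (1 << n)))           # shift the radix point right by n bits
    lo, hi = -(1 << (m + n)), (1 << (m + n)) - 1    # representable range of the signed integer
    return max(lo, min(hi, scaled))                 # saturate instead of overflowing

def from_q_format(fixed, n):
    """Decode a Qm.n fixed point integer back to a real value."""
    return fixed / (1 << n)

# Example: 3.14159 in Q4.3 (4 integer bits, 3 fractional bits, 1 sign bit = 8 bits)
q = to_q_format(3.14159, m=4, n=3)
print(q, from_q_format(q, n=3))   # 25 3.125  (quantization error from having only 3 fractional bits)
```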
A block floating point algorithm can be viewed as emulating floating point operations in software on a fixed point digital signal processor, in order to achieve higher computational accuracy and a larger dynamic range.
The floating point data within one block may share the same exponent. For example, suppose X is a block containing N floating point numbers; X can be represented as

X = {x_1, x_2, …, x_N}, with x_i = m_i × 2^(e_i)

where x_i is the i-th element in X, and m_i and e_i are the mantissa and exponent parts of x_i. The largest exponent in the block is defined as the block exponent ε_X:

ε_X = max(e_1, e_2, …, e_N)

After the block exponent is obtained, the mantissa of each x_i is right-shifted by d_i bits, where d_i = ε_X - e_i. The block floating point format X' of X is thus represented as

X' = {m'_1, m'_2, …, m'_N} × 2^(ε_X)

where m'_i = m_i >> d_i is the converted mantissa.
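For concreteness, the following Python sketch models this float to block floating point conversion and its inverse in software. The mantissa width MANT_BITS and the rounding behaviour are assumptions for illustration; the disclosure only requires that the conversion be a shift with rounding.

```python
import math

MANT_BITS = 8  # assumed mantissa width of the block floating point format

def float_to_bfp(block):
    """Convert a block of floats to (shared block exponent, fixed point mantissas).

    Each x_i is aligned to the largest exponent eps_X in the block; equivalently, its
    mantissa is right-shifted by d_i = eps_X - e_i (modelled here by rescaling and rounding)."""
    block_exp = max(math.frexp(x)[1] for x in block)    # eps_X, the block exponent
    scale = 1 << (MANT_BITS - 1)
    mants = []
    for x in block:
        m = int(round(x * scale / (2.0 ** block_exp)))  # express x relative to 2^eps_X
        mants.append(max(-scale, min(scale - 1, m)))    # saturate to the mantissa range
    return block_exp, mants

def bfp_to_float(block_exp, mants):
    """Inverse conversion: mantissa * 2^block_exp / 2^(MANT_BITS - 1)."""
    scale = 1 << (MANT_BITS - 1)
    return [m * (2.0 ** block_exp) / scale for m in mants]

exp, ms = float_to_bfp([0.75, -1.5, 0.1])
print(exp, ms)                    # 1 [48, -96, 6]
print(bfp_to_float(exp, ms))      # [0.75, -1.5, 0.09375] -- small rounding error on 0.1
```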
The first input feature map group may include first input feature maps corresponding to one or more input channels, where the first input feature maps may be represented as input feature maps obtained by performing feature extraction on data to be processed. For example, if the data to be processed is a color image, red light channel feature extraction, green light channel feature extraction, and blue light channel feature extraction may be performed on the color image, so as to obtain input feature maps corresponding to the red light channel, the green light channel, and the blue light channel, respectively.
In general, in image processing, for a given input image, each pixel value in the output image is a weighted average of pixel values in a small region in the input image, where the weight is defined by a function, which may be referred to as a convolution kernel. The first set of convolution kernels may include one or more first convolution kernels.
The floating point-block floating point converter can respectively obtain the first input characteristic graph group and the first convolution kernel group through the storage interface, and convert data in the first input characteristic graph group and the first convolution kernel group into block floating point data to obtain a second input characteristic graph group and a second convolution kernel group.
The convolutional layer accelerator may include a plurality of processing engines. For example, fig. 2 shows a schematic diagram of a convolutional layer accelerator according to an example of the present disclosure, which may include 16 processing engines PE1, PE2 … PE16, taking the processing of 16 input channels at a time as an example.
And each processing engine acquires a plurality of convolution kernels corresponding to the processing engine from the second convolution kernel group respectively. Each processing engine may include a plurality of processing units, one processing unit corresponding to one output channel, and each processing unit in the processing engine may obtain a convolution kernel corresponding to each processing unit, respectively.
The pixels of the input feature map of all input channels constitute one block, sharing the same block exponent bits. All weights (i.e., convolution kernels corresponding to all input channels) for each output channel constitute a block, sharing the same block exponent bits. The blocking method enables all input data to be aligned before convolution calculation, and only mantissa parts in block floating point format are needed for data transmission and convolution calculation.
As shown in fig. 2, one processing engine may include 64 processing units, and the 64 convolution kernels in the second convolution kernel group correspond to the 64 processing units in each processing engine; that is, each of the 64 processing units in a processing engine performs the convolution multiply-add operation with its corresponding convolution kernel. The exponent bit of a processing unit is equal to the sum of the exponent bit of the corresponding input feature map and the exponent bit of the weights (convolution kernel); for example, the exponent bit of PU1_1 is equal to the sum of the exponent bit of the input feature map of PE1 and the exponent bit of the convolution kernel corresponding to PU1_1.
The first set of offsets may include a plurality of first offsets, one first offset for each output channel of the convolutional layer.
The shifter may shift each first bias into a second bias in the corresponding fixed point format according to the difference between the exponent of the first bias of each output channel of the convolutional layer and the exponent bit of the processing unit (b1, b2 … b64 shown in fig. 2), thereby obtaining the second bias set.
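A minimal sketch of how such a shifter might align a floating point bias to the fixed point scale of one processing unit is given below. The exponent bookkeeping follows the description above, but the helper name, the mantissa width, and the assumption that both mantissas use the same width are illustrative, not fixed by the disclosure.

```python
def shift_bias_to_fixed(bias, feature_exp, weight_exp, mant_bits=8):
    """Align a per-output-channel bias to the accumulator scale of a processing unit.

    The processing unit's exponent is feature_exp + weight_exp (block exponent of the input
    feature map plus block exponent of its convolution kernel); the bias is expressed as a
    fixed point integer relative to that exponent."""
    pu_exp = feature_exp + weight_exp
    # two (mant_bits)-bit mantissas are multiplied, so the product scale is 2^(2*(mant_bits-1))
    scale = 1 << (2 * (mant_bits - 1))
    return int(round(bias * scale / (2.0 ** pu_exp)))

# Illustrative numbers: bias 0.25, feature map block exponent 1, kernel block exponent -2
print(shift_bias_to_fixed(0.25, feature_exp=1, weight_exp=-2))   # 8192
```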
The floating point to block floating point converter and shifter may also convert the parameters (convolution kernel and offset) of the fully-connected layer according to the above procedure.
The convolutional neural network acceleration device may further include a storage module. The storage module may include a first memory DDR3 M1 and a second memory DDR3 M0, each connected to the storage interface; the capacity of each of the first memory DDR3 M1 and the second memory DDR3 M0 may be 4 GB.
The first set of input signatures, the first set of convolution cores, and the first set of offsets may be stored to a first memory DDR3M1 and/or a second memory DDR3M 0.
The first memory may include a first partition, a second partition, a third partition, and a fourth partition, and the second memory may include a fifth partition and a sixth partition.
The convolutional neural network accelerating device may further include: and the PCIe interface is respectively connected with the first memory and the second memory, and data to be processed (for example, the first input feature map group) and parameters (for example, the first bias set and the first convolution kernel group of the convolution layer) can be written into the first memory and the second memory through the PCIe interface.
The first partition is used for storing a first input feature map set corresponding to the first layer convolutional layer, for example, the first input feature map set may be written into the first partition of the first memory through a PCIe interface.
The second partition is used for storing parameters of the convolution layer and the odd layer of the full connection layer, and the parameters can comprise convolution kernels and offset. For example, the second partition may be used to store a first set of convolution kernels and a first set of offsets corresponding to odd layer convolution layers.
The fifth partition is used for storing parameters of even layers of the convolutional layer and the fully-connected layer, and the parameters can comprise convolutional kernels and offset. For example, the fifth partition may be used to store a first set of convolution kernels and a first set of offsets corresponding to even layer convolutional layers.
The third partition is used for storing the output feature map corresponding to even-numbered convolutional layers that are not the last layer, and the sixth partition is used for storing the output feature map corresponding to odd-numbered convolutional layers that are not the last layer. The fourth partition is used for storing the output vector of the fully-connected layer.
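The ping-pong addressing implied by this partitioning can be sketched as follows; the partition names follow the description above, while the function itself is only an illustration and not part of the disclosure.

```python
def select_partitions(layer_index, is_last_layer):
    """Return (parameter partition, output destination) for a 1-indexed convolutional layer.

    Odd-numbered layers read parameters from the second partition (first memory, DDR3 M1)
    and, unless they are the last layer, write their output feature map to the sixth
    partition (second memory, DDR3 M0); even-numbered layers do the opposite, so parameter
    reads and feature map writes ping-pong between the two memories."""
    odd = (layer_index % 2 == 1)
    params = "second partition (DDR3 M1)" if odd else "fifth partition (DDR3 M0)"
    if is_last_layer:
        output = "fully-connected layer input cache"
    else:
        output = "sixth partition (DDR3 M0)" if odd else "third partition (DDR3 M1)"
    return params, output

for layer in range(1, 5):
    print(layer, select_partitions(layer, is_last_layer=(layer == 4)))
```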
The convolutional neural network accelerating device may further include: a convolutional layer input cache, a convolutional layer output cache, a fully-connected layer input cache, a fully-connected layer accelerator, and a fully-connected layer output cache.
The convolution layer input cache is connected with the floating point-block floating point converter, the block floating point-floating point converter and the convolution layer accelerator, can be used for storing a second input characteristic graph group, a second convolution kernel group and a second bias set of the convolution layer, and sends the second input characteristic graph group, the second convolution kernel group and the second bias set of the convolution layer to the convolution layer accelerator.
The floating point-block floating point converter can read a first input feature map and a first convolution kernel group of the convolution layer from a first memory or a second memory through a storage interface (according to the number of layers of the convolution layer), respectively convert the first input feature map group and the first convolution kernel group of the convolution layer to generate a second input feature map group and a second convolution kernel group, and the shifter can read a first offset set of the convolution layer from the first memory or the second memory through the storage interface (according to the number of layers of the convolution layer) and shift the first offset set to obtain a second offset set.
The convolution layer input cache may store a second input feature map set and a second convolution kernel set of the convolution layer converted by the floating-point-to-block floating-point converter.
The convolutional layer input buffer may include a convolutional window, two pixel memories, two weight memories, and a block exponent bit memory. Thus, the convolutional layer input cache may store data read from the floating-point-to-block floating-point converter.
And the convolution layer accelerator performs convolution multiply-add operation according to the second input feature map group, the second convolution kernel group and the second bias set to obtain a block floating point output result of the convolution layer. The convolutional layer accelerator may perform a convolutional multiply-add operation according to a second input feature map set and a second convolutional kernel set of the convolutional layer stored by the convolutional layer input buffer, and a second offset set output by the shifter.
Specifically, each processing engine acquires, from the second convolution kernel group, the plurality of convolution kernels corresponding to that processing engine; that is, each processing unit in the processing engine acquires the convolution kernel corresponding to that processing unit.
As described above, each processing engine includes 64 processing units, each of which acquires a corresponding convolution kernel from the second convolution kernel group.
And each processing engine acquires a second input feature map corresponding to the processing engine from the second input feature map group.
And each processing engine simultaneously performs convolution operation according to the second input feature diagram corresponding to the processing engine and the convolution kernel to obtain a plurality of convolution results.
In one possible implementation, multiple convolution operations are performed by multiple processing units in a processing engine to obtain multiple convolution results. And when the plurality of processing units in the processing engine simultaneously perform convolution operation, the plurality of processing units in the processing engine share a plurality of pixels acquired by the processing engine from the second input feature map corresponding to the processing engine through the convolution window. The number of the plurality of pixels may be determined according to the number of the weights in the convolution kernel, for example, the convolution kernel is a2 × 2 matrix, and then 2 × 2 pixels may be obtained from the second input feature map, where the 2 × 2 pixels are also distributed in the form of a matrix in the second input feature map.
In this case, the convolution window may obtain pixels from the second input feature map corresponding to the processing engine at different positions each time the processing unit performs a convolution operation, and in one example, the step size of the convolution window obtaining pixels from the second input feature map may be 1.
Taking one convolution operation as an example, the following is performed simultaneously by the plurality of processing units: after obtaining the plurality of pixels, each processing unit performs a convolution operation according to those pixels and the convolution kernel corresponding to that processing unit, to obtain the convolution result corresponding to that processing unit.
In this convolution operation, the plurality of pixels may include a first pixel group and a second pixel group acquired by the convolution window in two passes. For example, the convolution window acquires 2 × 2 pixels from the second input feature map twice, with a step size of 1 between the two acquisitions, as the first pixel group and the second pixel group respectively. As shown in fig. 2, i_x(m, n) is the pixel of the x-th input channel at position (m, n).
Fig. 3 shows a schematic diagram of a processing unit according to an example of the present disclosure, which may include a multiplier, a first accumulator, a second accumulator, a first register coupled to the first accumulator, and a second register coupled to the second accumulator, as shown in fig. 3.
where k_xy is the convolution kernel of the x-th input channel for the y-th output channel.
In the convolution operation, the multiplier combines a first pixel obtained from the first pixel group and a second pixel obtained from the second pixel group into a third pixel group, and multiplies weights corresponding to the first pixel and the second pixel in the third pixel group and the convolution kernel to obtain a product.
The manner and order in which the multiplier acquires pixels from the first pixel group and from the second pixel group are the same.
For example, the first pixel and the second pixel each have a bit width of M, where M is a positive integer, and the first pixel, an M-bit gap, and the second pixel sequentially form the third pixel group. FIG. 4 illustrates a schematic diagram of a data format according to an example of the present disclosure. As shown in FIG. 4, A represents the third pixel group, with the first pixel in bits 0-7, an empty gap in bits 8-15, and the second pixel in bits 16-23 (taking M = 8 as an example); B is the weight corresponding to the first pixel and the second pixel in the convolution kernel corresponding to the processing unit.
The multiplier multiplies the third pixel group by the weight corresponding to the first pixel and the second pixel in the convolution kernel to obtain a product; that is, the first pixel in the third pixel group is multiplied by the corresponding weight and the second pixel is multiplied by the corresponding weight, and the resulting product is 4M bits wide. The two multiplications may be implemented by a single DSP48E1 slice as shown in fig. 3. The product is shown as P in fig. 3: the first 2M bits of the product are the result of multiplying the first pixel in the third pixel group by the corresponding weight, and the last 2M bits of the product are the result of multiplying the second pixel by the corresponding weight.
The first accumulator accumulates the first 2M bit data of the product to obtain a first accumulation result corresponding to a first pixel group; and the second accumulator accumulates the post-2M-bit data of the product to obtain a second accumulation result corresponding to the second pixel group.
The first register is used for storing a first accumulation result obtained by the first accumulator each time, and the second register is used for storing a second accumulation result obtained by the second accumulator each time.
And the first accumulation result and the second accumulation result form a convolution result corresponding to the processing unit. Both data transmission and convolution calculation only need the mantissa part in the block floating point format, and the data bit width can be expanded in the calculation process without bit truncation.
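The following Python sketch illustrates the packing trick described above: two M-bit pixels separated by an M-bit gap are multiplied by one weight in a single multiplication, and the two partial products are then read out of the lower and upper 2M bits, mirroring what one DSP48E1 multiply produces. Unsigned values with M = 8 are assumed purely for illustration; the hardware additionally has to handle signed mantissas.

```python
M = 8  # assumed pixel bit width

def packed_multiply(p1, p2, w):
    """Multiply two pixels by the same weight using a single multiplication.

    A = p1 | M-bit gap | p2 is formed as p1 + p2 * 2^(2M); then A * w = p1*w + p2*w * 2^(2M),
    so the first partial product sits in the lower 2M bits and the second one above it,
    provided each partial product fits in 2M bits."""
    a = p1 + (p2 << (2 * M))                   # third pixel group: p1, M-bit gap, p2
    product = a * w                            # one multiplication (one DSP operation in hardware)
    first = product & ((1 << (2 * M)) - 1)     # first 2M bits  -> p1 * w
    second = product >> (2 * M)                # remaining bits -> p2 * w
    return first, second

# Both partial products come out of one multiply:
print(packed_multiply(10, 200, 7))   # (70, 1400)
print(10 * 7, 200 * 7)               # sanity check: 70 1400
```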
Three levels of parallelism are designed into the convolution processing array: input channel parallelism, output channel parallelism, and pixel-level parallelism. The three-level parallel convolution processing array, together with a ping-pong storage structure, improves the computing performance of the system.
And the convolutional layer accelerator performs accumulation operation and activation operation on the plurality of convolution results to obtain a block floating point output result of the convolutional layer, and the block floating point output result is used as an output characteristic diagram of the convolutional layer.
Specifically, the convolutional layer accelerator further comprises a plurality of third accumulators and an activation module corresponding to each third accumulator, where each third accumulator is connected to one processing unit in each of the plurality of processing engines. The ReLU activation function may be included in the activation module.
For example, as shown in fig. 2, the convolutional layer accelerator may include 64 third accumulators a1, a2 … a64, each of which is connected to one processing unit of the plurality of processing engines, e.g., the accumulator a1 is connected to the processing unit PU1_1, PU2_1 … PU64_1, respectively, the accumulator a2 is connected to the processing unit PU1_2, PU2_2 … PU64_2, respectively, and so on. The convolution kernels used by all processing units connected to a third accumulator may be the same.
And for each third accumulator, accumulating convolution results obtained by different processing units by using convolution kernels of the same output channel by the third accumulator to obtain a third accumulation result, and outputting the third accumulation result to an activation module corresponding to the third accumulator.
Multiple processing engines simultaneously perform convolutions on different input channels and the results of the calculations are added in an accumulator.
And for each activation module, performing activation operation on the third accumulation result obtained by the corresponding third accumulator by the activation module to obtain a block floating point output result of the convolutional layer.
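As a sketch, the computation performed by one third accumulator and its activation module for a single output channel can be modelled as below. The ReLU activation follows the description above; the point at which the bias is injected is an assumption based on fig. 2 (b1 … b64 next to the accumulators), and the numbers are purely illustrative.

```python
def third_accumulator(partial_results, shifted_bias):
    """Sum the partial convolution results of the same output channel coming from the
    processing units of the different processing engines (i.e. from different groups of
    input channels) and add the fixed point bias produced by the shifter."""
    return sum(partial_results) + shifted_bias

def activation(x):
    """ReLU applied to the accumulated fixed point value."""
    return x if x > 0 else 0

# Example: 16 processing engines each contribute one partial sum for output channel 1
partials = [120, -35, 60, 0, 15, -5, 40, 10, 0, 25, -10, 5, 30, 0, 20, -15]
print(activation(third_accumulator(partials, shifted_bias=100)))   # 360
```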
In one possible implementation, the convolutional layer output cache is configured to store the block floating point output result, and send the block floating point output result that is not the last convolutional layer to the block floating point-to-floating point converter.
And the block floating point-floating point converter is used for converting the block floating point output result to obtain a floating point output result of the convolutional layer as an output characteristic diagram of the convolutional layer.
The output feature map of even-numbered convolutional layers that are not the last layer may be stored to the third partition as the first input feature map of the next convolutional layer, and the output feature map of odd-numbered convolutional layers that are not the last layer may be stored to the sixth partition as the first input feature map of the next convolutional layer.
The block exponent bits for the output feature map may be stored in a convolutional layer input buffer as block exponent bits for a second input feature map set for a next layer of convolutional layers.
And for the block floating point output result of the last layer of the convolution layer, the convolution layer output cache sends the block floating point output result to the full-connection layer input cache, and the full-connection layer input cache receives and stores the block floating point output result of the last layer of the convolution layer and sends the block floating point output result to the full-connection layer accelerator. And the full-connection layer accelerator is used for performing full-connection operation according to the block floating point output result of the last convolution layer to obtain a block floating point final result and sending the block floating point final result to a full-connection layer output cache.
And the full connection layer output cache is used for sending the block floating point final result to the block floating point-floating point converter so that the block floating point-floating point converter converts the block floating point final result into a floating point final result. The floating-point final result may be an output vector of the fully-connected layer, and the block floating-point-to-floating-point converter may store the output vector of the fully-connected layer to the fourth partition.
Both the floating point-to-block floating point and the block floating point-to-floating point conversion processes only need shift operation, and rounding operation is applied to the conversion process to avoid error accumulation.
FIG. 5 illustrates a flow diagram of a block floating point operation based acceleration method of a convolutional neural network, according to an embodiment of the present disclosure. Fig. 6 illustrates a data flow diagram for a single output channel of a convolutional neural network acceleration device, according to an embodiment of the present disclosure.
As shown in fig. 5, the method includes:
step S10, the convolutional layer input buffer reads the first input feature map group of the convolutional layer, and the first bias set and the first convolutional kernel group of the convolutional layer;
Before step S10, the first input feature map group may be written into the first partition of the first memory through the PCIe interface, and the first convolution kernel group and the first bias set may be written into the corresponding partition of the first memory or the second memory according to the convolutional layer.
In one possible implementation, the parameters of the odd layers of the convolutional layer and the fully-connected layer may be written into the second partition of the first memory, and the parameters of the even layers of the convolutional layer and the fully-connected layer may be written into the fifth partition of the second memory. The parameters of the convolutional layer and the fully-connected layer may be the convolutional kernels, offsets, and block exponent bits as described above.
The convolutional layer input cache may read the first input feature map group of the convolutional layer currently being processed, and the first bias set and the first convolution kernel group of that convolutional layer, from the corresponding locations according to the storage rules above.
All parameters of a convolutional layer are read directly into on-chip memory in the initial stage, which improves throughput.
Step S11, perform floating point-block floating point conversion on the first input feature map group and the first convolution kernel group of the convolution layer to obtain a second input feature map group and a second convolution kernel group in a block floating point data format, and shift the first bias set to obtain a second bias set.
Wherein the second bias set remains fixed-point data. The floating point to block floating point conversion is as described above and will not be described further.
Step S12, sending the second input feature map group, the second convolution kernel group, and the second bias set of the convolution layer to the convolution layer accelerator.
Step S13, the convolutional layer accelerator performs a convolution multiply-add operation according to the second input feature map group, the second convolutional kernel group, and the second bias set to obtain a block floating point output result of the convolutional layer.
As shown in fig. 6, a convolution multiply-add operation is performed according to the second input feature map group, the second convolution kernel group, and the second bias set.
Step S14, if the convolutional layer is not the last convolutional layer, the convolutional layer output cache sends the block floating point output result to the block floating point-to-floating point converter (as shown in fig. 6), and if the convolutional layer is the last convolutional layer, the convolutional layer output cache sends the block floating point output result to the fully-connected layer input cache.
In step S15, the block floating point-to-floating point converter converts the block floating point output result to obtain a floating point output result of the convolutional layer as the output characteristic diagram of the convolutional layer.
As shown in fig. 6, the output characteristic map for the even-numbered convolutional layers that are not the last layer may be stored in the third block (in the first memory in the external memory) as the first input characteristic map for the next-layer convolutional layer, and the output characteristic map for the odd-numbered convolutional layers that are not the last layer may be stored in the sixth block (in the second memory in the external memory) as the first input characteristic map for the next-layer convolutional layer.
In step S16, the fully-connected layer input cache receives the block floating point output result of the last convolutional layer and sends it to the fully-connected layer accelerator.
In step S17, the fully-connected layer accelerator performs the fully-connected operation according to the block floating point output result of the last convolutional layer to obtain the block floating point final result, and sends it to the fully-connected layer output cache.
In step S18, the fully-connected layer output cache sends the block floating point final result to the block floating point-floating point converter, so that the block floating point-floating point converter converts the block floating point final result into the floating point final result.
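To tie steps S11 to S15 together, the following self-contained Python sketch runs one convolution "layer" (a single input and output channel, one-dimensional for brevity) through the block floating point pipeline: float to block floating point conversion, integer-only convolution, bias addition, ReLU, and block floating point to float conversion. All bit widths and helper names are illustrative assumptions; the sketch models the data flow of fig. 5 and fig. 6, not the exact hardware.

```python
import math

MANT = 8  # assumed mantissa width

def to_bfp(values):
    """Floats -> (block exponent, fixed point mantissas), with rounding."""
    block_exp = max(math.frexp(v)[1] for v in values)
    scale = 1 << (MANT - 1)
    return block_exp, [int(round(v * scale / 2.0 ** block_exp)) for v in values]

def conv_layer_bfp(feature_map, kernel, bias):
    # S11: convert the input feature map and the convolution kernel to block floating point
    fe, fm = to_bfp(feature_map)
    ke, km = to_bfp(kernel)
    out_exp = fe + ke                           # exponent shared by every product and accumulation
    acc_scale = 1 << (2 * (MANT - 1))
    shifted_bias = int(round(bias * acc_scale / 2.0 ** out_exp))   # the shifter's output
    # S13: convolution multiply-add entirely on integer mantissas, followed by ReLU
    k = len(kernel)
    outputs = []
    for i in range(len(feature_map) - k + 1):
        acc = sum(fm[i + j] * km[j] for j in range(k)) + shifted_bias
        outputs.append(max(acc, 0))
    # S15: block floating point -> floating point output feature map
    return [o * 2.0 ** out_exp / acc_scale for o in outputs]

fmap   = [0.5, -0.25, 1.0, 0.75, -0.5]
kernel = [0.5, -1.0]
bias   = 0.1
print(conv_layer_bfp(fmap, kernel, bias))
# Ordinary floating point reference for comparison:
print([max(fmap[i] * kernel[0] + fmap[i + 1] * kernel[1] + bias, 0) for i in range(4)])
```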
According to the present disclosure, the input feature map group and the convolution kernels are respectively converted into block floating point format through floating point-block floating point conversion, traditional floating point operations are replaced by block floating point operations, and the inputs and outputs of the convolution calculation are in fixed point format. This neatly sidesteps the lack of floating point arithmetic units on an FPGA, solves the problem of the huge cost of floating point arithmetic on an FPGA, greatly reduces the power consumption of a convolutional neural network acceleration device deployed on an FPGA platform, and improves throughput.
The floating point-block floating point and block floating point-floating point conversion processes require only shift operations, and rounding is applied during conversion to avoid error accumulation; data transmission and convolution calculation need only the mantissa part of the block floating point format, and the data bit width can be expanded during calculation so that no bit truncation occurs. The accuracy of the convolutional neural network model is therefore preserved, drift of the model parameters during forward propagation is effectively avoided, the model does not need to be retrained for forward inference, and different convolutional neural network models can be configured on the disclosed acceleration device by adjusting the parameters.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the techniques in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (7)

1. A convolutional neural network acceleration device applied to a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC), comprising:
the floating point-block floating point converter is used for respectively converting the first input feature map group and the first convolution kernel group of a convolution layer to generate a second input feature map group and a second convolution kernel group;
wherein the data in the second input feature map group and the second convolution kernel group are block floating point data;
a shifter for converting the first bias set of the convolution layer into a second bias set according to the block exponents of the data in the second input feature map group and the second convolution kernel group;
wherein, the data in the second bias set is fixed point data;
the convolution layer accelerator is used for performing convolution multiply-add operation according to the second input feature map group, the second convolution kernel group and the second bias set to obtain a block floating point output result of the convolution layer;
the block floating point-floating point converter is used for converting the block floating point output result of the convolutional layer to obtain the floating point output result of the convolutional layer as an output characteristic diagram of the convolutional layer;
the convolutional layer accelerator comprises a plurality of processing engines; each processing engine includes a plurality of processing units;
each processing engine simultaneously performs convolution operation, and the convolution operation performed by the processing engine comprises multiple convolution operations performed by processing units in the processing engines;
when the plurality of processing units in the processing engine simultaneously perform a convolution operation, the plurality of processing units in the processing engine share a plurality of pixels which the processing engine acquires, through a convolution window, from the second input feature map corresponding to the processing engine, wherein the positions of the pixels acquired by the convolution window from the second input feature map corresponding to the processing engine are different for each convolution operation performed by the processing units;
the plurality of pixels in one convolution operation includes a first pixel group and a second pixel group acquired by the convolution window in two times,
the processing unit comprises a multiplier, a first accumulator, a second accumulator, a first register connected with the first accumulator and a second register connected with the second accumulator;
the processing unit performs convolution operation according to the plurality of pixels and convolution kernels corresponding to the processing unit to obtain convolution results corresponding to the processing unit, and the convolution results comprise:
the multiplier acquires a first pixel from the first pixel group and a second pixel from the second pixel group each time to form a third pixel group, and multiplies the third pixel group and weights corresponding to the first pixel and the second pixel in the convolution kernel to obtain a product;
the first pixel, an M-bit gap and the second pixel sequentially form the third pixel group;
the first accumulator accumulates the first 2M bit data of the product to obtain a first accumulation result corresponding to a first pixel group;
the first register is used for storing a first accumulation result obtained by the first accumulator each time;
the second accumulator accumulates the post-2M bit data of the product to obtain a second accumulation result corresponding to a second pixel group;
the second register is used for storing a second accumulation result obtained by the second accumulator each time;
and the first accumulation result and the second accumulation result form a convolution result corresponding to the processing unit.
2. The device according to claim 1, wherein
The convolutional layer accelerator performs convolutional multiply-add operation according to the second input feature map group, the second convolutional kernel group and the second bias set to obtain a block floating point output result of the convolutional layer, and includes:
each processing engine acquires a plurality of convolution kernels corresponding to the processing engine from the second convolution kernel group respectively;
each processing engine acquires a second input feature map corresponding to the processing engine from the second input feature map group;
each processing engine simultaneously performs convolution operation according to the second input feature map and the convolution kernel corresponding to the processing engine to obtain a plurality of convolution results;
and the convolutional layer accelerator performs accumulation operation and activation operation on the plurality of convolution results to obtain a block floating point output result of the convolutional layer.
3. The apparatus of claim 2,
each processing engine respectively acquires a plurality of convolution kernels corresponding to the processing engine from the second convolution kernel group, and the method comprises the following steps:
each processing unit in the processing engine respectively acquires a convolution kernel corresponding to each processing unit.
4. The apparatus of claim 2, wherein the convolutional layer accelerator further comprises a plurality of third accumulators, and an activation module corresponding to each third accumulator, each third accumulator coupled to one processing unit of the plurality of processing engines;
the convolutional layer accelerator performing the accumulation operation and the activation operation on the plurality of convolution results to obtain the block floating point output result of the convolutional layer comprises:
for each third accumulator, accumulating convolution results obtained by different processing units by using convolution kernels of the same output channel by the third accumulator to obtain a third accumulation result, and outputting the third accumulation result to an activation module corresponding to the third accumulator;
and for each activation module, performing activation operation on the third accumulation result obtained by the corresponding third accumulator by the activation module to obtain a block floating point output result of the convolutional layer.
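Continuing that analogue for claims 2 and 4, each third accumulator sums the convolution results that different processing units obtained with kernels of the same output channel, and the paired activation module then produces the block floating point output result of the layer. ReLU is used here purely as a placeholder, since the claim does not name a specific activation function.

```python
import numpy as np

def accumulate_and_activate(per_engine_results):
    """per_engine_results[e] holds the convolution results produced by the
    processing units of engine e, indexed by output channel.  A third
    accumulator sums over engines (i.e. over input feature maps) for each
    output channel, and the corresponding activation module then yields
    the block floating point output result of the convolutional layer."""
    summed = sum(per_engine_results)      # third accumulators: element-wise sum
    return np.maximum(summed, 0)          # activation modules (ReLU assumed)

# Three engines, four output channels, 6x6 output maps.
per_engine = [np.random.randint(-500, 500, (4, 6, 6)) for _ in range(3)]
layer_output = accumulate_and_activate(per_engine)
```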
5. The apparatus of claim 1, wherein the convolutional neural network acceleration means further comprises a storage module comprising a first memory comprising a first partition, a second partition, a third partition, and a fourth partition;
the first partition is used for storing the first input feature map group corresponding to the first convolutional layer;
the second partition is used for storing the first convolution kernel group and the first bias set corresponding to the odd-numbered convolutional layers;
the third partition is used for storing the output feature maps corresponding to the even-numbered convolutional layers other than the last layer;
the fourth partition is used for storing the output vector of the full-connection layer.
6. The apparatus of claim 5, wherein the storage module comprises a second memory comprising a fifth partition and a sixth partition;
the fifth partition is used for storing the first convolution kernel group and the first bias set corresponding to the even-numbered convolutional layers;
and the sixth partition is used for storing the output feature maps corresponding to the odd-numbered convolutional layers other than the last layer.
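Claims 5 and 6 describe a ping-pong arrangement: the first memory holds the weights of odd-numbered layers and the outputs of even-numbered layers, while the second memory holds the weights of even-numbered layers and the outputs of odd-numbered layers, so each layer reads its weights from one memory and writes its feature maps to the other. The sketch below only models this bank selection; the partition labels and the 1-based layer index are assumptions made for illustration.

```python
def select_banks(layer_index):
    """Return (weight_bank, output_bank) for a 1-based convolutional
    layer index, following the partitioning of claims 5 and 6."""
    if layer_index % 2 == 1:          # odd-numbered layer
        return "memory1.partition2", "memory2.partition6"
    else:                             # even-numbered layer
        return "memory2.partition5", "memory1.partition3"

for layer in range(1, 5):
    weights_from, outputs_to = select_banks(layer)
    print(f"layer {layer}: read weights from {weights_from}, "
          f"write feature maps to {outputs_to}")
```

Because consecutive layers alternate parity, the feature maps one layer writes are read back by the next layer from the other memory, so feature-map reads and writes within a layer never target the same memory.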
7. The apparatus of claim 5 or 6, wherein the convolutional neural network acceleration means further comprises:
the convolution layer input cache is connected with the floating point-block floating point converter, the block floating point-floating point converter and the convolution layer accelerator, and is used for storing the second input feature map group, the second convolution kernel group and the second bias set of the convolutional layer and sending them to the convolution layer accelerator;
the convolution layer output cache is connected with the block floating point-floating point converter, the convolution layer accelerator and the full-connection layer input cache, and is used for storing the block floating point output result, sending the block floating point output result of each convolutional layer other than the last layer to the block floating point-floating point converter, and sending the block floating point output result of the last convolutional layer to the full-connection layer input cache;
the full-connection layer input cache is connected with the full-connection layer accelerator and used for receiving and storing the block floating point output result of the last layer of convolution layer and sending the result to the full-connection layer accelerator;
the full-connection layer accelerator is connected with the full-connection layer output cache and used for performing full-connection operation according to the block floating point output result of the last convolution layer to obtain a block floating point final result and sending the block floating point final result to the full-connection layer output cache;
and the full-connection layer output cache is connected with the block floating point-floating point converter and is used for sending the block floating point final result to the block floating point-floating point converter, so that the block floating point-floating point converter converts the block floating point final result into a floating point final result.
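Claim 7 ties the caches, converters and accelerators into a pipeline. The sketch below strings those stages together in software order to make the data movement explicit; every function name (float_to_bfp, conv_layer, bfp_to_float, fc_layer) is a placeholder passed in by the caller, not an interface defined by the patent.

```python
def forward_pass(input_maps, conv_params, fc_params,
                 float_to_bfp, conv_layer, bfp_to_float, fc_layer):
    """Dataflow of claim 7: convolutional layers run in block floating
    point with conversions at layer boundaries; the last convolutional
    layer feeds the full-connection accelerator directly, and only the
    final result is converted back to floating point."""
    feature_maps = input_maps
    last = len(conv_params) - 1
    for i, params in enumerate(conv_params):
        conv_input = float_to_bfp(feature_maps, params)   # convolution layer input cache
        bfp_output = conv_layer(conv_input)                # convolution layer accelerator
        if i != last:
            feature_maps = bfp_to_float(bfp_output)        # intermediate result back to float
        else:
            fc_input = bfp_output                          # full-connection layer input cache
    bfp_final = fc_layer(fc_input, fc_params)              # full-connection layer accelerator
    return bfp_to_float(bfp_final)                         # floating point final result


# Smoke test with stand-in stages that simply pass data through.
identity = lambda x, *args: x
result = forward_pass([1.0, 2.0], conv_params=[None, None], fc_params=None,
                      float_to_bfp=lambda maps, p: maps,
                      conv_layer=identity,
                      bfp_to_float=identity,
                      fc_layer=lambda x, p: x)
```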
CN201810865157.9A 2018-08-01 2018-08-01 Convolutional neural network accelerator Active CN109063825B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810865157.9A CN109063825B (en) 2018-08-01 2018-08-01 Convolutional neural network accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810865157.9A CN109063825B (en) 2018-08-01 2018-08-01 Convolutional neural network accelerator

Publications (2)

Publication Number Publication Date
CN109063825A (en) 2018-12-21
CN109063825B (en) 2020-12-29

Family

ID=64832421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810865157.9A Active CN109063825B (en) 2018-08-01 2018-08-01 Convolutional neural network accelerator

Country Status (1)

Country Link
CN (1) CN109063825B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409509A (en) * 2018-12-24 2019-03-01 济南浪潮高新科技投资发展有限公司 A kind of data structure and accelerated method for the convolutional neural networks accelerator based on FPGA
CN109740733B (en) * 2018-12-27 2021-07-06 深圳云天励飞技术有限公司 Deep learning network model optimization method and device and related equipment
CN109697083B (en) * 2018-12-27 2021-07-06 深圳云天励飞技术有限公司 Fixed-point acceleration method and device for data, electronic equipment and storage medium
US20200210839A1 (en) * 2018-12-31 2020-07-02 Microsoft Technology Licensing, Llc Neural network activation compression with outlier block floating-point
CN109901814A (en) * 2019-02-14 2019-06-18 上海交通大学 Customized floating number and its calculation method and hardware configuration
CN110059817B (en) * 2019-04-17 2023-06-13 中山大学 Method for realizing low-resource consumption convolver
CN110059823A (en) * 2019-04-28 2019-07-26 中国科学技术大学 Deep neural network model compression method and device
CN110147252A (en) * 2019-04-28 2019-08-20 深兰科技(上海)有限公司 A kind of parallel calculating method and device of convolutional neural networks
CN112016693B (en) * 2019-05-30 2021-06-04 中兴通讯股份有限公司 Machine learning engine implementation method and device, terminal equipment and storage medium
CN110442323B (en) * 2019-08-09 2023-06-23 复旦大学 Device and method for performing floating point number or fixed point number multiply-add operation
CN110930290B (en) * 2019-11-13 2023-07-07 东软睿驰汽车技术(沈阳)有限公司 Data processing method and device
CN111047010A (en) * 2019-11-25 2020-04-21 天津大学 Method and device for reducing first-layer convolution calculation delay of CNN accelerator
CN111091183B (en) * 2019-12-17 2023-06-13 深圳鲲云信息科技有限公司 Neural network acceleration system and method
CN111178508B (en) * 2019-12-27 2024-04-05 珠海亿智电子科技有限公司 Computing device and method for executing full connection layer in convolutional neural network
CN111738427B (en) * 2020-08-14 2020-12-29 电子科技大学 Operation circuit of neural network
WO2022041188A1 (en) * 2020-08-31 2022-03-03 深圳市大疆创新科技有限公司 Accelerator for neural network, acceleration method and device, and computer storage medium
CN112232499B (en) * 2020-10-13 2022-12-23 华中光电技术研究所(中国船舶重工集团公司第七一七研究所) Convolutional neural network accelerator
CN112734020B (en) * 2020-12-28 2022-03-25 中国电子科技集团公司第十五研究所 Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network
CN113554163B (en) * 2021-07-27 2024-03-29 深圳思谋信息科技有限公司 Convolutional neural network accelerator
CN113780523B (en) * 2021-08-27 2024-03-29 深圳云天励飞技术股份有限公司 Image processing method, device, terminal equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239829A (en) * 2016-08-12 2017-10-10 北京深鉴科技有限公司 A kind of method of optimized artificial neural network
CN108133270A (en) * 2018-01-12 2018-06-08 清华大学 Convolutional neural networks accelerating method and device
CN108229670A (en) * 2018-01-05 2018-06-29 中国科学技术大学苏州研究院 Deep neural network based on FPGA accelerates platform

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239829A (en) * 2016-08-12 2017-10-10 北京深鉴科技有限公司 A kind of method of optimized artificial neural network
CN108229670A (en) * 2018-01-05 2018-06-29 中国科学技术大学苏州研究院 Deep neural network based on FPGA accelerates platform
CN108133270A (en) * 2018-01-12 2018-06-08 清华大学 Convolutional neural networks accelerating method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Chunsheng Mei et al.; "A 200MHz 202.4GFLOPS@10.8W VGG16 accelerator in Xilinx VX690T"; 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP); 2018-03-08; full text *
Zhourui Song et al.; "Computation Error Analysis of Block Floating Point Arithmetic Oriented Convolution Neural Network Accelerator Design"; arXiv:1709.07776v2; 2017-11-24; pp. 1-8 *
Mario Drumond et al.; "End-to-End DNN Training with Block Floating Point Arithmetic"; arXiv:1804.01526v2; 2018-04-09; full text *
Zhourui Song et al.; "Computation Error Analysis of Block Floating Point Arithmetic Oriented Convolution Neural Network Accelerator Design"; arXiv:1709.07776v2; 2017 *

Also Published As

Publication number Publication date
CN109063825A (en) 2018-12-21

Similar Documents

Publication Publication Date Title
CN109063825B (en) Convolutional neural network accelerator
US10096134B2 (en) Data compaction and memory bandwidth reduction for sparse neural networks
US20220027717A1 (en) Convolutional Neural Network Hardware Configuration
CN110070178B (en) Convolutional neural network computing device and method
CN108564168B (en) Design method for neural network processor supporting multi-precision convolution
CN110705703B (en) Sparse neural network processor based on systolic array
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN112434801B (en) Convolution operation acceleration method for carrying out weight splitting according to bit precision
CN112668708B (en) Convolution operation device for improving data utilization rate
WO2022134465A1 (en) Sparse data processing method for accelerating operation of re-configurable processor, and device
CN109086879B (en) Method for realizing dense connection neural network based on FPGA
WO2022041188A1 (en) Accelerator for neural network, acceleration method and device, and computer storage medium
CN115238863A (en) Hardware acceleration method, system and application of convolutional neural network convolutional layer
CN109993293B (en) Deep learning accelerator suitable for heap hourglass network
CN114003201A (en) Matrix transformation method and device and convolutional neural network accelerator
US20210044303A1 (en) Neural network acceleration device and method
CN110766136B (en) Compression method of sparse matrix and vector
US20230259743A1 (en) Neural network accelerator with configurable pooling processing unit
Zhan et al. Field programmable gate array‐based all‐layer accelerator with quantization neural networks for sustainable cyber‐physical systems
CN112639839A (en) Arithmetic device of neural network and control method thereof
CN114154621A (en) Convolutional neural network image processing method and device based on FPGA
Solovyev et al. Real-Time Recognition of Handwritten Digits in FPGA Based on Neural Network with Fixed Point Calculations
CN112001492A (en) Mixed flow type acceleration framework and acceleration method for binary weight Densenet model
US20230177320A1 (en) Neural network accelerator with a configurable pipeline
US20240004719A1 (en) Just-In-Time Re-Partitioning of Feature Maps for Efficient Balancing of Compute Core Workloads

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant