CN110852416A - CNN accelerated computing method and system based on low-precision floating-point data expression form - Google Patents

CNN accelerated computing method and system based on low-precision floating-point data expression form

Info

Publication number
CN110852416A
Authority
CN
China
Prior art keywords
low-precision floating point number
calculation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910940659.8A
Other languages
Chinese (zh)
Other versions
CN110852416B (en)
Inventor
吴晨
王铭宇
徐世平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Biong Core Technology Co ltd
Original Assignee
Chengdu Star Innovation Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Star Innovation Technology Co ltd filed Critical Chengdu Star Innovation Technology Co ltd
Priority to CN201910940659.8A priority Critical patent/CN110852416B/en
Publication of CN110852416A publication Critical patent/CN110852416A/en
Application granted granted Critical
Publication of CN110852416B publication Critical patent/CN110852416B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/57Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The invention discloses a CNN accelerated calculation method and a CNN accelerated calculation system based on a low-precision floating-point data representation form, and relates to the field of CNN accelerated calculation. The accelerated calculation method comprises the following steps: the floating-point number functional module receives input activation values and weights from the storage system according to a control signal, and distributes them to different processing units PE for convolution calculation to complete the CNN accelerated calculation; the convolution calculation comprises forward calculation of the convolutional layers, completed by performing dot-product calculation on MaEb floating point numbers quantized through the low-precision floating-point representation form. By using the low-precision floating-point representation form MaEb, the invention preserves the accuracy of the quantized CNN without retraining; by performing low-precision floating-point multiplication, Nm MaEb floating-point multipliers are realized through DSP, which greatly improves the acceleration performance of a customized circuit or a non-customized circuit while ensuring accuracy, wherein the customized circuit is an ASIC (application-specific integrated circuit) or an SOC (system on chip), and the non-customized circuit comprises an FPGA (field-programmable gate array).

Description

CNN accelerated computing method and system based on low-precision floating-point data expression form
Technical Field
The invention relates to the field of deep convolutional neural network quantization, in particular to a CNN (convolutional neural network) accelerated calculation method and system based on a low-precision floating point data representation form.
Background
In recent years, the application of AI (Artificial Intelligence) has penetrated many fields such as face recognition, game playing, image processing and simulation. Although processing accuracy has improved, a neural network contains many layers and a large number of parameters, so it requires very large computational cost and storage space. In this regard, technicians have proposed neural network compression schemes, that is, the parameters or storage space of the network are reduced by changing the network structure or by using quantization and approximation methods, reducing network cost and storage space without greatly affecting the performance of the neural network.
Prior art patent number: CN109740737A, patent name: convolutional neural network quantization processing method, apparatus and computer device. The method comprises the following steps: acquiring the maximum weight and the maximum deviation of each convolutional layer in the convolutional neural network; calculating a first dynamic bit-precision value for the maximum weight and a second dynamic bit-precision value for the maximum deviation, the first dynamic bit-precision value being different from the second dynamic bit-precision value; quantizing the weights and deviations of the corresponding convolutional layer using the first and second dynamic bit-precision values of each convolutional layer; and obtaining the convolution result of the convolutional neural network based on the quantized weights and deviations in each convolutional layer. This scheme adopts a double-precision quantization processing method to improve the accuracy after quantization: specifically, the maximum weight and maximum deviation of a convolutional layer are obtained, the dynamic bit-precision values of the maximum weight and the maximum deviation are calculated respectively, and convolution calculation is then realized using the two dynamic bit-precision values.
Although the prior art improves quantization accuracy, several limitations remain: 1) for deep convolutional neural networks (more than 100 convolutional/fully-connected layers), retraining is required after quantization to ensure accuracy; 2) quantization requires 16-bit floating point numbers or 8-bit fixed point numbers to ensure accuracy; 3) without retraining and while ensuring accuracy, the prior art can realize at most two multiplication operations in one DSP, resulting in low acceleration performance on an FPGA.
Therefore, a CNN accelerated computation method and system based on a low-precision floating-point data representation form are needed that overcome the above problems, find the optimal data representation form MaEb without retraining, and realize Nm MaEb floating-point multipliers through DSP, thereby ensuring the accuracy of the quantized convolutional neural network and improving the acceleration performance of a customized circuit or a non-customized circuit.
Disclosure of Invention
The invention aims to: the invention provides a CNN (convolutional neural network) accelerated calculation method and system based on a low-precision floating-point data representation form, which uses the low-precision floating-point representation form MaEb to ensure the accuracy of the quantized convolutional neural network without retraining, performs low-precision floating-point multiplication on the MaEb floating point numbers, and realizes Nm MaEb floating-point multipliers through a DSP (digital signal processor), thereby improving the acceleration performance of a customized circuit or a non-customized circuit.
The technical scheme adopted by the invention is as follows:
the CNN accelerated calculation method based on the low-precision floating point data expression form comprises the following steps:
the central control module generates a control signal to arbitrate the floating-point number functional module and the storage system;
the floating-point number functional module receives an input activation value and a weight from a storage system according to a control signal, and distributes the input activation value and the weight to different processing units PE to carry out convolution calculation of each convolution layer, so that CNN accelerated calculation is completed;
the convolution calculation includes forward calculation of convolution layers completed by performing dot product calculation on MaEb floating point numbers quantized through low-precision floating point number representation forms, wherein a and b are positive integers.
Preferably, the forward calculation of the convolution layer completed by the dot product calculation of the MaEb floating point number quantized by the low-precision floating point number representation form includes the following steps:
step a: quantizing input data of the single-precision floating point number into a floating point number of MaEb in a low-precision floating point number expression form, wherein the input data comprises an input activation value, a weight and a bias, and a + b is more than 0 and less than or equal to 31;
step b: distributing the MaEb floating point numbers to the Nm parallel low-precision floating-point multipliers in the floating-point number functional module for forward computation to obtain full-precision floating-point products, wherein Nm represents the number of low-precision floating-point multipliers of one processing unit PE in the floating-point number functional module;
step c: transmitting the full-precision floating point number product to a data conversion module to obtain a fixed point number result without precision loss;
step d: after distributing the fixed-point results to 4T parallel fixed-point addition trees, sequentially accumulating, pooling and activating the addition-tree results together with the bias in the input data through a post-processing unit to finish the calculation of the convolutional layer, wherein T is a positive integer.
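The per-branch data path of steps a to d can be pictured with the following minimal sketch, written in ordinary Python rather than hardware: it emulates one branch of a processing unit PE (low-precision multiplications, conversion to fixed point, a fixed-point addition tree, then accumulation of the bias and activation). The function names (pe_branch, to_fixed, adder_tree, post_process) and the fraction-bit count are hypothetical and only illustrate the order of operations, not the patented circuit.

```python
# Illustrative sketch (not the patented hardware): one PE branch of the
# pipeline described in steps a-d, emulated with ordinary Python arithmetic.
# Names (pe_branch, to_fixed, adder_tree, post_process) are hypothetical.

def to_fixed(x, frac_bits=18):
    """Convert a full-precision product to a fixed-point integer (step c)."""
    return int(round(x * (1 << frac_bits)))

def adder_tree(values):
    """Fixed-point addition tree (step d): sum without precision loss."""
    return sum(values)

def post_process(acc, bias, frac_bits=18, relu=True):
    """Accumulate the bias, then apply the activation (pooling omitted)."""
    result = acc / (1 << frac_bits) + bias
    return max(result, 0.0) if relu else result

def pe_branch(activations, weights, bias):
    # Step b: parallel low-precision multiplications (emulated as float mults)
    products = [a * w for a, w in zip(activations, weights)]
    # Step c: convert each product to a fixed-point number
    fixed = [to_fixed(p) for p in products]
    # Step d: adder tree, then accumulate bias / pool / activate
    return post_process(adder_tree(fixed), bias)

print(pe_branch([0.5, -0.25, 1.0], [0.125, 0.5, -0.75], 0.1))
```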
Preferably, the steps a, b and c comprise the following steps:
the original picture and the weights are quantized into MaEb floating point numbers through the low-precision floating-point representation form, the bias is quantized into a 16-bit fixed point number, and the quantized original picture, weights and bias are input into the network and stored in an external memory;
after the quantized picture and weights are subjected to low-precision floating-point multiplication to obtain a (2a + b + 4)-bit floating point number, the (2a + b + 4)-bit floating point number is converted into a (2a + 2^(b+1) − 1)-bit fixed point number and then accumulated, and the accumulation result is added to the 16-bit fixed point number of the quantized bias to obtain a 32-bit fixed point number;
and converting the 32-bit fixed point number into a MaEb floating point number as the input of the next layer of the network, and storing the MaEb floating point number into an external memory.
Preferably, quantizing the original picture and the weights into MaEb floating point numbers includes the following steps:
defining a low precision floating point number representation MaEb of the network, the low precision floating point number representation comprising a sign bit, a mantissa, and an exponent;
in the process of optimizing the low-precision floating-point representation form, simultaneously changing the combination of the scale factor, a and b and calculating the mean square error of the weights and activation values of each layer of the network before and after quantization, and obtaining the optimal low-precision floating-point representation form and the optimal scale factor under that representation form according to the minimum value of the mean square error of the weights and activation values before and after quantization;
based on the low-precision floating point number representation form and the optimal scale factor, the single-precision floating point number of the original picture and the weight is quantized into a floating point number represented by a low-precision floating point number representation form MaEb;
when a is 4 or 5, the network quantized in the low-precision floating-point number representation is the optimal result.
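A minimal sketch of the quantization step described above is given below: it maps a single-precision value into the MaEb dynamic range after applying the scale factor 2^sf, rounds the mantissa to a bits and saturates values outside the range. It assumes normalized numbers only and an exponent bias of 2^(b−1) − 1, which is one plausible reading of the representation form rather than the patented procedure; the helper name quantize_maeb is hypothetical.

```python
import math

def quantize_maeb(x, a, b, sf):
    """Quantize a single-precision value to MaEb (sign bit, a-bit mantissa,
    b-bit exponent).  Assumes normalized numbers only and an exponent bias
    of 2**(b-1) - 1; this is an illustrative reading of the format, not the
    patented procedure."""
    eb = 2 ** (b - 1) - 1
    v = x * 2 ** sf                                # apply the scale factor
    if v == 0:
        return 0.0
    sign = -1.0 if v < 0 else 1.0
    max_mag = (2 - 2 ** -a) * 2 ** (2 ** b - 1 - eb)
    min_mag = 2 ** -eb
    mag = min(max(abs(v), min_mag), max_mag)       # clamp to the dynamic range
    e = math.floor(math.log2(mag))                 # unbiased exponent
    mant = round(mag / 2 ** e * 2 ** a) / 2 ** a   # round mantissa to a bits
    if mant == 2.0:                                # rounding carried into the exponent
        mant, e = 1.0, e + 1
    return sign * min(mant * 2 ** e, max_mag)

# Example: a weight of 0.3712 with a = 4, b = 3 (M4E3) and scale factor 0
print(quantize_maeb(0.3712, 4, 3, 0))              # 0.375
```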
Preferably, performing the low-precision floating point number multiplication operation on the MaEb floating point numbers includes the following steps:
the MaEb floating point number multiplication is split into an a-bit multiplier-adder and a b-bit adder, and the calculation formula is as follows:
X × Y = (−1)^(Sx ⊕ Sy) × [0.Mx × 0.My + (1.Mx + 0.My)] × 2^(Ex + Ey − 2Eb)
wherein Mx, My, Ex and Ey denote the mantissas and exponents of X and Y respectively, and Sx and Sy denote their sign bits; the term 0.Mx × 0.My + (1.Mx + 0.My) is realized by an a-bit unsigned fixed-point multiplier-adder, and the term Ex + Ey can be realized by a b-bit unsigned fixed-point adder;
based on the multiplier-adder P = A × B + C implemented by the DSP, blank bits are added at the input ports to implement several a-bit multiplier-adders, where A, B and C denote the three input ports of the DSP.
Preferably, the maximum bit widths of A, B and C are 25, 18 and 48, respectively.
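To make the decomposition concrete, the sketch below multiplies two MaEb numbers on their raw fields: the exponents go through a b-bit addition and the mantissas through the a-bit multiply-add identity 1.Mx × 1.My = 0.Mx × 0.My + (1.Mx + 0.My), computed on integers. Bias handling and normalization of the result are deferred, as the description suggests; the function name maeb_mul is hypothetical.

```python
A_BITS, B_BITS = 4, 3          # MaEb with a = 4, b = 3 (M4E3)

def maeb_mul(sx, mx, ex, sy, my, ey):
    """Multiply two MaEb numbers given as raw fields (sign, a-bit mantissa,
    b-bit exponent).  Illustrative only: the exponent bias and normalization
    of the result are deferred to a later step, as in the description."""
    assert mx < (1 << A_BITS) and my < (1 << A_BITS)
    assert ex < (1 << B_BITS) and ey < (1 << B_BITS)
    e_sum = ex + ey                     # the b-bit adder: exponents are added
    # the a-bit multiplier-adder: 1.Mx * 1.My = 0.Mx*0.My + (1.Mx + 0.My),
    # computed on integers scaled by 2**(2*A_BITS)
    mant = mx * my + (((1 << A_BITS) + mx + my) << A_BITS)
    sign = sx ^ sy                      # the sign bit is an XOR
    return sign, mant, e_sum            # full-precision (unnormalized) product

# 1.5 * 1.25 = 1.875 -> mantissa fields 8 (binary 1000) and 4 (binary 0100)
s, m, e = maeb_mul(0, 8, 3, 0, 4, 3)
print(m / (1 << 2 * A_BITS))            # 1.875
```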
A system comprises a customized circuit or a non-customized circuit, wherein the customized circuit or the non-customized circuit comprises a floating-point number functional module, the floating-point number functional module is used for receiving input activation values and weights from a storage system according to a control signal, distributing them to different processing units PE, and computing in parallel the convolution of data quantized into MaEb floating point numbers through the low-precision floating-point representation form, wherein a and b are positive integers;
the storage system is used for caching the input characteristic diagram, the weight and the output characteristic diagram;
the central control module is used for arbitrating the floating point number functional module and the storage system after decoding the instruction into a control signal;
the floating-point number functional module comprises N parallel processing units PE, and each processing unit PE realizes Nm MaEb floating-point multipliers through DSP, wherein N is a positive integer and Nm indicates the number of low-precision floating-point multipliers of one processing unit PE in the floating-point number functional module.
Preferably, each processing element PE comprises 4T parallel branches, each of which contains Nm/(4T) multipliers (Nm being an integer multiple of 4T); the multipliers, the data conversion module, the fixed-point addition tree and the post-processing unit are connected in sequence, wherein T is a positive integer.
Preferably, the storage system comprises an input feature map caching module IFMB with a ping-pong architecture, a weight caching module WB and an output feature map caching module OFMB.
Preferably, the post-processing unit comprises an accumulator, a pooling layer and an activation function connected in sequence.
Preferably, a and b satisfy 0 < a + b ≦ 31, and when a is 4 or 5, the network quantized in the low precision floating point number representation is the optimal result.
Preferably, the custom circuit comprises an ASIC or SOC and the off-the-shelf circuit comprises an FPGA.
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
1. the invention uses the low-precision floating-point representation form MaEb and can find the optimal data representation form without retraining; only a 4-bit or 5-bit mantissa is needed, the top-1/top-5 accuracy loss is negligible, and the top-1/top-5 accuracy reductions are within 0.5%/0.3% respectively;
2. the invention realizes 8-bit low-precision floating-point multiplication using a 4-bit multiplier-adder and a 3-bit adder, and realizes 4 such low-precision floating-point multiplications in one DSP, which is equivalent to realizing the multiplications of four convolution operations in one DSP; compared with the prior art, in which one DSP can realize at most two multiplications, the invention greatly improves the acceleration performance on a customized circuit or a non-customized circuit while ensuring accuracy, wherein the customized circuit includes an ASIC (application-specific integrated circuit) or an SOC (system on chip), and the non-customized circuit includes an FPGA (field-programmable gate array);
3. compared with an Intel i9 CPU, the throughput of the invention is improved by 64.5 times, and compared with the existing FPGA accelerator, the throughput of the invention is improved by 1.5 times; for VGG16 and a YOLO convolutional neural network, compared with the existing six FPGA accelerators, the throughput is respectively improved by 3.5 times and 27.5 times, and the throughput of a single DSP is respectively improved by 4.1 times and 5 times;
4. the data representation form of the invention can also be applied to ASICs; in ASIC design, the number of standard cells needed is less than that of an 8-bit fixed-point multiplier;
5. when the forward calculation of the convolutional layer is carried out based on the quantization method, converting the fixed-point accumulation result into a floating point number saves storage resources, and converting floating-point accumulation into fixed-point accumulation saves a large amount of customized-circuit or non-customized-circuit resources, thereby improving the throughput of the customized circuit or the non-customized circuit.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a block diagram of the system of the present invention;
FIG. 2 is a flow chart of a quantization method of the present invention;
FIG. 3 is a schematic diagram of the forward computational data flow of the quantized convolutional neural network of the present invention;
FIG. 4 is a schematic diagram of a full pipeline architecture of the floating-point function module of the present invention;
FIG. 5 is a schematic diagram of the convolution calculation of the present invention;
FIG. 6 is a diagram of the input form of the DSP port according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
It is noted that relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The features and properties of the present invention are described in further detail below with reference to examples.
Example 1
The embodiment provides a CNN accelerated calculation method and system based on a low-precision floating-point data representation form, wherein the low-precision floating-point representation form MaEb is used, the accuracy of the quantized convolutional neural network is guaranteed without retraining, low-precision floating-point multiplication is performed on the MaEb floating point numbers, and Nm MaEb floating-point multipliers are realized through DSP, improving the acceleration performance of a customized circuit or a non-customized circuit. The specific steps are as follows:
the CNN accelerated calculation method based on the low-precision floating point data expression form comprises the following steps:
the central control module generates a control signal to arbitrate the floating-point number functional module and the storage system;
the floating-point number functional module receives an input activation value and a weight from a storage system according to a control signal, and distributes the input activation value and the weight to different processing units PE to carry out convolution calculation of each convolution layer, so that CNN accelerated calculation is completed;
the convolution calculation includes forward calculation of convolution layers completed by performing dot product calculation on MaEb floating point numbers quantized through low-precision floating point number representation forms, wherein a and b are positive integers.
As shown in fig. 4, the forward calculation of the convolution layer completed by the dot product calculation through the MaEb floating point number quantized by the low-precision floating point number representation includes the steps of:
step a: quantizing input data of the single-precision floating point number into a floating point number of MaEb in a low-precision floating point number expression form, wherein the input data comprise an input activation value, weight and bias, and a + b is more than 0 and less than or equal to 31;
step b: distributing the MaEb floating point numbers to the Nm parallel low-precision floating-point multipliers in the floating-point number functional module for forward computation to obtain full-precision floating-point products, wherein Nm represents the number of low-precision floating-point multipliers of one processing unit PE in the floating-point number functional module;
step c: transmitting the full-precision floating point number product to a data conversion module to obtain a fixed point number result without precision loss;
step d: after distributing the fixed-point results to 4T parallel fixed-point addition trees, sequentially accumulating, pooling and activating the addition-tree results together with the bias in the input data through a post-processing unit to finish the calculation of the convolutional layer, wherein T is a positive integer.
The steps a, b and c comprise the following steps:
as shown in fig. 3, the original picture and the weight are quantized into a floating point number of MaEb in a low precision floating point number representation form, the bias is quantized into a fixed point number of 16 bits, and the quantized original picture, the weight and the bias are input into the network and stored in an external memory, wherein a + b is more than 0 and less than or equal to 31, and a and b are positive integers;
after the quantized picture and weights are subjected to low-precision floating-point multiplication to obtain a (2a + b + 4)-bit floating point number, the (2a + b + 4)-bit floating point number is converted into a (2a + 2^(b+1) − 1)-bit fixed point number and then accumulated, and the accumulation result is added to the 16-bit fixed point number of the quantized bias to obtain a 32-bit fixed point number;
and converting the 32-bit fixed point number into a MaEb floating point number as the input of the next layer of the network, and storing the MaEb floating point number into an external memory.
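As a quick sanity check of the bit-width formulas above, the following lines evaluate them for the M4E3 case (a = 4, b = 3); the results match the 15-bit product and the 23-bit fixed point number used in the data flow described later.

```python
# Quick check of the bit-width formulas above for the M4E3 case (a = 4, b = 3).
a, b = 4, 3
product_bits = 2 * a + b + 4             # width of the low-precision product
fixed_bits = 2 * a + 2 ** (b + 1) - 1    # width after conversion to fixed point
print(product_bits, fixed_bits)          # 15 23 -> the M10E4 product and the
                                         # 23-bit fixed point number of fig. 3
```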
As shown in fig. 2, quantizing the original picture and the weights into MaEb floating point numbers includes the following steps:
defining a low-precision floating point number representation MaEb of the network, wherein the low-precision floating point number representation comprises a sign bit, a mantissa and an exponent, a + b is more than 0 and less than or equal to 31, and a and b are positive integers;
in the process of optimizing the low-precision floating-point representation form, simultaneously changing the combination of the scale factor, a and b and calculating the mean square error of the weights and activation values of each layer of the network before and after quantization, and obtaining the optimal low-precision floating-point representation form and the optimal scale factor under that representation form according to the minimum value of the mean square error of the weights and activation values before and after quantization;
based on the low-precision floating point number representation form and the optimal scale factor, the single-precision floating point number of the original picture and the weight is quantized into a floating point number represented by a low-precision floating point number representation form MaEb;
when a is 4 or 5, the network quantized in the low-precision floating-point number representation is the optimal result.
The low-precision floating point number multiplication operation on the MaEb floating point numbers comprises the following steps:
the MaEb floating point number multiplication is split into an a-bit multiplier-adder and a b-bit adder, and the calculation formula is as follows:
X × Y = (−1)^(Sx ⊕ Sy) × [0.Mx × 0.My + (1.Mx + 0.My)] × 2^(Ex + Ey − 2Eb)
wherein Mx, My, Ex and Ey denote the mantissas and exponents of X and Y respectively, and Sx and Sy denote their sign bits; the term 0.Mx × 0.My + (1.Mx + 0.My) is realized by an a-bit unsigned fixed-point multiplier-adder, and the term Ex + Ey can be realized by a b-bit unsigned fixed-point adder;
based on the multiplier-adder P = A × B + C implemented by the DSP, blank bits are added at the input ports to implement several a-bit multiplier-adders, where A, B and C denote the three input ports of the DSP.
The maximum bit widths of A, B and C are 25, 18 and 48, respectively.
Quantification details:
defining a low-precision floating point number representation MaEb of the network, wherein the low-precision floating point number representation comprises a sign bit, a mantissa and an exponent, and a and b are positive integers;
in the process of optimizing the low-precision floating-point representation form, simultaneously changing the combination of the scale factor, a and b and calculating the mean square error of the weights and activation values of each layer of the network before and after quantization, and obtaining the optimal low-precision floating-point representation form and the optimal scale factor under that representation form according to the minimum value of the mean square error of the weights and activation values before and after quantization;
and based on the low-precision floating point number representation form and the optimal scale factor, the single-precision floating point number is quantized into the low-precision floating point number.
The decimal value of the low-precision floating-point representation form defined above is calculated as follows:
Vdec = (−1)^S × 1.M × 2^(E − Eb)
wherein Vdec denotes the decimal value of the low-precision floating-point representation form; S, M and E denote the sign bit, mantissa and exponent respectively, all being unsigned values; Eb denotes the bias of the exponent, which is used to introduce positive and negative values for the exponent and is expressed as:
Eb = 2^(DWE − 1) − 1
wherein DWE denotes the bit width of the exponent; the bit widths of the mantissa and the exponent are not fixed.
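A small sketch of the decoding rule just given: it turns raw (S, M, E) fields back into a decimal value, using the exponent bias 2^(b−1) − 1 assumed above for MaEb (where DWE = b); the helper name maeb_to_decimal is hypothetical.

```python
def maeb_to_decimal(s, m, e, a, b):
    """Decode raw MaEb fields to a decimal value:
    Vdec = (-1)**S * 1.M * 2**(E - Eb), with Eb = 2**(b-1) - 1.
    The bias expression is an assumption consistent with the description."""
    eb = 2 ** (b - 1) - 1
    mantissa = 1 + m / 2 ** a          # implicit leading one (normalized number)
    return (-1) ** s * mantissa * 2 ** (e - eb)

print(maeb_to_decimal(0, 8, 3, 4, 3))  # 1.5 * 2**0 = 1.5
```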
The optimization in the quantization comprises the following steps:
step aa: mapping the single-precision floating point number multiplied by a scale factor into the dynamic range that the low-precision floating point number can represent, rounding the mapped number to the nearest low-precision floating point number, and keeping data beyond the dynamic range at the maximum or minimum value; the calculation formula is as follows:
Vlfp = quan(Vfp32 × 2^sf, MINlfp, MAXlfp)
quan(x, MIN, MAX) = MAX if x > MAX; MIN if x < MIN; round(x) otherwise
wherein Vlfp and Vfp32 denote the decimal values expressed in the low-precision and single-precision floating-point forms respectively, MINlfp and MAXlfp denote the minimum and maximum values the low-precision floating point number can represent, sf denotes the scale factor, quan(x, MIN, MAX) denotes quantizing any floating point number x into the range MIN to MAX, and round(x) denotes rounding any floating point number x;
step bb: calculating the mean square error (MSE) of the weights and activation values before and after quantization, which represents the quantization error:
MSE = (1/N) × Σ_{i=1..N} (xi − x̂i)²
wherein xi and x̂i denote a weight or activation value before and after quantization, and N denotes the number of weights and activation values;
step cc: changing the scale factor, and repeating the steps aa and bb;
step dd: changing the representation form of the low-precision floating point number, namely the combination of a and b in MaEb, and repeating the steps aa, bb and cc;
step ee: and taking the low-precision representation form and the scale factor corresponding to the minimum value of the mean square error of the weight and the activation value as an optimal result.
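Steps aa to ee amount to a brute-force search over the representation form and the scale factor. The sketch below illustrates that loop for a fixed total bit budget: for every (a, b, sf) combination it quantizes the data, measures the mean square error, and keeps the combination with the smallest error. It reuses the hypothetical quantize_maeb helper from the earlier sketch, and the bit budget and scale-factor range are assumptions.

```python
def mse(original, quantized, sf):
    """Mean square error before/after quantization (step bb); quantized
    values are mapped back through the scale factor before comparing."""
    return sum((o - q / 2 ** sf) ** 2
               for o, q in zip(original, quantized)) / len(original)

def search_best_format(values, total_bits=8, sf_range=range(-8, 9)):
    """Steps aa-ee: try every (a, b, sf) combination and keep the one with
    the smallest MSE.  Requires quantize_maeb from the earlier sketch."""
    best = None
    for a in range(1, total_bits):
        b = total_bits - 1 - a                 # one bit is reserved for the sign
        if b < 1:
            continue
        for sf in sf_range:
            q = [quantize_maeb(v, a, b, sf) for v in values]
            err = mse(values, q, sf)
            if best is None or err < best[0]:
                best = (err, a, b, sf)
    return best                                # (mse, a, b, sf)

print(search_best_format([0.37, -0.05, 0.91, 0.002, -0.48]))
```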
As shown in fig. 2, for each convolutional neural network an optimal low-precision floating-point data representation form (a bit-width combination of mantissa and exponent) is found, thereby ensuring that the quantization error is minimal. In the quantization process of the CNN, each layer can be quantized or left unquantized, and the low-precision floating-point representation form may differ from layer to layer; a and b only need to satisfy 0 < a + b ≤ 31. Specifically, in the process of optimizing the low-precision floating-point representation form for each convolutional neural network to be quantized (the optimization can use traversal or other search methods), the optimal scale factor under that representation form is searched for the weights and activation values of each layer, ensuring that the mean square error of the weights and activation values before and after quantization is minimal. The reason the quantization method of the application ensures accuracy without retraining is as follows: a convolutional neural network before quantization has an accuracy result of its own, and this result is usually taken as the standard value; the aim is to quantize the convolutional neural network on the premise of preserving this standard accuracy. The weights and activation values of the network before quantization are close to non-uniform distributions such as the Gaussian or gamma distribution, that is, the values are concentrated in a certain range and the probability of values appearing outside that range is small. Quantizing the weights and activation values means approximating the original data with lower-precision numbers; low-precision floating point numbers are well suited for this because they can represent more values near zero and fewer values towards the two sides, that is, their characteristic is close to the distribution of the weights and activation values before quantization. Comparing the data before and after quantization, the closer the quantized data is to the data before quantization, the smaller the accuracy loss caused by the quantized network. The mean square error expresses the difference between the data before and after quantization, and the smaller the mean square error, the closer the quantized data is to the original data. Therefore, minimizing the mean square error ensures that the accuracy loss is minimal, so that retraining is not needed. The optimal data representation form can be found through this quantization method; only a 4-bit or 5-bit mantissa is needed, the top-1/top-5 accuracy loss is negligible, and the top-1/top-5 accuracy reductions are within 0.5%/0.3% respectively.
The forward computation data flow of the quantized neural network is shown in fig. 3. To explain the data flow clearly, the data bit width of each step is listed using the low-precision floating-point representation form M4E3 of fig. 3 as an example, that is, a is 4 and b is 3; all input pictures, weights and biases are initially represented by single-precision floating point numbers. First, the original picture and the weights are quantized in the M4E3 data representation form, while the bias is quantized to a 16-bit fixed point number in order to reduce quantization error, and the quantized input picture, weights and bias are stored in an external memory. Next, low-precision floating-point multiplication is performed on the quantized picture and weights, and the product is stored as a 15-bit floating point number M10E4. The 15-bit floating-point product is then converted into a 23-bit fixed point number and accumulated together with the 16-bit fixed point number of the quantized bias, and the final accumulation result is stored as a 32-bit fixed point number. These operations have two advantages: 1. there is no precision loss in the whole process, which guarantees the accuracy of the final inference result; 2. converting floating-point accumulation into fixed-point accumulation saves a large amount of resources of a customized circuit (such as an ASIC/SOC) or a non-customized circuit (such as an FPGA), thereby improving the throughput of the customized or non-customized implementation. Finally, before being used by the next CNN layer, the final output result is converted back into an M4E3 floating point number and stored in the external memory, which saves storage space. Only this last data conversion step in the whole data flow causes a reduction of bit width and a loss of precision; this precision loss does not affect the final accuracy, which can be verified experimentally.
The multipliers in each PE are designed for low-precision floating point numbers. According to the low-precision floating-point representation form, the multiplication of two low-precision floating point numbers can be divided into three parts: 1) XOR of the sign bits; 2) multiplication of the mantissas; 3) addition of the exponents. Take the form MaEb as an example: an a-bit unsigned multiplier-adder and a b-bit unsigned adder are needed to implement the multiplication of the two numbers. Although the mantissa multiplication should use an (a+1)-bit multiplier once the leading hidden bit is considered (the hidden bit is 1 for a normalized number and 0 for a denormalized number), the application designs it as an a-bit multiplier-adder in order to improve DSP efficiency. Meanwhile, the exponent bias is not included in the adder, because in the embodiment of the application all data representation forms are the same and the exponent bias is therefore also the same, so it can be handled in the last step, which simplifies the design of the adder.
As shown in fig. 5, in the convolution calculation process, each pixel point of the output channel is calculated by the following formula:
y = Σ_{ic=1..IC} Σ_{kh=1..KH} Σ_{kw=1..KW} x(ic, kh, kw) × w(ic, kh, kw) + b
where IC denotes the number of input channels, KW and KH denote the width and height of the convolution kernel, and x, y, w and b denote the input activation value, the output activation value, the weight and the bias, respectively. Since 4 low-precision floating-point multiplications are implemented in one DSP and calculated as (a + b) × (c + d) = ac + bc + ad + bd, each PE is designed to compute two output channels simultaneously, and on each output channel two convolution results can be computed simultaneously, as shown in fig. 5. Specifically, in the first cycle, the values of the first pixel and of the corresponding first convolution kernel on the IC input channels are fed into the PE for calculation, labeled a and c in fig. 5, respectively. To follow the parallel computation pattern of the four multipliers, the second pixel point on the IC input channels (labeled b in fig. 5) and the value of the corresponding convolution kernel used to compute another output channel (labeled d in fig. 5) are also fed into the PE. Thus, a and b are reused to calculate values at different locations on the same output channel, while c and d are shared to calculate values on different output channels. In the same manner, the data for the second location is input in the second cycle. Thus, after KW × KH cycles, one PE can calculate four convolution results.
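The PE schedule just described can be summarized by the following sketch: in each of the KW × KH cycles two pixels (a, b) and two kernel values (c, d) enter the PE, and the four partial results a·c, b·c, a·d and b·d are accumulated separately, yielding two pixel positions on each of two output channels. The function name and the flattening of the kernel window into flat lists are simplifications for illustration only.

```python
# Illustrative sketch of the PE schedule described above: per cycle, two input
# pixels (a, b) and two kernel values (c, d) enter the PE, and four partial
# results are accumulated, giving two pixels on each of two output channels.

def pe_four_results(pixels_a, pixels_b, kernel_c, kernel_d):
    acc = [0.0, 0.0, 0.0, 0.0]            # [a*c, b*c, a*d, b*d]
    for a, b, c, d in zip(pixels_a, pixels_b, kernel_c, kernel_d):
        acc[0] += a * c                   # output channel 0, pixel 0
        acc[1] += b * c                   # output channel 0, pixel 1
        acc[2] += a * d                   # output channel 1, pixel 0
        acc[3] += b * d                   # output channel 1, pixel 1
    return acc

# a 3x3 kernel on a single input channel for brevity (KW*KH = 9 cycles)
print(pe_four_results([1.0] * 9, [0.5] * 9, [0.25] * 9, [-0.125] * 9))
```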
In the present application, Nm multipliers are used in each PE, so the value of IC is designed to be Nm/4; therefore Nm/4 input channels are computed in parallel within each PE. Using the corresponding weights and bias, two output channels are calculated in parallel, with two pixel points calculated on each output channel. When the number of input channels is larger than Nm/4, or the number of pixels per output channel is greater than 2, or the number of output channels is greater than 2, multiple rounds of calculation are required to complete one convolution operation. Because of the relative scales of the PE and the CNN convolutional layer, a convolutional layer often cannot obtain its final result from a single pass of calculation on the PE; the calculation therefore divides the convolutional layer into several parts, one part is placed on the PE for calculation, and the result of that calculation is an intermediate result. This intermediate result is stored in the OFMB and is retrieved from the OFMB when the next part is calculated. To improve parallelism, Np PEs are used in this design; different PEs can be fed pixel points from different input feature maps and different weights so as to perform parallel computation in different dimensions. For example, all PEs may share the same input feature map and use different weights to compute different output channels in parallel, or all PEs may share the same parameters and use different input feature maps to compute the input channels in parallel. The parameters Nm and Np are determined by considering the CNN network structure, the throughput and the bandwidth requirements.
According to the calculation mode in the PE, both the IFMB and the WB are set to provide Nm/2 input activation values and Nm/2 weights to each PE per cycle, while the OFMB needs to save four output activation values per cycle. Although each pixel point on the output feature map is finally saved as a low-precision floating point number, an intermediate result is saved as 16 bits to reduce precision loss; thus the bit width of the OFMB needs to be set to 64 bits for each PE. Since the input activation values or weights may be shared by different PEs in different parallel computation modes, two parameters Pifm and Pofm (Pifm × Pofm = Np) are defined to represent the number of PEs used to compute the input and output feature maps in parallel, respectively. Thus, Pifm PEs share the same weights, and Pofm PEs share the same input activation values. The bit widths of the IFMB, WB and OFMB need to be set as:
(Nm/2) × BW × Pifm for the IFMB, (Nm/2) × BW × Pofm for the WB,
and 64Np for the OFMB, where BW represents the bit width of the low-precision floating point number. The parameters Nm, Pifm and Pofm are determined by balancing throughput, bandwidth requirements and resource usage. The sizes of the three on-chip buffers are likewise determined by jointly considering throughput and resource usage. In the design of the processor, throughput, bandwidth requirements, resource utilization and scalability are considered in a balanced manner, so the buffer sizes are chosen to be large enough to hide the DMA transfer time. In a non-customized circuit, such as an FPGA implementation, the IFMB and OFMB are implemented with block RAM and the WB is implemented with distributed RAM, because distributed memory can provide larger bandwidth. In the CNN forward calculation process, the external memory is accessed to read new input feature maps or weights, or to save the output feature map, only when all the input feature maps have been used, or all the weights have been used, or the OFMB is full.
The specific implementation of 4 multipliers in one DSP is as follows. In the FPGA implementation, the data representation form M4E3 is used. To explain clearly how four low-precision floating-point multipliers are implemented in one DSP, the multiplication of two normalized numbers is used as an example. The mantissa of the product of the two numbers may be expressed as:
1.Mx × 1.My = 0.Mx × 0.My + (1.Mx + 0.My)
where Mx, My, Ex and Ey denote the mantissas and exponents of X and Y, respectively. The term 0.Mx × 0.My + (1.Mx + 0.My) can be realized by a 4-bit unsigned fixed-point multiplier-adder, and the term Ex + Ey can be realized by a 3-bit unsigned fixed-point adder. Since the DSPs in the Xilinx 7-series FPGAs can implement a multiplier-adder P = A × B + C (where the maximum bit widths of A, B and C are 25, 18 and 48, respectively), blank bits are added to each input port so that the DSP is fully used to implement four 4-bit multiplier-adders; the specific input form of each DSP port is shown in fig. 6. During the calculation, the decimal point is placed at the rightmost position, that is, 0.Mx and 0.My are converted into 4-bit positive numbers and 1.Mx + 0.My into a 10-bit positive number, to ensure that no overlap occurs during the calculation. In this way, with the exponent additions and the additions of the term 1.Mx + 0.My implemented using a small number of look-up tables (LUTs) and flip-flops (FFs), one DSP can be used to implement the multiplication of 4 numbers in the M4E3 data representation form, thereby greatly increasing the throughput of a single DSP.
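One way to picture the packing (an illustration of the general idea, not necessarily the exact port assignment of fig. 6) is the following sketch: two 4-bit mantissas are packed into port A and two into port B with blank guard bits between them, so a single wide multiplication P = A × B yields four 4-bit × 4-bit products in disjoint bit fields of P; the additive C term is omitted here for clarity, and the chosen guard spacing is an assumption that simply keeps the 25/18/48-bit port limits satisfied.

```python
# One possible packing (illustration only): four 4-bit x 4-bit mantissa
# products from one wide multiplication, with blank (guard) bits keeping
# the four results in disjoint bit fields of P = A * B + C (C = 0 here).

GUARD = 9                      # field spacing: 8-bit product + 1 blank bit

def pack_two(hi, lo, shift):
    return (hi << shift) | lo

def dsp_four_mults(a1, a0, b1, b0):
    """Emulate P = A*B with a1,a0 packed into port A and b1,b0 packed into
    port B, then slice the four products out of disjoint fields of P."""
    A = pack_two(a1, a0, 2 * GUARD)        # uses 22 bits, within the 25-bit port
    B = pack_two(b1, b0, GUARD)            # uses 13 bits, within the 18-bit port
    P = A * B                              # fits well within the 48-bit output
    mask = (1 << GUARD) - 1
    return [(P >> s) & mask for s in (0, GUARD, 2 * GUARD, 3 * GUARD)]
    # fields, low to high: a0*b0, a0*b1, a1*b0, a1*b1

print(dsp_four_mults(15, 9, 7, 3))         # [27, 63, 45, 105]
```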
To sum up, this embodiment quantizes the single-precision floating point numbers of the original picture and the weights into floating point numbers in the low-precision representation form MaEb, based on the low-precision floating-point representation form and the optimal scale factor. The low-precision floating-point multiplication of MaEb floating point numbers is split into an a-bit multiplier-adder and a b-bit adder; based on the multiplier-adder P = A × B + C realized by the DSP, several a-bit multiplier-adders are realized by adding blank bits at the input ports. For example, a 4-bit multiplier-adder and a 3-bit adder realize the multiplication of 8-bit low-precision floating point numbers, and 4 such low-precision floating-point multiplications are realized in one DSP, which is equivalent to realizing the multiplications of four convolution operations in one DSP; compared with the existing methods, in which one DSP can realize at most two multiplications, the acceleration performance on a customized or non-customized circuit is greatly improved while ensuring accuracy. The throughput is improved by 64.5 times compared with an Intel i9 CPU and by 1.5 times compared with the existing FPGA accelerator; for the VGG16 and YOLO convolutional neural networks, compared with six existing FPGA accelerators, the throughput is improved by 3.5 times and 27.5 times respectively, and the throughput of a single DSP is improved by 4.1 times and 5 times respectively. Meanwhile, when the forward calculation of the convolutional layer is performed based on the quantization method, converting the fixed-point accumulation result into a floating point number saves storage resources, and converting floating-point accumulation into fixed-point accumulation saves a large amount of customized-circuit or non-customized-circuit resources, thereby improving the throughput of the customized or non-customized circuit.
Example 2
Based on embodiment 1, a system includes a customized circuit or an un-customized circuit, the customized circuit includes an ASIC or an SOC, the un-customized circuit includes an FPGA, as shown in fig. 1, the customized circuit or the un-customized circuit includes a floating point function module, the floating point function module is configured to receive an input activation value and a weight from a storage system according to a control signal, distribute the input activation value and the weight to different processing units PE, and calculate a convolution quantized to a MaEb floating point number by a low precision floating point number representation in parallel, where 0 < a + b < 31, and a and b are positive integers;
the storage system is used for caching the input characteristic diagram, the weight and the output characteristic diagram;
the central control module is used for arbitrating the floating point number functional module and the storage system after decoding the instruction into a control signal;
the floating-point number functional module comprises N parallel processing units PE, and each processing unit PE realizes Nm MaEb floating-point multipliers through DSP, wherein N is a positive integer and Nm indicates the number of low-precision floating-point multipliers of one processing unit PE in the floating-point number functional module.
Each processing element PE comprises 4T parallel branches, each of which contains Nm/(4T) multipliers (Nm being an integer multiple of 4T); the multipliers, the data conversion module, the fixed-point addition tree and the post-processing unit are connected in sequence, wherein T is a positive integer.
The storage system comprises an input characteristic diagram caching module IFMB, a weight caching module WB and an output characteristic diagram caching module OFMB with a ping-pong structure.
The post-processing unit comprises an accumulator, a pooling layer and an activation function which are connected in sequence.
In MaEb, the values of a and b are taken as 4 and 3, namely M4E3; T is 1 and Nm is 8; each processing unit PE includes 4 parallel branches, each of which includes 2 multipliers, 2 data conversion modules, 1 fixed-point addition tree and 1 post-processing unit PPM.
The MaEb floating point numbers are distributed to the Nm parallel low-precision floating-point multipliers in the floating-point number functional module for forward computation to obtain full-precision floating-point products, wherein Nm represents the number of low-precision floating-point multipliers of one processing unit PE in the floating-point number functional module; the full-precision floating-point products are transmitted to the data conversion module to obtain fixed-point results without precision loss; after distributing the fixed-point results to the four parallel (T = 1) fixed-point addition trees, the addition-tree results and the bias in the input data are sequentially accumulated, pooled and activated through the post-processing unit to finish the calculation of the convolutional layer.
In summary, the above modules calculate one convolutional layer; for acceleration of a convolutional neural network CNN, each layer is calculated by the above modules. In combination with the central control module and the storage system, the floating-point number functional module receives input activation values and weights from the storage system according to the control signal, distributes them to different processing units PE, and computes in parallel the convolution of data quantized into MaEb floating point numbers through the low-precision floating-point representation form. Based on the MaEb floating point numbers, the method and device ensure the accuracy of the quantized convolutional neural network without retraining, and each processing unit PE realizes Nm MaEb floating-point multipliers through DSP, greatly improving the acceleration performance on a customized or non-customized circuit while ensuring accuracy.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (12)

1. The CNN accelerated calculation method based on the low-precision floating point data expression form is characterized in that: the method comprises the following steps:
the central control module generates a control signal to arbitrate the floating-point number functional module and the storage system;
the floating-point number functional module receives an input activation value and a weight from a storage system according to a control signal, and distributes the input activation value and the weight to different processing units PE to carry out convolution calculation of each convolution layer, so that CNN accelerated calculation is completed;
the convolution calculation includes forward calculation of convolution layers completed by performing dot product calculation on MaEb floating point numbers quantized through low-precision floating point number representation forms, wherein a and b are positive integers.
2. The CNN accelerated computation method based on a representation of low-precision floating-point data according to claim 1, wherein: the forward calculation of the convolution layer completed by performing dot product calculation through the MaEb floating point number quantized by the low-precision floating point number representation form comprises the following steps:
step a: quantizing input data of the single-precision floating point number into a floating point number of MaEb in a low-precision floating point number expression form, wherein the input data comprises an input activation value, a weight and a bias, and a + b is more than 0 and less than or equal to 31;
step b: distributing the MaEb floating point numbers to the Nm parallel low-precision floating-point multipliers in the floating-point number functional module for forward computation to obtain full-precision floating-point products, wherein Nm represents the number of low-precision floating-point multipliers of one processing unit PE in the floating-point number functional module;
step c: transmitting the full-precision floating point number product to a data conversion module to obtain a fixed point number result without precision loss;
step d: after distributing the fixed-point results to 4T parallel fixed-point addition trees, sequentially accumulating, pooling and activating the addition-tree results together with the bias in the input data through a post-processing unit to finish the calculation of the convolutional layer, wherein T is a positive integer.
3. The CNN accelerated computation method based on a representation of low-precision floating-point data according to claim 2, wherein: the steps a, b and c comprise the following steps:
the original picture and the weight are quantized into a MaEb floating point number through a low-precision floating point number expression form, the bias is quantized into a 16-bit fixed point number, and the quantized original picture, the weight and the bias are input into the network and stored in an external memory;
after the quantized picture and weights are subjected to low-precision floating-point multiplication to obtain a (2a + b + 4)-bit floating point number, the (2a + b + 4)-bit floating point number is converted into a (2a + 2^(b+1) − 1)-bit fixed point number and then accumulated, and the accumulation result is added to the 16-bit fixed point number of the quantized bias to obtain a 32-bit fixed point number;
and converting the 32-bit fixed point number into a MaEb floating point number as the input of the next layer of the network, and storing the MaEb floating point number into an external memory.
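For concreteness, a short sketch of the bit-width bookkeeping in claim 3, reading the intermediate fixed-point width as 2a + 2^(b+1) − 1 bits as reconstructed above; the (a, b) pairs below are illustrative choices only, not values fixed by the claim.

```python
# Bit-width bookkeeping for an MaEb configuration (illustrative pairs only).
def maeb_widths(a, b):
    product_bits = 2 * a + b + 4           # sign + (2a+2)-bit mantissa product + (b+1)-bit exponent sum
    fixed_bits = 2 * a + 2 ** (b + 1) - 1  # width after shifting out the exponent
    return product_bits, fixed_bits

for a, b in [(4, 3), (5, 2)]:
    p, f = maeb_widths(a, b)
    print(f"M{a}E{b}: {p}-bit product -> {f}-bit fixed point -> "
          f"accumulated and combined with a 16-bit bias into a 32-bit result")
```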
4. The CNN accelerated computation method based on a representation of low-precision floating-point data according to claim 3, wherein quantizing the original picture and the weights into MaEb floating point numbers comprises the following steps:
defining a low precision floating point number representation MaEb of the network, the low precision floating point number representation comprising a sign bit, a mantissa, and an exponent;
in the process of optimizing the low-precision floating-point representation, simultaneously varying the scale factor and the combination of a and b, calculating the mean square error between the weights and activation values of each network layer before and after quantization, and selecting the optimal low-precision floating-point representation and the optimal scale factor under that representation according to the minimum of the mean square errors of the weights and activation values before and after quantization;
based on the optimal low-precision floating-point representation and the optimal scale factor, quantizing the single-precision floating point numbers of the original picture and the weights into floating point numbers in the low-precision floating-point representation MaEb;
when a is 4 or 5, the network quantized in this low-precision floating-point representation gives the optimal result.
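A rough software sketch of the mean-square-error search in claim 4 follows. The quantizer, the candidate scale factors, the exponent-bias convention and the 8-bit budget are all assumptions introduced for illustration; the claim itself only fixes the criterion (minimum MSE of weights and activations before and after quantization).

```python
import itertools
import numpy as np

def quantize_maeb(x, a, b, scale, exp_bias=0):
    """Hypothetical MaEb quantizer: a-bit mantissa with an implicit leading 1
    and an unsigned b-bit exponent with an assumed bias. Values are scaled
    before quantization and rescaled afterwards; subnormals flush to zero."""
    y = np.asarray(x, dtype=np.float64) * scale
    sign, mag = np.sign(y), np.abs(y)
    out = np.zeros_like(mag)
    nz = mag >= 2.0 ** (-exp_bias)                       # below min normal -> 0
    e = np.clip(np.floor(np.log2(mag[nz])), -exp_bias, (2 ** b - 1) - exp_bias)
    frac = np.clip(mag[nz] / 2.0 ** e, 1.0, 2.0 - 2.0 ** -a)
    out[nz] = np.round(frac * 2 ** a) / 2 ** a * 2.0 ** e
    return sign * out / scale

def search_representation(weights, acts, max_bits=8):
    """Return the (mse, a, b, scale) minimising the layer's quantization MSE."""
    best = None
    for a, b in itertools.product(range(1, max_bits), repeat=2):
        if a + b + 1 > max_bits:                         # sign + mantissa + exponent
            continue
        for scale in (2.0 ** s for s in range(-8, 9)):   # assumed candidate scales
            mse = (np.mean((weights - quantize_maeb(weights, a, b, scale)) ** 2) +
                   np.mean((acts - quantize_maeb(acts, a, b, scale)) ** 2))
            if best is None or mse < best[0]:
                best = (mse, a, b, scale)
    return best

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, 1000)                           # toy layer weights
x = np.abs(rng.normal(0.0, 1.0, 1000))                   # toy activations
print(search_representation(w, x))
```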
5. The CNN accelerated computation method based on a representation of low-precision floating-point data according to claim 3, wherein: the low-precision floating point number multiplication operation of the MaEb floating point number comprises the following steps:
the multiplication of MaEb floating point numbers is decomposed into an a-bit multiplier-adder and a b-bit adder, and the calculation formula is as follows:
X × Y = (-1)^(Sx⊕Sy) × (0.Mx × 0.My + (1.Mx + 0.My)) × 2^(Ex+Ey)
wherein Mx, My, Ex and Ey denote the mantissas and exponents of X and Y, respectively; the term 0.Mx × 0.My + (1.Mx + 0.My) is realized by an a-bit unsigned fixed-point multiply-adder, and the term Ex + Ey is realized by a b-bit unsigned fixed-point adder;
based on the DSP implemented multiplier-adder P ═ a × B + C, the blank bits added at the input ports implement a number of a-bit multiplier-adders, where A, B, C denotes the three input ports of the DSP.
6. The CNN accelerated computation method based on a representation of low-precision floating-point data according to claim 5, wherein: the maximum bit widths of A, B and C are 25, 18 and 48, respectively.
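The claimed trick of fitting several a-bit multiply-adds into one wide DSP multiply-adder by inserting blank guard bits can be emulated in plain integer arithmetic, as in the sketch below. The 4-bit unsigned operands, the gap width and the choice of which operands share a port are assumptions for illustration; the sketch only shows why the packed fields stay separable, not how a specific DSP primitive behaves.

```python
def packed_two_macs(w1, w2, x, c, a=4, gap=4):
    """Emulate two a-bit unsigned multiply-adds with one wide P = A*B + C.

    w1 and w2 share the multiplicand x; w1 is shifted up by enough blank
    bits that w1*x cannot spill into the field holding w2*x + c."""
    field = 2 * a + gap                      # bits reserved for the low result
    assert w2 * x + c < (1 << field)         # guard bits keep the fields apart
    A = (w1 << field) | w2                   # two operands packed into port A
    P = A * x + c                            # one wide multiply-add
    low = P & ((1 << field) - 1)             # recovers w2*x + c
    high = P >> field                        # recovers w1*x
    return high, low

# With a = 4 and gap = 4 the packed operand still fits a 25-bit port.
print(packed_two_macs(w1=13, w2=9, x=15, c=7))   # prints (195, 142)
```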
7. A system based on the method of claim 1, characterized in that: the system comprises a custom circuit or a non-custom circuit, wherein the custom circuit or the non-custom circuit comprises a floating-point functional module; the floating-point functional module is used for receiving input activation values and weights from a storage system according to a control signal, distributing them to different processing units PE for parallel calculation, and computing the convolution of the input activation values and weights quantized into MaEb floating point numbers through the low-precision floating-point representation, wherein a and b are positive integers;
the storage system is used for caching the input characteristic diagram, the weight and the output characteristic diagram;
the central control module is used for arbitrating the floating point number functional module and the storage system after decoding the instruction into a control signal;
the floating-point functional module comprises N parallel processing units PE, and each processing unit PE realizes Nm MaEb floating-point multipliers through DSPs, wherein N is a positive integer and Nm denotes the number of low-precision floating-point multipliers in one processing unit PE of the floating-point functional module.
8. The system of claim 7, wherein: each processing unit PE comprises 4T parallel branches, each of which comprises Nm/(4T) multipliers, Nm being divisible by 4T; the multipliers, the data conversion module, the fixed-point addition tree and the post-processing unit are connected in sequence, wherein T is a positive integer.
9. The system of claim 7, wherein: the storage system comprises an input feature map buffer module IFMB, a weight buffer module WB and an output feature map buffer module OFMB with a ping-pong structure.
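A ping-pong (double-buffered) structure of the kind recited for the feature-map buffers can be pictured with the small software analogy below; the class and method names are hypothetical and only illustrate how one bank is written while the other is read, with the roles swapped between layers.

```python
class PingPongBuffer:
    """Software analogy of a ping-pong buffer: two banks, one written by the
    producer while the other is read by the consumer; swap() flips the roles."""

    def __init__(self, depth):
        self.banks = [[0] * depth, [0] * depth]
        self.write_bank = 0                   # index of the bank being written

    def write(self, addr, value):
        self.banks[self.write_bank][addr] = value

    def read(self, addr):
        return self.banks[1 - self.write_bank][addr]

    def swap(self):
        self.write_bank = 1 - self.write_bank

buf = PingPongBuffer(depth=4)
buf.write(0, 42)          # results of the current layer fill one bank
buf.swap()
print(buf.read(0))        # the next layer reads 42 while the other bank fills
```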
10. The system of claim 8, wherein: the post-processing unit comprises an accumulator, a pooling layer and an activation function which are connected in sequence.
11. The system of claim 7, wherein: a and b satisfy 0 < a + b ≤ 31, and when a is 4 or 5, the network quantized in the low-precision floating-point representation gives the optimal result.
12. The system of claim 7, wherein: the custom circuit comprises an ASIC or an SoC, and the non-custom circuit comprises an FPGA.
CN201910940659.8A 2019-09-30 2019-09-30 CNN hardware acceleration computing method and system based on low-precision floating point data representation form Active CN110852416B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910940659.8A CN110852416B (en) 2019-09-30 2019-09-30 CNN hardware acceleration computing method and system based on low-precision floating point data representation form

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910940659.8A CN110852416B (en) 2019-09-30 2019-09-30 CNN hardware acceleration computing method and system based on low-precision floating point data representation form

Publications (2)

Publication Number Publication Date
CN110852416A true CN110852416A (en) 2020-02-28
CN110852416B CN110852416B (en) 2022-10-04

Family

ID=69596180

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910940659.8A Active CN110852416B (en) 2019-09-30 2019-09-30 CNN hardware acceleration computing method and system based on low-precision floating point data representation form

Country Status (1)

Country Link
CN (1) CN110852416B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414994A (en) * 2020-03-03 2020-07-14 哈尔滨工业大学 FPGA-based Yolov3 network computing acceleration system and acceleration method thereof
CN111507465A (en) * 2020-06-16 2020-08-07 电子科技大学 Configurable convolutional neural network processor circuit
CN111696149A (en) * 2020-06-18 2020-09-22 中国科学技术大学 Quantization method for stereo matching algorithm based on CNN
CN112148249A (en) * 2020-09-18 2020-12-29 北京百度网讯科技有限公司 Dot product operation implementation method and device, electronic equipment and storage medium
CN112541583A (en) * 2020-12-16 2021-03-23 华中光电技术研究所(中国船舶重工集团公司第七一七研究所) Neural network accelerator
CN112598078A (en) * 2020-12-28 2021-04-02 北京达佳互联信息技术有限公司 Hybrid precision training method and device, electronic equipment and storage medium
CN112632878A (en) * 2020-12-10 2021-04-09 中山大学 High-speed low-resource binary convolution unit based on FPGA
CN112734020A (en) * 2020-12-28 2021-04-30 中国电子科技集团公司第十五研究所 Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network
CN113222130A (en) * 2021-04-09 2021-08-06 广东工业大学 Reconfigurable convolution neural network accelerator based on FPGA
CN113762498A (en) * 2020-06-04 2021-12-07 合肥君正科技有限公司 Method for quantizing RoiAlign operator
CN114580628A (en) * 2022-03-14 2022-06-03 北京宏景智驾科技有限公司 Efficient quantization acceleration method and hardware circuit for neural network convolution layer
CN116127255A (en) * 2022-12-14 2023-05-16 北京登临科技有限公司 Convolution operation circuit and related circuit or device with same
CN112598078B (en) * 2020-12-28 2024-04-19 北京达佳互联信息技术有限公司 Hybrid precision training method and device, electronic equipment and storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570559A (en) * 2015-10-09 2017-04-19 阿里巴巴集团控股有限公司 Data processing method and device based on neural network
CN107239829A (en) * 2016-08-12 2017-10-10 北京深鉴科技有限公司 A kind of method of optimized artificial neural network
CN108133262A (en) * 2016-12-01 2018-06-08 上海兆芯集成电路有限公司 With for perform it is efficient 3 dimension convolution memory layouts neural network unit
US20180307980A1 (en) * 2017-04-24 2018-10-25 Intel Corporation Specialized fixed function hardware for efficient convolution
CN108647184A (en) * 2018-05-10 2018-10-12 杭州雄迈集成电路技术有限公司 A kind of Dynamic High-accuracy bit convolution multiplication Fast implementation
CN109800877A (en) * 2019-02-20 2019-05-24 腾讯科技(深圳)有限公司 Parameter regulation means, device and the equipment of neural network
CN109902745A (en) * 2019-03-01 2019-06-18 成都康乔电子有限责任公司 A kind of low precision training based on CNN and 8 integers quantization inference methods
CN110058883A (en) * 2019-03-14 2019-07-26 成都恒创新星科技有限公司 A kind of CNN accelerated method and system based on OPU

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
吴焕 (Wu Huan) et al., "Accelerating Convolutional Neural Network Forward Inference Based on Caffe", Computer Engineering and Design *
王慧丽 (Wang Huili) et al., "Deep Learning Hardware Acceleration Technology Based on General-Purpose Vector DSP", Scientia Sinica *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414994A (en) * 2020-03-03 2020-07-14 哈尔滨工业大学 FPGA-based Yolov3 network computing acceleration system and acceleration method thereof
CN113762498A (en) * 2020-06-04 2021-12-07 合肥君正科技有限公司 Method for quantizing RoiAlign operator
CN113762498B (en) * 2020-06-04 2024-01-23 合肥君正科技有限公司 Method for quantizing RoiAlign operator
CN111507465A (en) * 2020-06-16 2020-08-07 电子科技大学 Configurable convolutional neural network processor circuit
CN111507465B (en) * 2020-06-16 2020-10-23 电子科技大学 Configurable convolutional neural network processor circuit
CN111696149A (en) * 2020-06-18 2020-09-22 中国科学技术大学 Quantization method for stereo matching algorithm based on CNN
CN112148249B (en) * 2020-09-18 2023-08-18 北京百度网讯科技有限公司 Dot product operation realization method and device, electronic equipment and storage medium
JP2022552046A (en) * 2020-09-18 2022-12-15 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Dot product operation implementation method, device, electronic device, and storage medium
CN112148249A (en) * 2020-09-18 2020-12-29 北京百度网讯科技有限公司 Dot product operation implementation method and device, electronic equipment and storage medium
WO2022057502A1 (en) * 2020-09-18 2022-03-24 北京百度网讯科技有限公司 Method and device for implementing dot product operation, electronic device, and storage medium
CN112632878B (en) * 2020-12-10 2023-11-24 中山大学 High-speed low-resource binary convolution unit based on FPGA
CN112632878A (en) * 2020-12-10 2021-04-09 中山大学 High-speed low-resource binary convolution unit based on FPGA
CN112541583A (en) * 2020-12-16 2021-03-23 华中光电技术研究所(中国船舶重工集团公司第七一七研究所) Neural network accelerator
CN112598078A (en) * 2020-12-28 2021-04-02 北京达佳互联信息技术有限公司 Hybrid precision training method and device, electronic equipment and storage medium
CN112734020A (en) * 2020-12-28 2021-04-30 中国电子科技集团公司第十五研究所 Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network
CN112598078B (en) * 2020-12-28 2024-04-19 北京达佳互联信息技术有限公司 Hybrid precision training method and device, electronic equipment and storage medium
CN113222130A (en) * 2021-04-09 2021-08-06 广东工业大学 Reconfigurable convolution neural network accelerator based on FPGA
CN114580628A (en) * 2022-03-14 2022-06-03 北京宏景智驾科技有限公司 Efficient quantization acceleration method and hardware circuit for neural network convolution layer
CN116127255A (en) * 2022-12-14 2023-05-16 北京登临科技有限公司 Convolution operation circuit and related circuit or device with same
CN116127255B (en) * 2022-12-14 2023-10-03 北京登临科技有限公司 Convolution operation circuit and related circuit or device with same

Also Published As

Publication number Publication date
CN110852416B (en) 2022-10-04

Similar Documents

Publication Publication Date Title
CN110852416B (en) CNN hardware acceleration computing method and system based on low-precision floating point data representation form
CN110852434B (en) CNN quantization method, forward calculation method and hardware device based on low-precision floating point number
KR102647858B1 (en) Low-power hardware acceleration method and system for convolution neural network computation
Ko et al. Design and application of faithfully rounded and truncated multipliers with combined deletion, reduction, truncation, and rounding
JP6528893B1 (en) Learning program, learning method, information processing apparatus
EP3853713A1 (en) Multiply and accumulate circuit
US10491239B1 (en) Large-scale computations using an adaptive numerical format
US20200401873A1 (en) Hardware architecture and processing method for neural network activation function
US10872295B1 (en) Residual quantization of bit-shift weights in an artificial neural network
CN111240746B (en) Floating point data inverse quantization and quantization method and equipment
CN109165006B (en) Design optimization and hardware implementation method and system of Softmax function
CN111091183B (en) Neural network acceleration system and method
Wu et al. Efficient dynamic fixed-point quantization of CNN inference accelerators for edge devices
WO2022170811A1 (en) Fixed-point multiply-add operation unit and method suitable for mixed-precision neural network
US20230376274A1 (en) Floating-point multiply-accumulate unit facilitating variable data precisions
US20200311545A1 (en) Information processor, information processing method, and storage medium
CN111492369A (en) Residual quantization of shift weights in artificial neural networks
CN114860193A (en) Hardware operation circuit for calculating Power function and data processing method
US20220334802A1 (en) Information processing apparatus, information processing system, and information processing method
CN111258545B (en) Multiplier, data processing method, chip and electronic equipment
Vinh et al. FPGA Implementation of Trigonometric Function Using Loop-Optimized Radix-4 CORDIC
CN113419779B (en) Scalable multi-precision data pipeline system and method
JP7247418B2 (en) Computing unit, method and computer program for multiplication
WO2023004799A1 (en) Electronic device and neural network quantization method
CN117632079A (en) Hardware accelerator for floating point operations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200609

Address after: Room 305, building 9, meizhuang new village, 25 Yangzi Jiangbei Road, Weiyang District, Yangzhou City, Jiangsu Province 225000

Applicant after: Liang Lei

Address before: 610094 China (Sichuan) Free Trade Pilot Area, Chengdu City, Sichuan Province, 1402, Block 199, Tianfu Fourth Street, Chengdu High-tech Zone

Applicant before: Chengdu Star Innovation Technology Co.,Ltd.

GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20221220

Address after: 518017 1110, Building 3, Northwest Shenjiu Science and Technology Pioneer Park, the intersection of Taohua Road and Binglang Road, Fubao Community, Fubao Street, Shenzhen, Guangdong

Patentee after: Shenzhen biong core technology Co.,Ltd.

Address before: Room 305, Building 9, Meizhuang New Village, No. 25, Yangzijiang North Road, Weiyang District, Yangzhou City, Jiangsu Province, 225000

Patentee before: Liang Lei