CN111832719A - Fixed point quantization convolution neural network accelerator calculation circuit - Google Patents

Fixed point quantization convolution neural network accelerator calculation circuit

Info

Publication number
CN111832719A
CN111832719A (application number CN202010736970.3A)
Authority
CN
China
Prior art keywords
input
adder
quantization
neural network
multiplier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010736970.3A
Other languages
Chinese (zh)
Inventor
贺雅娟
周航
蔡卢麟
朱飞宇
候博文
张波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202010736970.3A
Publication of CN111832719A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

A fixed-point quantization convolutional neural network accelerator calculation circuit belongs to the technical field of integrated circuits. N input channel processing units process the input data of N input channels: in each channel, the input feature maps are multiplied by the corresponding weights and the products are rapidly added to obtain the convolution result of that channel. The partial accumulation adding unit accumulates the convolution results of the N input channels and outputs the sum to the quantization activation unit. The quantization activation unit sequentially performs bias accumulation, multiplication by the approximate multiplier value, arithmetic right shift, function activation, addition of the zero-point data and limiting of the output bit width to obtain the output of the convolutional neural network accelerator calculation circuit. The circuit accelerates the computation of a fixed-point quantized convolutional neural network without noticeable loss of accuracy, has low power consumption and small circuit area, and is suitable for convolutional neural network systems requiring integer quantization.

Description

Fixed point quantization convolution neural network accelerator calculation circuit
Technical Field
The invention belongs to the field of integrated circuits, and relates to a computing circuit of a fixed-point quantization convolutional neural network accelerator.
Background
Convolutional Neural Networks (CNNs) have enjoyed great success in the field of image recognition owing to their excellent predictive performance. Nevertheless, modern CNNs with high inference accuracy typically have a large model size and high computational complexity, which makes deployment on data centers or edge devices difficult, especially in application scenarios requiring low resource consumption or low response delay. To facilitate the application of complex CNNs, the emerging field of model compression focuses on reducing the model size and execution time of CNNs with minimal loss of accuracy.
Network parameters can be quantized down to 1 bit: XNOR-Net and related network variants achieve a 32× compression of the network parameters, greatly reducing model capacity as well as memory and bandwidth overhead. However, the marked drop in inference accuracy makes it difficult for current binary networks to satisfy practical applications. Ternary networks use 2 bits to represent model parameters, extending the set of quantized values to {-1, 0, +1}; this helps preserve inference accuracy and gives them better application potential than binary networks. Compared with binary and ternary networks, INT8 quantization has a much wider parameter space and therefore retains inference accuracy better, especially for complex networks. For this reason, INT8 quantization has been widely adopted in industry, for example in the TensorFlow-Lite and TensorRT platforms.
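For readers unfamiliar with the INT8 schemes mentioned above, the sketch below illustrates the affine mapping they are built on, in which a real value r is represented by an 8-bit integer q via r ≈ S·(q − Z) with a real scale S and an integer zero point Z. This is an illustrative Python model only; the scale and zero-point values are made-up examples, not parameters of the patented circuit.

```python
# Illustrative sketch of TensorFlow-Lite-style INT8 affine quantization:
# a real value r is represented as r ~= S * (q - Z), with q an int8 code,
# S a positive real scale and Z an integer zero point.
# The scale/zero-point values below are arbitrary examples.

def quantize(r, scale, zero_point):
    """Map a real value to an int8 code, clamped to [-128, 127]."""
    q = round(r / scale) + zero_point
    return max(-128, min(127, q))

def dequantize(q, scale, zero_point):
    """Recover the approximate real value from its int8 code."""
    return scale * (q - zero_point)

scale, zero_point = 0.05, 10          # example parameters
for r in (-3.2, 0.0, 1.7):
    q = quantize(r, scale, zero_point)
    print(r, q, dequantize(q, scale, zero_point))
```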
Although fixed-point quantization reduces the complexity of the convolution computation, deep neural networks still require a very large number of operations. By exploiting the error tolerance of convolutional neural networks, approximate computing can be introduced at the hardware level to further lower power consumption, so that the low-power requirements of embedded systems can be met. In a convolutional neural network, the convolutional layers account for more than ninety percent of the total computation. Introducing approximation into the multiply-accumulate operations, for example by designing the multipliers as approximate multipliers, reduces the power consumption of the calculation circuit without introducing significant error into the prediction system.
Disclosure of Invention
Aiming at the power consumption caused by the large number of operations in the convolutional layers of a fixed-point quantized convolutional neural network, the invention provides a fixed-point quantization convolutional neural network accelerator calculation circuit that carries out these convolutional-layer operations, accelerates the computation of the fixed-point quantized network without noticeable loss of accuracy, and has the characteristics of low power consumption and small circuit area.
The technical scheme adopted by the invention is as follows:
a fixed point quantization convolution neural network accelerator calculation circuit comprises N input channel processing units, a partial accumulation adding unit and a quantization activation unit, wherein N is a positive integer;
the input channel processing unit is used for processing the input data of one input channel, the input data of the input channel comprising a plurality of input feature maps and a plurality of weights; the input channel processing unit comprises an approximate multiplier array and a fast addition unit, the approximate multiplier array being used for multiplying the plurality of input feature maps by the corresponding plurality of weights in the input data of one input channel; the fast addition unit is used for compressing and then adding the partial product results output by the approximate multiplier array to obtain the convolution result of the input channel;
the partial accumulation adding unit is used for accumulating all convolution results of the N input channels output by the N input channel processing units respectively and outputting the result to the quantization activation unit;
the quantization activation unit includes:
a first adder, the first input end of which is connected with the convolution accumulation results of the N input channels output by the partial product accumulation unit, and the second input end of which is connected with the offset data;
a multiplier, a first multiplier input end of which is connected with the output end of the first adder, and a second multiplier input end of which is connected with an approximate multiplier;
the input end of the arithmetic right shift unit is connected with the output end of the multiplier;
the input end of the function activation unit is connected with the output end of the arithmetic right shift unit;
a second adder, a first input end of which is connected with the output end of the function activation unit, and a second input end of which is connected with zero data;
and the input end of the amplitude limiting unit is connected with the output end of the second adder and is used for limiting the output bit number of the second adder to a specified bit number, and the output end of the amplitude limiting unit is used as the output end of the convolutional neural network accelerator calculation circuit.
Specifically, the approximate multiplier array includes a plurality of approximate multipliers. The number of approximate multipliers in the array of each input channel processing unit, and the number of input feature maps and weights in the input data of each input channel, are determined by the convolution kernel of the convolutional neural network: for an M × M convolution kernel, M being a positive integer, each input channel processing unit is provided with M × M approximate multipliers forming the approximate multiplier array, and with M × M input feature map inputs and M × M weight inputs for receiving the input data of one input channel.
Specifically, the approximate multiplier generates its partial products using Booth encoding. Sign-bit extension is applied to the partial products obtained after Booth encoding to form a partial product array, which is divided into three parts according to bit weight: the lowest-weight part is approximately compressed with OR gates; the highest-weight part is exactly compressed with 3-2 and 4-2 compressors; in the remaining part, the sign bit is ORed with the corresponding partial product and the result is approximately compressed with a 4-2 compressor. The compressed partial products are added by a third adder to obtain the partial product result output by the approximate multiplier array.
Specifically, the fast addition unit includes a Wallace tree and a fourth adder. The Wallace tree performs sign-bit extension on the partial product results output by the approximate multiplier array and then compresses them in three stages using 4-2 and 3-2 compressors; the fourth adder adds the partial products remaining after the three compression stages to obtain the convolution result of the input channel.
Specifically, the partial product accumulation unit includes a fifth adder, a data selector and an intermediate result register,
the first input end of the fifth adder is connected with convolution results of N input channels output by the N input channel processing units respectively, the second input end of the fifth adder is connected with the output end of the intermediate result register, and the output end of the fifth adder is connected with the input end of the data selector;
the data selector outputs the output data of the fifth adder to the intermediate result register while the fifth adder has not yet completed accumulating the convolution results of the N input channels, and outputs the output data of the fifth adder to the quantization activation unit after the fifth adder has completed accumulating the convolution results of the N input channels.
Specifically, the convolutional neural network adopts INT8 data-type quantization, and the quantization scheme follows the TensorFlow quantization specification.
Specifically, the approximate multiplier value applied to the second multiplier input of the multiplier, the right-shift amount of the arithmetic right shift unit, and the zero-point data applied to the second input of the second adder are obtained from the TensorFlow quantization algorithm; the function activation unit uses ReLU activation, with the activation expression:
$$f(x)=\begin{cases} x, & x > 0 \\ 0, & x \le 0 \end{cases}$$
the invention has the beneficial effects that: according to the invention, by optimizing the calculation unit, the memory space occupied by the weight and the characteristic value is effectively reduced, so that more data can be transmitted under the same bandwidth, the throughput rate and the energy efficiency are improved, the calculation speed of the convolutional neural network with fixed point quantization is accelerated on the premise of not generating obvious precision loss, the characteristics of low power consumption and small circuit area are realized, and the method is suitable for the convolutional neural network system needing shaping quantization.
Drawings
Fig. 1 is an overall structural diagram of a computation circuit of a fixed-point quantization convolutional neural network accelerator according to an embodiment of the present invention.
Fig. 2 is a radix-4 Booth coding circuit used in an embodiment of an approximate multiplier in a fixed-point quantization convolutional neural network accelerator calculation circuit provided by the invention.
Fig. 3 is a partial product generated by an approximate multiplier in a computation circuit of a fixed-point quantization convolutional neural network accelerator according to an embodiment of the present invention after Booth encoding is performed.
Fig. 4 is a schematic diagram of a specific operation process of an approximate multiplier in a computation circuit of a fixed-point quantization convolutional neural network accelerator according to an embodiment of the present invention.
FIG. 5 is a block diagram of a 3-2 compressor and a 4-2 compressor used in a computation circuit of a convolutional neural network accelerator with fixed point quantization according to the present invention.
Fig. 6 is a detailed structural diagram of a Wallace tree employed in an embodiment of the computation circuit of the convolutional neural network accelerator with fixed point quantization according to the present invention.
Fig. 7 is a detailed structural diagram of a quantization activation unit in a computation circuit of a convolutional neural network accelerator with fixed point quantization according to the present invention.
Detailed Description
The technical solution of the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
The present invention provides a convolutional neural network accelerator calculation circuit based on fixed-point quantization. It is described below taking a convolutional neural network with INT8 integer quantization and the TensorFlow quantization specification as an example, but the specific quantization type and quantization scheme should not be construed as limiting the invention. The convolutional neural network is quantized to the INT8 data type, and on this basis the accelerator calculation circuit provided by the invention is applied. After integer quantization, the memory space occupied by the weights is reduced to 1/4 of the original, so more data can be transmitted under the same bandwidth; moreover, the computation becomes fixed-point, which further improves throughput and energy efficiency. In this embodiment the inputs of the accelerator calculation circuit are the weights and feature maps as INT8 data, together with the shift amount and the bias.
As shown in Fig. 1, the fixed-point quantization convolutional neural network accelerator calculation circuit provided by the present invention includes N input channel processing units, a partial accumulation adding unit and a quantization activation unit, where N is a positive integer determined by the number of input channels and can be chosen according to factors such as available resources.
Each input channel processing unit processes the input data of one input channel, which comprises a number of input feature map values and a number of weights. Each input channel processing unit comprises an approximate multiplier array and a fast addition unit. The approximate multiplier array contains a plurality of approximate multipliers that multiply the feature map values by the corresponding weights of that channel. The number of approximate multipliers in each array, and the number of feature map values and weights in the input data of each channel, are determined by the convolution kernel of the convolutional neural network: for an M × M convolution kernel, M being a positive integer, each input channel processing unit is provided with M × M approximate multipliers forming the array, and with M × M feature map inputs and M × M weight inputs for receiving the input data of one channel. For example, for a 3 × 3 convolution kernel, the number of feature map inputs, weight inputs and approximate multipliers in each input channel processing unit is 9. Because of the INT8 integer quantization, each feature map input and each weight input is 8 bits wide; each approximate multiplier takes two 8-bit signed operands and outputs a 16-bit signed result. Each input channel processing unit therefore has nine 8-bit feature map inputs and nine 8-bit weight inputs, so the feature map and weight input data of each channel are 9 × 8 bits each and the output data are 9 × 16 bits; for other quantization schemes the bit widths change accordingly. This embodiment takes 3 × 3 convolution kernels and N parallel input channels as an example, which should not be construed as limiting the invention; the circuit can be extended to support several convolution kernel sizes, e.g. 3 × 3, 5 × 5, 7 × 7, 11 × 11 and so on, by increasing the number of feature map inputs, weight inputs and 8 × 8 approximate multipliers accordingly. In addition, this embodiment uses fixed-point two's-complement representation, and the range of the input feature values and weights is [-128, 127].
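The following Python sketch is a purely behavioral model of the dataflow just described: each input channel processing unit forms nine products for one 3 × 3 window and sums them, and the per-channel results are then accumulated over the N channels. Exact integer arithmetic stands in for the approximate multipliers and the Wallace-tree/adder hardware, and the data values are made-up examples.

```python
# Behavioral sketch (not the hardware) of the dataflow in Fig. 1:
# each input-channel processing unit multiplies nine 8-bit feature
# values by nine 8-bit weights and sums them; the partial accumulation
# adding unit then adds the per-channel results together.

def channel_unit(features, weights):
    """One input-channel processing unit for a 3x3 window (9 MACs)."""
    assert len(features) == len(weights) == 9
    return sum(f * w for f, w in zip(features, weights))   # 9 products, fast-added

def accumulate_channels(feature_windows, weight_windows):
    """Accumulate the convolution results over N input channels."""
    acc = 0                                   # models the 32-bit intermediate register
    for feats, wts in zip(feature_windows, weight_windows):
        acc += channel_unit(feats, wts)       # fifth adder accumulates channel results
    return acc                                # sent on to the quantization activation unit

# Example with N = 2 channels and made-up INT8 data in [-128, 127]
fmaps = [[1, -2, 3, 4, -5, 6, 7, -8, 9], [10, 11, -12, 13, 14, -15, 16, 17, -18]]
wts   = [[2, 2, 2, -1, -1, -1, 0, 0, 1],  [1, -1, 1, -1, 1, -1, 1, -1, 1]]
print(accumulate_channels(fmaps, wts))
```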
In this embodiment, the 8-bit × 8-bit approximate multiplier generates its partial products using radix-4 Booth encoding; it can be replaced by other kinds of approximate multipliers according to the required prediction accuracy and power consumption. The Booth encoding circuit that generates the partial products is shown in Fig. 2, where Bn and An denote the bits of the two M-bit (8-bit in this embodiment) signed binary operands; for example, B0 is bit 0 of the binary representation of B, and A7 is bit 7 of the binary representation of A. For convenience, the operands are written as A = a7 a6 … a0 and B = b7 b6 … b0. The expression for A × B is derived as follows:
$$A \times B = \sum_{i=0}^{3} \left(-2\,b_{2i+1} + b_{2i} + b_{2i-1}\right) \cdot A \cdot 4^{i}$$
where $b_{-1} = 0$. Each term in the sum is called a partial product. The multiplier B in the formula is scanned as a sequence of overlapping 3-bit groups, and each partial product is one of -2A, -A, -0, +0, +A and +2A.
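The decomposition above can be checked numerically. The sketch below (illustrative only, not the patent circuit) extracts the overlapping 3-bit groups of an 8-bit two's-complement multiplier B with b(-1) = 0, maps each group to a Booth digit in {-2, -1, 0, +1, +2}, and verifies exhaustively that the weighted sum of the resulting partial products reproduces A × B.

```python
# Illustrative check of the radix-4 Booth decomposition above:
# overlapping groups {b(2i+1), b(2i), b(2i-1)} of the multiplier B,
# with b(-1) = 0, each yield a digit d_i = -2*b(2i+1) + b(2i) + b(2i-1),
# and A * B = sum_i d_i * A * 4**i.

def booth_digits(B, width=8):
    """Radix-4 Booth digits of a two's-complement 'width'-bit multiplier B."""
    bits = [0] + [(B >> k) & 1 for k in range(width)]    # prepend b(-1) = 0
    digits = []
    for i in range(width // 2):
        b_lo, b_mid, b_hi = bits[2 * i], bits[2 * i + 1], bits[2 * i + 2]
        digits.append(-2 * b_hi + b_mid + b_lo)          # one of -2, -1, 0, +1, +2
    return digits

def booth_product(A, B, width=8):
    """Rebuild A*B from the Booth digits: each digit selects 0, +-A or +-2A."""
    return sum(d * A * 4 ** i for i, d in enumerate(booth_digits(B, width)))

# Exhaustive check over the signed 8-bit range used in the embodiment
assert all(booth_product(A, B) == A * B
           for A in range(-128, 128) for B in range(-128, 128))
```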
Each partial product is generated in the circuit from the Booth-encoded output signals, whose logical expressions are as follows:
Neg = $b_{2i+1}$
$X1 = b_{2i} \oplus b_{2i-1}$
$X2 = (b_{2i+1} \oplus b_{2i}) \wedge \overline{(b_{2i} \oplus b_{2i-1})}$
$Z = \overline{(b_{2i+1} \oplus b_{2i})} \wedge \overline{(b_{2i} \oplus b_{2i-1})}$
neg, X1, X2, and Z are encoded by 3 adjacent multipliers { b2i +1, b2i,2b2i-1 }. Neg is the sign offset bit. When the partial product result is-2A, -A and-0, the multiplicand needs to be added with 1, Neg has the effect of complementing the multiplicand, and the value of Neg is 1. When the partial product is 2A, A and 0, the complement of the multiplicand is equal to the original code, and the Neg bit is 0. The Z signal is to prevent the multiplicand from shifting left when the partial product is 0 and-0. Neg, X1, X2, Z produce a circuit as shown in the left circuit in FIG. 2.
$$PP_{ij} = \left[(a_j \wedge X1) \vee (a_{j-1} \wedge X2)\right] \oplus Neg$$
PPij is the partial product bit in row i and column j; it is formed by logically combining the multiplicand bits aj, aj-1 with the Booth-encoded signals derived from {b2i+1, b2i, b2i-1}, as shown in the right-hand circuit of Fig. 2.
The partial product array generated after sign-bit extension is shown in Fig. 3, and the approximate compression scheme in Fig. 4. The basic idea of compressing the partial products generated by radix-4 Booth encoding in the proposed approximate multiplier is to compress the low-order columns with approximate compressors and the high-order columns with exact compressors. Booth encoding produces 5 partial products, and the partial product array is divided into three parts: the first part (the high 8 bits) is compressed exactly, while the second and third parts (the low 8 bits) are compressed approximately. As shown in Fig. 4, the lowest-weight part, i.e. the third part, is compressed with multi-input single-output OR gates, directly producing the approximate compression result P0-P5; the highest-weight part, i.e. the first part, is compressed with 3-2 and 4-2 compressors; in the remaining second part, P30 is ORed with Neg3 and the result is compressed with a 4-2 compressor. After the first round of compression two rows of partial products remain, and a third adder finally produces the 16-bit multiplication result. In this embodiment, taking a 3 × 3 convolution kernel and assuming a single parallel input channel, the input data are nine 8-bit feature values and nine 8-bit weights; nine approximate multiplications are carried out simultaneously by nine identical approximate multipliers, giving nine 16-bit results. Because a convolutional neural network has a certain tolerance to noise and the convolution operation accumulates the results of many multiplications, the classification prediction is not significantly affected if the accumulated error is close to 0; even if the accumulated error is not 0, the loss of accuracy is acceptable because the convolutional neural network is itself fault-tolerant.
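As a rough illustration of where the approximation error comes from, the following simplified model OR-compresses the lowest bit columns of a set of aligned rows and adds the remaining columns exactly. It deliberately ignores the signed Booth partial-product array, the three-part split and the compressor structure of Figs. 3-4; the rows and the column split are made-up examples.

```python
# Simplified, unsigned model of the compression idea: the 'low_cols'
# lowest bit columns are reduced with one OR gate per column (so 1+1
# is approximated as 1 and carries out of these columns are dropped),
# while the higher columns are added exactly.

def approx_column_sum(rows, low_cols):
    """OR-compress the low columns, add the high columns exactly."""
    low = 0
    for col in range(low_cols):
        if any((r >> col) & 1 for r in rows):      # one OR gate per low column
            low |= 1 << col
    high = sum(r >> low_cols for r in rows) << low_cols   # exact compression
    return high + low

rows = [0b01101101, 0b00110110, 0b01011011]        # made-up partial-product rows
exact = sum(rows)
approx = approx_column_sum(rows, low_cols=4)
print(exact, approx, exact - approx)               # exact sum, approximate sum, error
```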
The fast addition unit compresses and then adds the partial product results output by the approximate multiplier array to obtain the convolution result of one input channel. As shown in Fig. 1, the fast addition unit in this embodiment comprises a Wallace tree and a fourth adder, where the fourth adder may be a ripple-carry adder; the Wallace tree and the fourth adder rapidly add the 16-bit results output by the nine approximate multipliers of the corresponding input channel, and the Wallace tree is built from a number of 3-2 and 4-2 compressors. As shown in Fig. 6, taking the compression of the results of a 3 × 3 convolution kernel as an example, the Wallace tree rapidly accumulates nine 16-bit operands. To prevent data overflow, all 16-bit operands are sign-extended to 20 bits, and the nine operands are then compressed in three stages using 4-2 and 3-2 compressors: the first stage uses 40 4-2 compressors; the second stage compresses the partial sums with 19 4-2 compressors and one 3-2 compressor; the third stage uses 20 3-2 compressors. The final addition stage uses the fourth adder to obtain the final 20-bit result. Although this embodiment takes the compression of nine 16-bit operands as an example, the number of operands to be compressed can be increased when the convolution kernel is enlarged, and this should not be construed as limiting the invention.
Fig. 5 shows the structure of the 3-2 compressor and the 4-2 compressor. The 3-2 compressor is simply a full adder, with the logical relation P1 + P2 + P3 = Sum + 2 × Cout. The 4-2 compressor is built from 3-2 compressors, and its logical relation is P1 + P2 + P3 + P4 + P5 = Sum + 2 × Carry + 2 × Cout, where P5 is the carry input. Both compressors are used in the approximate multiplier and in the Wallace tree.
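The two identities quoted above can be checked with a small behavioral model. The sketch below (gate-level details omitted) models the 3-2 compressor as a full adder and the 4-2 compressor as two cascaded full adders, which is one common way to build a 4-2 compressor; the internal structure of Fig. 5 may differ.

```python
# Behavioral model of the two compressors used in the multiplier and
# Wallace tree.  The 3-2 compressor is a full adder; the 4-2 compressor
# is modeled here as two cascaded full adders (one common construction).
from itertools import product

def compressor_3_2(p1, p2, p3):
    s = p1 ^ p2 ^ p3
    carry = (p1 & p2) | (p2 & p3) | (p1 & p3)
    return s, carry                       # p1 + p2 + p3 == s + 2*carry

def compressor_4_2(p1, p2, p3, p4, cin):
    s1, cout = compressor_3_2(p1, p2, p3)
    s, carry = compressor_3_2(s1, p4, cin)
    return s, carry, cout                 # p1+p2+p3+p4+cin == s + 2*(carry + cout)

# Exhaustive check of both identities
for bits in product((0, 1), repeat=3):
    s, c = compressor_3_2(*bits)
    assert sum(bits) == s + 2 * c
for bits in product((0, 1), repeat=5):
    s, carry, cout = compressor_4_2(*bits)
    assert sum(bits) == s + 2 * (carry + cout)
```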
In this embodiment the convolutional layer has several groups of input channels corresponding to several convolution kernels. The N input channel processing units output the convolution results of the N input channels, and the partial accumulation adding unit adds the convolution results of the different input channels. While the convolution results of the N input channels have not yet all been accumulated, the 32-bit intermediate result is stored in the intermediate result register; once the convolution results of a group of N input channels have all been accumulated, a 32-bit signed number is output and sent to the quantization activation unit. As shown in Fig. 1, the partial accumulation adding unit comprises a plurality of adders for adding the convolution results of different input channels; a 32-bit accumulator, i.e. the fifth adder, for accumulating the convolution results of the N input channels; and a data selector and an intermediate result register for judging whether the results of all channels have been accumulated. If so, the output data of the fifth adder are passed to the quantization activation unit; otherwise they are stored in the intermediate result register, which can be implemented with SRAM or registers.
As shown in Fig. 7, the quantization activation unit comprises a first adder, a multiplier, an arithmetic right shift unit, a function activation unit, a second adder and a clipping (amplitude limiting) unit; its input is the accumulated result of one point of the output feature map. The first adder performs the bias accumulation: its first input receives the accumulated convolution result of the N input channels, its second input receives the bias data, and its output is connected to the first input of the multiplier; in this embodiment the bias data are 32 bits wide. The multiplier in this embodiment is a 32-bit × 32-bit fixed-point multiplier whose second input receives the approximate multiplier value; both inputs are 32-bit signed numbers and the output is a 64-bit signed number. The right-shift amount of the arithmetic right shift unit and the approximate multiplier value applied to the second multiplier input are computed in advance by the quantization algorithm, which converts the floating-point scaling factors of the input feature map, the weights and the output feature map into fixed-point data, namely the approximate multiplier value and the right-shift amount. The right-shift amount is an 8-bit value and is usually greater than 31; the approximate multiplier value is a 32-bit signed number obtained by shifting a fraction between 0.5 and 1 left by 31 bits. After the multiplication and the shift, a 32-bit intermediate result is obtained; it passes through the function activation unit, which serves as the activation function at the algorithm level, and the second adder then adds the 8-bit zero-point data. The second adder is a 32-bit fixed-point adder used to accumulate the zero point computed in advance by the quantization algorithm; the zero-point data are likewise obtained in advance by the quantization algorithm and lie in the range [-128, 127]. After the final clipping unit the output data are limited to an 8-bit signed number with output range [-128, 127]. The function activation unit can use ReLU activation, with the activation expression:
$$f(x)=\begin{cases} x, & x > 0 \\ 0, & x \le 0 \end{cases}$$
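For reference, the whole quantization activation data path of Fig. 7 can be modeled behaviorally as below, in the style of TensorFlow-Lite fixed-point requantization. The bias, multiplier, shift and zero-point values are made-up examples, the shift here simply truncates (the actual rounding behavior of the circuit is not specified above), and the code is a sketch rather than the patented implementation.

```python
# Behavioral sketch of the quantization activation unit of Fig. 7:
# bias add -> multiply by a 32-bit fixed-point multiplier -> arithmetic
# right shift -> ReLU -> add output zero point -> clamp to int8.
# Multiplier/shift/zero-point values are made-up; the real values come
# from the TensorFlow quantization parameters as described in the text.

def quant_activate(acc, bias, multiplier_q31, right_shift, zero_point):
    x = acc + bias                            # first adder (32-bit bias accumulation)
    x = x * multiplier_q31                    # 32 x 32 -> 64-bit product
    x = x >> right_shift                      # arithmetic right shift (truncating here)
    x = max(x, 0)                             # ReLU activation
    x = x + zero_point                        # second adder (output zero point)
    return max(-128, min(127, x))             # clipping unit: signed 8-bit output

# Example: a scale of 0.6 represented as round(0.6 * 2**31), total shift 31 + 9
m = round(0.6 * 2 ** 31)
print(quant_activate(acc=123456, bias=789, multiplier_q31=m,
                     right_shift=31 + 9, zero_point=-5))
```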
In summary, compared with a conventional floating-point neural network acceleration circuit or an integer neural network accelerator calculation circuit, the approximate-computing-based fixed-point quantization convolutional neural network accelerator calculation circuit provided by the invention adds approximate multipliers and uses a Wallace tree, instead of an adder-tree structure, to compress the different dot products. It effectively reduces the memory space occupied by the weights and feature values, allows more data to be transmitted under the same bandwidth, improves throughput and energy efficiency, accelerates the computation of the fixed-point quantized convolutional neural network with only a small loss of prediction accuracy, and reduces area and power consumption.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to help the reader understand the principles of the invention, and the invention is not limited to the specifically described embodiments and examples. Those skilled in the art can make various other changes and combinations based on the teachings of the present invention without departing from its spirit, and such changes and combinations remain within the scope of the invention.

Claims (7)

1. A fixed point quantization convolution neural network accelerator calculation circuit is characterized by comprising N input channel processing units, a partial accumulation adding unit and a quantization activation unit, wherein N is a positive integer;
the input channel processing unit is used for processing input data of one input channel, and the input data of the input channel comprises a plurality of input feature maps and a plurality of weights; the input channel processing unit comprises an approximate multiplier array and a quick addition unit, wherein the approximate multiplier array is used for correspondingly multiplying a plurality of input feature maps and a plurality of weights in input data of one input channel; the fast addition unit is used for compressing and then adding partial product results output by the approximate multiplier array to obtain a convolution result of an input channel;
the partial accumulation adding unit is used for accumulating all convolution results of the N input channels output by the N input channel processing units respectively and outputting the result to the quantization activation unit;
the quantization activation unit includes:
a first adder, the first input end of which is connected with the convolution accumulation results of the N input channels output by the partial product accumulation unit, and the second input end of which is connected with the offset data;
a multiplier, a first multiplier input end of which is connected with the output end of the first adder, and a second multiplier input end of which is connected with an approximate multiplier;
the input end of the arithmetic right shift unit is connected with the output end of the multiplier;
the input end of the function activation unit is connected with the output end of the arithmetic right shift unit;
a second adder, a first input end of which is connected with the output end of the function activation unit, and a second input end of which is connected with zero data;
and the input end of the amplitude limiting unit is connected with the output end of the second adder and is used for limiting the output bit number of the second adder to a specified bit number, and the output end of the amplitude limiting unit is used as the output end of the convolutional neural network accelerator calculation circuit.
2. The fixed-point quantization convolutional neural network accelerator calculation circuit of claim 1, wherein the approximation multiplier array comprises a plurality of approximation multipliers, the number of approximation multipliers of the approximation multiplier array in each of the input channel processing units, the number of input feature maps in the input data of each of the input channels, and the number of weights are determined by convolution kernels of the convolutional neural network, M is a positive integer for M × M convolution kernels, M × M approximation multipliers are provided in each of the input channel processing units to constitute the approximation multiplier array, and M × M input feature map inputs and M × M weight inputs are provided for receiving input data of one input channel.
3. The fixed-point quantization convolutional neural network accelerator calculating circuit as claimed in claim 2, wherein the approximate multiplier generates partial products by using booth encoding, the partial products obtained after booth encoding are added to sign bit expansion to form a partial product array, the partial product array is divided into three parts according to the weight, wherein the part with the lowest weight is approximately compressed by using an or gate, the part with the highest weight is accurately compressed by using a 3-2 compressor and a 4-2 compressor, the remaining part performs or operation on the sign bit and the corresponding partial product and then performs approximate compression by using a 4-2 compressor, and the partial products after compression are added by using a third adder to obtain the partial product result output by the approximate multiplier array.
4. The convolutional neural network accelerator calculating circuit for fixed point quantization as claimed in claim 1, wherein the fast adding unit comprises a wallace tree and a fourth adder, the wallace tree performs sign bit expansion on the partial product results output by the approximate multiplier array and then performs three times of compression by a 4-2 compressor and a 3-2 compressor respectively, and the fourth adder adds the partial products subjected to the three times of compression by the wallace tree to obtain the convolution result of the input channel.
5. The fixed-point quantized convolutional neural network accelerator computation circuit of claim 1, wherein the partial accumulation addition unit comprises a fifth adder, a data selector, and an intermediate result register,
the first input end of the fifth adder is connected with convolution results of N input channels output by the N input channel processing units respectively, the second input end of the fifth adder is connected with the output end of the intermediate result register, and the output end of the fifth adder is connected with the input end of the data selector;
the data selector outputs the output data of the fifth adder to the intermediate result register while the fifth adder has not yet completed accumulating the convolution results of the N input channels, and outputs the output data of the fifth adder to the quantization activation unit after the fifth adder has completed accumulating the convolution results of the N input channels.
6. The fixed-point quantized convolutional neural network accelerator computation circuit of any of claims 1 to 5, wherein the convolutional neural network employs quantization of INT8 data type, and the quantization scheme selects the quantization specification of TensorFlow.
7. The convolutional neural network accelerator computing circuit for fixed-point quantization of claim 6, wherein the approximate multiplier value connected to the second multiplier input terminal of the multiplier, the right shift number of the arithmetic right shift unit, and the zero data connected to the second input terminal of the second adder are obtained according to the TensorFlow quantization algorithm, the function activation unit is activated by ReLU, and the activation expression is:
$$f(x)=\begin{cases} x, & x > 0 \\ 0, & x \le 0 \end{cases}$$
CN202010736970.3A 2020-07-28 2020-07-28 Fixed point quantization convolution neural network accelerator calculation circuit Pending CN111832719A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010736970.3A CN111832719A (en) 2020-07-28 2020-07-28 Fixed point quantization convolution neural network accelerator calculation circuit

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010736970.3A CN111832719A (en) 2020-07-28 2020-07-28 Fixed point quantization convolution neural network accelerator calculation circuit

Publications (1)

Publication Number Publication Date
CN111832719A true CN111832719A (en) 2020-10-27

Family

ID=72925737

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010736970.3A Pending CN111832719A (en) 2020-07-28 2020-07-28 Fixed point quantization convolution neural network accelerator calculation circuit

Country Status (1)

Country Link
CN (1) CN111832719A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112230884A (en) * 2020-12-17 2021-01-15 季华实验室 Target detection hardware accelerator and acceleration method
CN112434801A (en) * 2020-10-30 2021-03-02 西安交通大学 Convolution operation acceleration method for carrying out weight splitting according to bit precision
CN112734023A (en) * 2021-02-02 2021-04-30 中国科学院半导体研究所 Reconfigurable circuit applied to activation function of recurrent neural network
CN112766477A (en) * 2021-01-13 2021-05-07 天津智模科技有限公司 Neural network operation circuit
CN112965931A (en) * 2021-02-22 2021-06-15 北京微芯智通科技合伙企业(有限合伙) Digital integration processing method based on CNN cell neural network structure
CN113469327A (en) * 2021-06-24 2021-10-01 上海寒武纪信息科技有限公司 Integrated circuit device for executing advance of revolution
CN113554163A (en) * 2021-07-27 2021-10-26 深圳思谋信息科技有限公司 Convolutional neural network accelerator
CN114723031A (en) * 2022-05-06 2022-07-08 北京宽温微电子科技有限公司 Computing device
CN114819129A (en) * 2022-05-10 2022-07-29 福州大学 Convolution neural network hardware acceleration method of parallel computing unit
CN115879530A (en) * 2023-03-02 2023-03-31 湖北大学 Method for optimizing array structure of RRAM (resistive random access memory) memory computing system
CN115982529A (en) * 2022-12-14 2023-04-18 北京登临科技有限公司 Convolution operation structure, convolution operation array and related equipment
CN116048455A (en) * 2023-03-07 2023-05-02 南京航空航天大学 Insertion type approximate multiplication accumulator
CN116151340A (en) * 2022-12-26 2023-05-23 辉羲智能科技(上海)有限公司 Parallel random computing neural network system and hardware compression method and system thereof
CN116720563A (en) * 2022-09-19 2023-09-08 荣耀终端有限公司 Method and device for improving fixed-point neural network model precision and electronic equipment
CN117910421A (en) * 2024-03-15 2024-04-19 南京美辰微电子有限公司 Dynamic approximate circuit calculation deployment method and system based on neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109542393A (en) * 2018-11-19 2019-03-29 电子科技大学 A kind of approximation 4-2 compressor and approximate multiplier
US20190212981A1 (en) * 2018-01-09 2019-07-11 Samsung Electronics Co., Ltd. Neural network processing unit including approximate multiplier and system on chip including the same
CN110363279A (en) * 2018-03-26 2019-10-22 华为技术有限公司 Image processing method and device based on convolutional neural networks model
CN110705702A (en) * 2019-09-29 2020-01-17 东南大学 Dynamic extensible convolutional neural network accelerator
CN110780845A (en) * 2019-10-17 2020-02-11 浙江大学 Configurable approximate multiplier for quantization convolutional neural network and implementation method thereof

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190212981A1 (en) * 2018-01-09 2019-07-11 Samsung Electronics Co., Ltd. Neural network processing unit including approximate multiplier and system on chip including the same
CN110363279A (en) * 2018-03-26 2019-10-22 华为技术有限公司 Image processing method and device based on convolutional neural networks model
CN109542393A (en) * 2018-11-19 2019-03-29 电子科技大学 A kind of approximation 4-2 compressor and approximate multiplier
CN110705702A (en) * 2019-09-29 2020-01-17 东南大学 Dynamic extensible convolutional neural network accelerator
CN110780845A (en) * 2019-10-17 2020-02-11 浙江大学 Configurable approximate multiplier for quantization convolutional neural network and implementation method thereof

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BENOIT JACOB等: "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference", 《2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 *
CHULIANG GUO等: "A Reconfigurable Approximate Multiplier for Quantized CNN Applications", 《2020 ASP-DAC》 *
FASIH UD DIN FARRUKH等: "Power Efficient Tiny Yolo CNN Using Reduced Hardware Resources Based on Booth Multiplier and WALLACE Tree Adders", 《IEEE OPEN JOURNAL OF CIRCUITS AND SYSTEMS ( VOLUME: 1)》 *
朱智洋: "基于近似计算与数据调度的 CNN 加速", 《中国优秀硕士学位论文全文数据库》 *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112434801A (en) * 2020-10-30 2021-03-02 西安交通大学 Convolution operation acceleration method for carrying out weight splitting according to bit precision
CN112434801B (en) * 2020-10-30 2022-12-09 西安交通大学 Convolution operation acceleration method for carrying out weight splitting according to bit precision
CN112230884A (en) * 2020-12-17 2021-01-15 季华实验室 Target detection hardware accelerator and acceleration method
CN112766477A (en) * 2021-01-13 2021-05-07 天津智模科技有限公司 Neural network operation circuit
CN112766477B (en) * 2021-01-13 2023-05-30 天津智模科技有限公司 Neural network operation circuit
CN112734023A (en) * 2021-02-02 2021-04-30 中国科学院半导体研究所 Reconfigurable circuit applied to activation function of recurrent neural network
CN112734023B (en) * 2021-02-02 2023-10-13 中国科学院半导体研究所 Reconfigurable circuit applied to activation function of cyclic neural network
CN112965931A (en) * 2021-02-22 2021-06-15 北京微芯智通科技合伙企业(有限合伙) Digital integration processing method based on CNN cell neural network structure
CN113469327A (en) * 2021-06-24 2021-10-01 上海寒武纪信息科技有限公司 Integrated circuit device for executing advance of revolution
CN113469327B (en) * 2021-06-24 2024-04-05 上海寒武纪信息科技有限公司 Integrated circuit device for performing rotation number advance
CN113554163A (en) * 2021-07-27 2021-10-26 深圳思谋信息科技有限公司 Convolutional neural network accelerator
CN113554163B (en) * 2021-07-27 2024-03-29 深圳思谋信息科技有限公司 Convolutional neural network accelerator
CN114723031A (en) * 2022-05-06 2022-07-08 北京宽温微电子科技有限公司 Computing device
CN114723031B (en) * 2022-05-06 2023-10-20 苏州宽温电子科技有限公司 Computing device
CN114819129A (en) * 2022-05-10 2022-07-29 福州大学 Convolution neural network hardware acceleration method of parallel computing unit
CN116720563A (en) * 2022-09-19 2023-09-08 荣耀终端有限公司 Method and device for improving fixed-point neural network model precision and electronic equipment
CN116720563B (en) * 2022-09-19 2024-03-29 荣耀终端有限公司 Method and device for improving fixed-point neural network model precision and electronic equipment
CN115982529B (en) * 2022-12-14 2023-09-08 北京登临科技有限公司 Convolution operation structure, convolution operation array and related equipment
CN115982529A (en) * 2022-12-14 2023-04-18 北京登临科技有限公司 Convolution operation structure, convolution operation array and related equipment
CN116151340B (en) * 2022-12-26 2023-09-01 辉羲智能科技(上海)有限公司 Parallel random computing neural network system and hardware compression method and system thereof
CN116151340A (en) * 2022-12-26 2023-05-23 辉羲智能科技(上海)有限公司 Parallel random computing neural network system and hardware compression method and system thereof
CN115879530A (en) * 2023-03-02 2023-03-31 湖北大学 Method for optimizing array structure of RRAM (resistive random access memory) memory computing system
CN116048455B (en) * 2023-03-07 2023-06-02 南京航空航天大学 Insertion type approximate multiplication accumulator
CN116048455A (en) * 2023-03-07 2023-05-02 南京航空航天大学 Insertion type approximate multiplication accumulator
CN117910421A (en) * 2024-03-15 2024-04-19 南京美辰微电子有限公司 Dynamic approximate circuit calculation deployment method and system based on neural network

Similar Documents

Publication Publication Date Title
CN111832719A (en) Fixed point quantization convolution neural network accelerator calculation circuit
CN106909970B (en) Approximate calculation-based binary weight convolution neural network hardware accelerator calculation device
US20210349692A1 (en) Multiplier and multiplication method
US10872295B1 (en) Residual quantization of bit-shift weights in an artificial neural network
CN111488133B (en) High-radix approximate Booth coding method and mixed-radix Booth coding approximate multiplier
CN112434801B (en) Convolution operation acceleration method for carrying out weight splitting according to bit precision
CN113283587B (en) Winograd convolution operation acceleration method and acceleration module
CN114647399B (en) Low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device
CN116400883A (en) Floating point multiply-add device capable of switching precision
CN111008698B (en) Sparse matrix multiplication accelerator for hybrid compression cyclic neural networks
CN115982528A (en) Booth algorithm-based approximate precoding convolution operation method and system
CN110955403B (en) Approximate base-8 Booth encoder and approximate binary multiplier of mixed Booth encoding
CN113902109A (en) Compression method and device for regular bit serial computation of neural network
CN110825346B (en) Low logic complexity unsigned approximation multiplier
CN116205244B (en) Digital signal processing structure
CN110659014B (en) Multiplier and neural network computing platform
Yang et al. A low-power approximate multiply-add unit
Kumar et al. Complex multiplier: implementation using efficient algorithms for signal processing application
CN114065923A (en) Compression method, system and accelerating device of convolutional neural network
CN113360131A (en) Logarithm approximate multiplication accumulator for convolutional neural network accelerator
CN116151340B (en) Parallel random computing neural network system and hardware compression method and system thereof
CN116126283B (en) Resource occupancy rate optimization method of FPGA convolution accelerator
CN116402106B (en) Neural network acceleration method, neural network accelerator, chip and electronic equipment
Suzuki et al. ProgressiveNN: Achieving Computational Scalability with Dynamic Bit-Precision Adjustment by MSB-first Accumulative Computation
WO2023078364A1 (en) Operation method and apparatus for matrix multiplication

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201027