CN115145536A

CN115145536A - Adder tree unit with low bit width input and low bit width output and approximate multiply-add method

Info

Publication number: CN115145536A
Application number: CN202210759303.6A
Authority: CN
Inventors: 黄科杰; 王楚惠; 沈海斌; 范继聪; 徐彦峰
Original assignee: Zhejiang University ZJU; CETC 58 Research Institute
Current assignee: Zhejiang University ZJU; CETC 58 Research Institute
Priority date: 2022-06-29
Filing date: 2022-06-29
Publication date: 2022-10-04

Abstract

The invention discloses an adder tree unit with low bit width input and low bit width output and an approximate multiply-add method. The unit comprises an encoding and decoding circuit, an approximate adder tree circuit with low bit width input and low bit width output, an accumulation summation circuit and an error correction circuit; when a group of n low-bit-width data and weights is input, a multiply-add result of the same low bit width is finally output. The method applies radix-4 Booth encoding to the input weights and then performs a decoding operation; after reconstruction a partial product array of size n × m is obtained, where n is the number of rows and m is the number of columns. Each column of the partial product array is summed by the low-bit-width-input, low-bit-width-output approximate adder tree to obtain one low-bit-width output; the m low-bit-width outputs obtained are accumulated, the accumulated result is compensated, and the compensated result is the final output of the adder tree unit. The method can optimize the large number of multiply-add operations in convolution and can complete computing tasks with low power consumption and high speed.

Description

Adder tree unit with low bit width input and low bit width output and approximate multiply-add method

Technical Field

The invention belongs to the technical field of multiply-add computing hardware, and relates to an adder tree unit with low bit width input and low bit width output and an approximate multiply-add method.

Background

With the continuous development of modern computer technology, artificial intelligence (AI) has been widely used to perform specific tasks in fields such as transportation, education, medical treatment, security and finance. Current AI chips still have the following problems to be solved. First, the amount of data required by deep learning computation is huge, and memory bandwidth becomes the bottleneck of the whole system. Second, the large number of memory accesses and multiply-and-accumulate (MAC) array computations increases the overall power consumption of the AI chip. Third, deep learning requires a great deal of computing power, and with the rapid development of deep learning algorithms, new algorithms are not well supported by fixed-function accelerators. The best approach is therefore hardware acceleration, i.e. increasing the computing power of the AI chip. At present, AI chips are still at an early stage of development, and there is huge room for innovation in both scientific research and industrial application. Only when basic computing power reaches a certain level, and algorithms work together with big data, can artificial intelligence achieve higher-level breakthroughs.

At present, mainstream AI chips accelerate the dominant convolution operation of the Convolutional Neural Network (CNN) through Multiplication and Accumulation (MAC). Convolutional neural networks are widely used in applications such as computer vision, robot control, video analysis and voice recognition; although they deliver excellent results, they have a large number of parameters and high computational complexity. MAC operations dominate in CNNs, and in current low-bit neural network accelerators both input and output have low bit widths (typically below 10 bits), e.g. 6 bits or 8 bits. The traditional method of computing neurons exactly and then quantizing them is inefficient. A reconfigurable approximate adder tree unit oriented to low-bit neural network accelerators is therefore developed, which can improve computational efficiency and reduce the power consumption of hardware multiply-add operations without significantly reducing accuracy.

Disclosure of Invention

In view of the defects of the prior art, the invention provides an adder tree unit with low bit width input and low bit width output and an approximate multiply-add method, namely a reconfigurable approximate adder tree unit and method for low-bit neural network accelerators.

The technical scheme adopted by the invention is as follows:

an adder tree unit with low bit width input and low bit width output comprises an encoding and decoding circuit, an approximate adder tree circuit with low bit width input and low bit width output, an accumulation summation circuit and an error correction circuit;

the encoding and decoding circuit is used for performing radix-4 Booth encoding on the weights of the n input data and weights to generate m Booth-coded values per weight, and for decoding the n input data against the Booth-coded value at each position to generate a partial product array of size n × m, wherein n is the number of rows and m is the number of columns;

the approximate adder tree circuit is low-bit-width input-low-bit-width output and is used for selectively accumulating n partial product data of each column of the generated partial product array and ensuring that the output bit width is consistent with the input bit width to form m low-bit-width outputs;

the accumulation summation circuit is used for accumulating and summing all the low bit width outputs output by the approximate adder tree circuit;

the error correction circuit is used for compensating the output of the accumulation summation circuit, and the formed circuit output is the final output result of the adder tree unit.

In the above technical solution, further, the encoding and decoding circuit includes a radix-4 Booth encoding circuit and a decoding circuit; the radix-4 Booth encoding circuit encodes three consecutive weight bits {d_{2i+1}, d_{2i}, d_{2i-1}} using three signals, sign, ×1 and ×2, the sign signal indicating that the coded value is negative, the ×1 signal indicating that the magnitude of the coded value is 1, and the ×2 signal indicating that the magnitude of the coded value is 2; the decoding circuit takes the n input data and the m Booth-coded values and performs the corresponding decoding operation on the input data according to the input Booth-coded value signals: if the sign signal is asserted, the input data is inverted and incremented by one (two's-complement negation); if the ×1 signal is asserted, the input data value is unchanged; if the ×2 signal is asserted, the input data value is shifted left by one bit, i.e. doubled; when decoding is complete, an n × m partial product array is output.

Furthermore, the approximate adder tree circuit comprises a selector and an adder; the selector is used for configuring the input position of each layer of adders in the adder tree circuit, and for an input bit width of x, the selector determines whether the input of each layer of adders is the first x bits or the last x bits of the output of the previous layer, thereby also determining the scaling factor 2^y.

Furthermore, the error correction circuit adopts an error correction mode that the output result of the accumulation summation circuit is added with the correction value.

A method of approximate multiply-add of a low bit width input to a low bit width output, comprising:

step 1: carrying out radix-4 Booth encoding on a group of n low-bit-width data and weights, and then carrying out decoding operation to generate an n multiplied by m partial product array;

step 2: outputting each column of data of the reconstructed n × m partial product array in sequence to an approximate adder tree circuit with low bit width input-low bit width output, the bit width being x, wherein the approximate adder tree circuit comprises a selector and an adder, and the selector configures the input of each layer of adders in the adder tree circuit to be the first x bits or the last x bits of the output of the previous layer, thereby also configuring the scaling factor 2^y; in the approximate adder tree circuit, the partial product data of each column are added approximately, finally forming m accumulation results;

step 3: accumulating and summing the m accumulation results to form one multiply-accumulate result;

step 4: compensating the obtained multiply-accumulate result and outputting it, finally generating a low-bit-width multiply-add result, which is the result of the multiply-add of the group of data and weights.

In the above technical solution, further, in step 1, the radix-4 Booth encoding encodes three consecutive bits of the weight using three signals, sign, ×1 and ×2; during decoding, for the n input data and the m Booth-coded values, the corresponding decoding operation is performed on the input data according to the input Booth-coded value signals: if the sign signal is asserted, the input data is inverted and incremented by one (two's-complement negation); if the ×1 signal is asserted, the input data value is unchanged; if the ×2 signal is asserted, the input data value is shifted left by one bit, i.e. doubled.

Furthermore, in the approximate adder tree of step 2, when the selector selects the first x bits of the previous-layer output as the input of the current-layer adder, the lowest-bit carry input of the adder is the AND of the last bits of the two previous-layer adder outputs; when the selector selects the last x bits of the previous-layer output as the input of the current-layer adder, the lowest-bit carry input of the adder is 0.

Further, the method for performing compensation in step 4 comprises: and adding the multiplication accumulation result obtained in the last step with a correction value, wherein the correction value is y.

The invention has the advantages that:

by applying radix-4 Booth encoding to the weights, the invention halves the number of cycles in a single multiply-add calculation period and reduces the computation of the adder tree; through block approximation processing in the adder tree and error correction, the calculation error of the multiply-add operation is reduced; and by incorporating approximate computing, the power consumption and resources required for computation are further reduced and energy efficiency is improved. Specifically:

(1) Aiming at the multiply-add operation in the low-bit neural network, a multiply-add calculation method is provided, the multiply-add calculation method changes the traditional operation mode that data is multiplied and then added, and the flexibility of multiply-add operation is improved by rearranging partial products into a partial product array and then adding, thereby facilitating the introduction of approximate operation.

(2) The encoding and decoding circuit adopts a radix-4 Booth encoding mode, so that the number of partial products is obviously reduced, the complexity of operation can be effectively reduced, and the scale of the circuit can be reduced.

(3) The approximate adder tree circuit is 'low bit width input-low bit width output', namely the adder tree is combined with the quantization module, so that the combination of the adder tree and the quantization operation is realized, the quantization module in a low-bit neural network can be omitted, and the power consumption and the area of the circuit can be reduced.

(4) The selector configuration module configures the approximate operation of each layer of adders in the adder tree and configures the selected scaling factor 2^y. Configuring through the selector configuration module makes the precision adjustable, so the approximate operation can be configured according to the precision requirements of different application scenarios, achieving the best balance between precision and circuit hardware optimization.

(5) The accumulated sum is compensated before being output, which further reduces the calculation error and improves the calculation precision. The error correction adds a correction value to the accumulated sum, and the correction value is chosen as follows: it is set directly to the base-2 logarithm of the scaling factor 2^y, i.e. the correction value is y. The error correction circuit adds the accumulated sum and the correction value y and outputs the result.

Drawings

FIG. 1 is a flow chart of the approximate multiply-add method of the present invention

FIG. 2 is the general adder tree architecture

FIG. 3 is a data flow block diagram of the reconfigurable approximate adder tree unit of the present invention

FIG. 4 is a block approximation processing diagram

FIG. 5 is a design diagram of the lowest-order adder tree

FIG. 6 is a design diagram of the next-lowest-order adder tree

FIG. 7 is a design diagram of the next-highest-order adder tree

FIG. 8 is a design diagram of the highest-order adder tree

FIG. 9 is the accumulation summation diagram

FIG. 10 is the output diagram after error correction

Detailed Description

The technical solution of the present invention is further explained with reference to the accompanying drawings and specific embodiments.

The adder tree unit with low bit width input and low bit width output comprises a radix-4 Booth coding circuit, a decoding circuit, an approximate adder tree circuit with low bit width input and low bit width output, an accumulation summation circuit and an error correction circuit which are sequentially connected.

Inputting n data and weights to an encoding and decoding circuit, wherein the encoding and decoding circuit encodes the weights to generate m-bit encoding values, the size of m is determined by the number of bits of the weights, and decodes the input n data and the Booth encoding values of the weights of each bit to generate a partial product array, the size of the array is n multiplied by m, wherein n is the number of rows and m is the number of columns. Each column of data of the generated partial product matrix is to be sequentially output into the approximate adder tree circuit of "low bit width input-low bit width output". The approximate adder tree circuit accumulates the input n partial product data of each column, and the output low bit width accumulation result is accumulated and summed by the accumulation summing circuit and then input to the error correction circuit for compensation to form the final circuit output which is the final output result of the approximate adder tree unit. The working process is shown in figure 1.

The radix-4 Booth encoding circuit encodes three consecutive bits {d_{2i+1}, d_{2i}, d_{2i-1}} of the low-bit weight using three signals: sign, ×1 and ×2. The sign signal indicates that the coded value is negative, the ×1 signal indicates that the magnitude of the coded value is 1, and the ×2 signal indicates that the magnitude of the coded value is 2.

TABLE 1 accurate radix-4 Booth encoder

The decoding circuit takes as input the n data and the Booth-coded value at each weight position (m Booth-coded values in total) and performs the corresponding decoding operation on the input data according to the input Booth-coded value signals. If the sign signal is asserted, the input data is inverted and incremented by one (two's-complement negation); if the ×1 signal is asserted, the input data value is unchanged; if the ×2 signal is asserted, the input data value is shifted left by one bit, i.e. doubled. When decoding is complete, a partial product array of size n × m is output, where n is the number of rows and m is the number of columns. Rearranging the partial products into a partial product array before adding them makes the multiply-add operation more flexible and makes it easier to introduce approximate operations.
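The encoding and decoding steps can be illustrated with a small software model. This is a minimal sketch, not the patented circuit: the function names (`booth_radix4_digits`, `decode`, `partial_product_array`, `multiply_add`) are hypothetical, the digit rule is the standard radix-4 Booth recoding with d_{-1} = 0, and the ×2 operation is modelled as a one-bit left shift (doubling). It shows that shift-adding the column sums of the n × m array reproduces the exact multiply-add result before any approximation is applied.

```python
def booth_radix4_digits(weight: int, q: int) -> list[int]:
    """Radix-4 Booth digits (each in {-2,-1,0,1,2}) of a q-bit signed weight.

    Digit i is computed from the three bits {d_{2i+1}, d_{2i}, d_{2i-1}},
    with d_{-1} = 0. Digits are returned least significant first.
    """
    assert q % 2 == 0, "q must be even"
    assert -(1 << (q - 1)) <= weight < (1 << (q - 1))
    pattern = weight & ((1 << q) - 1)  # two's-complement bit pattern

    def bit(k: int) -> int:
        return (pattern >> k) & 1 if k >= 0 else 0

    return [-2 * bit(2 * i + 1) + bit(2 * i) + bit(2 * i - 1)
            for i in range(q // 2)]


def decode(data: int, digit: int) -> int:
    """Apply the sign / x1 / x2 signals of one Booth digit to one input."""
    sign, x1, x2 = digit < 0, abs(digit) == 1, abs(digit) == 2
    magnitude = data if x1 else (data << 1 if x2 else 0)  # x2 doubles the data
    return -magnitude if sign else magnitude  # hardware: invert and add one


def partial_product_array(data: list[int], weights: list[int], q: int):
    """n x m array: row r holds the m partial products of data[r]."""
    return [[decode(d, digit) for digit in booth_radix4_digits(w, q)]
            for d, w in zip(data, weights)]


def multiply_add(data: list[int], weights: list[int], q: int) -> int:
    """Exact result: sum each column, then shift-add (column c weighs 4**c)."""
    array = partial_product_array(data, weights, q)
    m = q // 2
    return sum((4 ** c) * sum(row[c] for row in array) for c in range(m))


print(booth_radix4_digits(7, 8))
print(multiply_add([3, -5, 7], [7, -3, 100], 8))
```

In this model the column sums are exact; the approximate adder tree replaces each exact column sum with a low-bit-width approximation.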

The "low bit width input-low bit width output" approximate adder tree circuit performs approximate summation of a plurality of signed numbers. In a general adder tree structure, every two inputs are added by an adder to give the outputs of that layer, those outputs are again added two by two, and so on, until the sum of all inputs is obtained, as shown in FIG. 2. The approximate adder tree of the present invention is further designed so that the input bit width of each layer of adders equals the input bit width of the previous layer, i.e. the approximate adder tree is "low bit width input-low bit width output". For an approximate adder tree with input bit width x, the input of each layer of adders is either the first (upper) x bits or the last (lower) x bits of the (x+1)-bit sum output by the previous layer. When the input is the first x bits of the previous-layer output, the lowest-bit carry input of the x-bit adder is the AND of the last bits of the two previous-layer adder outputs; when the input is the last x bits of the previous-layer output, the lowest-bit carry input of the x-bit adder is 0. Thus, when each input of the adder tree has bit width x, the output of the "low bit width input-low bit width output" approximate adder tree also has bit width x. That is, in the approximate adder tree circuit of the present invention, the first x bits or the last x bits of each layer's adder output can be selected, and the approximate operation of each layer of adders in the adder tree is configured by the selector configuration module.
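The selector behaviour described above can be modelled in a few lines. The following is a minimal sketch under stated assumptions: `approx_adder_tree`, `to_signed` and the `select_upper` list (one boolean per layer) are hypothetical names; taking the "first x bits" is modelled as dropping the least significant bit of the (x+1)-bit sum, and taking the "last x bits" as dropping its most significant bit (wrapping to x bits). The carry input of each adder is the AND of the bits dropped from its two operands, or 0 when the last x bits were selected.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def approx_adder_tree(values: list[int], x: int, select_upper: list[bool]):
    """Sum x-bit signed values with an x-bit-in / x-bit-out adder tree.

    select_upper[l] is True if the upper x bits of the (x+1)-bit sums of
    layer l are kept, False if the lower x bits are kept.
    Returns (x-bit output, number of dropped LSBs); the output times
    2**dropped approximates the exact sum when no overflow occurs.
    """
    # Each entry is (x-bit operand, LSB dropped when it was produced).
    level = [(v, 0) for v in values]
    for upper in select_upper:
        nxt = []
        for (a, lsb_a), (b, lsb_b) in zip(level[0::2], level[1::2]):
            s = a + b + (lsb_a & lsb_b)          # (x+1)-bit sum with carry-in
            if upper:
                nxt.append((s >> 1, s & 1))      # keep upper x bits
            else:
                nxt.append((to_signed(s, x), 0)) # keep lower x bits
        level = nxt
    assert len(level) == 1, "need exactly log2(n) layers"
    return level[0][0], sum(select_upper)


out, shift = approx_adder_tree([10, 20, 30, 40], 8, [True, True])
print(out, shift, out * 2 ** shift)
```

Keeping the upper bits at every layer avoids overflow at the cost of precision, while keeping the lower bits is exact as long as the running sum still fits in x bits; mixing the two per layer is what sets the overall scaling factor in this model.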

The inputs are sign-extended to obtain p-bit signed numbers; the weights are then sign-extended so that the extended bit width is even, giving q-bit signed numbers with q even. Suppose N₀ p-bit inputs are multiplied by q-bit weights and accumulated; in theory a $z = \mathrm{ceil}(\log_2 N_0 + (p + q))$-bit multiply-add result is output, where the ceil function represents ……. When the partial product array is generated by radix-4 Booth encoding and decoding of the input data and weights, the number of coded values of a q-bit weight is m = q/2 and the size of the partial product array is N₀ × m. The N₀ data in each column of partial products, summed by a general adder tree, give a $z_0 = \mathrm{ceil}(\log_2 N_0 + (p + 2))$-bit output, and the z-bit multiply-add result is obtained by shift-adding the m partial-product sums, each of bit width $z_0 = \mathrm{ceil}(\log_2 N_0 + (p + 2))$.
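As a worked check of these widths, the sketch below evaluates them for given N₀, p and q. It reads the width formulas as ceil(log₂ N₀) + (p + q) and ceil(log₂ N₀) + (p + 2); that reading is an assumption, chosen because it reproduces the 24-bit and 18-bit widths quoted for the 256-input, 8-bit embodiment. The function name `bit_widths` is hypothetical.

```python
def bit_widths(n0: int, p: int, q: int) -> tuple[int, int, int]:
    """Return (m, z0, z) for n0 products of p-bit inputs and q-bit weights.

    m  : number of radix-4 Booth-coded values per weight (q must be even)
    z0 : width of one exact column sum of the partial product array
    z  : width of the exact multiply-add result
    """
    assert q % 2 == 0, "q must be even"
    log2_n0 = (n0 - 1).bit_length()  # integer ceil(log2(n0)) for n0 >= 1
    m = q // 2
    z0 = log2_n0 + p + 2
    z = log2_n0 + p + q
    return m, z0, z


print(bit_widths(256, 8, 8))
```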

Because the whole unit is low-bit-width input, low-bit-width output, the final output is x bits wide and corresponds to an x-bit-wide portion of the z-bit result. A z-bit binary signed multiply-add result $a_{z-1}a_{z-2}\ldots a_1a_0$ converts to the decimal number

$$A = -a_{z-1}\cdot 2^{z-1} + a_{z-2}\cdot 2^{z-2} + \ldots + a_1\cdot 2^{1} + a_0\cdot 2^{0}.$$

The essence of low-bit-width output is that the multiply-add result is scaled before being output, so that the product of the output and the scaling factor approximates the original multiply-add result. Let the scaling factor be $2^b$; then A can be written as

$$A = \left(-a_{z-1}\cdot 2^{z-1-b} + a_{z-2}\cdot 2^{z-2-b} + \ldots + a_b\cdot 2^{b-b} + \ldots + a_1\cdot 2^{1-b} + a_0\cdot 2^{0-b}\right)\cdot 2^{b}.$$

Ignoring all terms after $a_b\cdot 2^{b-b}$, the binary signed number $a_{z-1}a_{z-2}\ldots a_b$ is output; converted to a decimal number, this scaled output is

$$B = -a_{z-1}\cdot 2^{z-1-b} + a_{z-2}\cdot 2^{z-2-b} + \ldots + a_b\cdot 2^{b-b},$$

and its product with the scaling factor is

$$C = B\cdot 2^{b} = -a_{z-1}\cdot 2^{z-1} + a_{z-2}\cdot 2^{z-2} + \ldots + a_b\cdot 2^{b},$$

so C is an approximation of A. Because the leading bits of a z-bit binary signed multiply-add result are likely to be sign bits, outputting only x bits from the middle of the z bits further reduces circuit resources and energy consumption without changing the approximate value obtained. For example, for the 10-bit binary signed number 1111010011 with scaling factor $2^2$, the scaled result 11110100 and the intermediate-bit value 10100 give the same product with the scaling factor $2^2$. If the selected scaling factor is 2^y, then bits [x+y-1:y] of the result bits [z-1:0] are the final output of the adder tree unit, and this output multiplied by the scaling factor 2^y approximates the multiply-add result of the N₀ p-bit inputs and q-bit weights.
Since the partial-product sum of each accumulation step carries a different weight, the bits [x+y-1:y] described above correspond to different positions within the result [z₀-1:0] of each partial-product summation: bits [x+y-1:y] for the first summation, bits [x+y-3:y-2] for the second, …, and bits [x+y-2m+1:y-2m+2] for the m-th, i.e. last, summation. Each partial-product approximate adder tree computation therefore uses a different approximation, which is the block approximation processing. The selector configuration module in the low-bit-width-input, low-bit-width-output approximate adder tree configures the approximate operation of each layer of adders in the adder tree and thereby, in effect, makes the selected output position [x+y-1:y] of the adder tree adjustable; that is, the selected scaling factor 2^y is adjustable. This makes the approximate adder tree unit more flexible and allows the best balance between power consumption and precision for different application scenarios.
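The numerical example in the text (the 10-bit value 1111010011 with scaling factor 2²) can be checked directly. The sketch below is illustrative only; `to_signed` and `take_bits` are hypothetical helper names. It extracts a bit range from a two's-complement value, reinterprets it as a signed number, and confirms that the 8-bit scaled result and the 5-bit middle slice give the same approximation of the original value.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def take_bits(value: int, hi: int, lo: int) -> int:
    """Bits [hi:lo] of a signed value, reinterpreted as a signed number."""
    return to_signed(value >> lo, hi - lo + 1)


a = to_signed(0b1111010011, 10)  # exact 10-bit signed value A
b = 2                            # scaling factor 2**b
scaled = take_bits(a, 9, b)      # 11110100: all bits from a_b upward
middle = take_bits(a, 6, b)      # 10100: leading sign bits omitted

print(a, scaled, middle)
print(scaled * 2 ** b, middle * 2 ** b)  # both are the approximation C
```

Here A = -45 and both slices evaluate to -12, so C = -48 in either case; the narrower slice is valid only because the omitted leading bits are copies of the sign bit.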

And the accumulation summation circuit is used for carrying out accumulation calculation on the calculation result of the approximate adder tree.

The error correction circuit compensates the result of the accumulation summation circuit before outputting it, which reduces the calculation error. The error correction adds a correction value to the accumulated sum, and the correction value is chosen as follows: it is set directly to the base-2 logarithm of the scaling factor 2^y, i.e. the correction value is y. The error correction circuit adds the accumulated sum and the correction value y and outputs the result.

In the present invention, the encoding and decoding circuit preferably encodes the input weights by radix-4 Booth encoding, and the encoded weights are used to generate the partial products; taking a weight of bit width 8 as an example, the result is a four-column partial product array. Booth encoding reduces the scale of the decoding circuit: for a weight of bit width 8, only four columns of partial products are generated after encoding, which greatly reduces the number of partial products and the required hardware scale.

Preferably, the approximate adder tree circuit comprises a selector and an adder, the approximate adder tree circuit configures each time the approximate processing operation inside the adder tree circuit is performed through the selector module, and the adder adopts a multiplexing mode. The approximate adder tree is used for accumulating n data of each column of the partial product array which are input in sequence, and the addition part can be executed in parallel by utilizing the tree-shaped addition structure, so that the speed of the circuit is effectively improved.

The first layer of the tree structure in the approximate adder tree adds the n partial products pairwise and outputs the results to the second layer; the second layer adds the n/2 partial sums output by the first layer pairwise and outputs the results to the third layer; and so on until all partial products have been added, so that each column of the partial product array is summed into one final partial-product sum. Each layer of the tree structure in the adder tree accumulation circuit consists of adders, and the adders are multiplexed to reduce the hardware scale. The approximate adder tree is further designed so that the input bit width of each layer of adders equals the input bit width of the previous layer, i.e. the approximate adder tree is "low bit width input-low bit width output". For an approximate adder tree with input bit width x, the input of each layer of adders is the first x bits or the last x bits of the (x+1)-bit sum output by the previous layer. When the input is the first x bits of the previous-layer output, the lowest-bit carry input of the x-bit adder is the AND of the last bits output by the two previous-layer adders; when the input is the last x bits of the previous-layer output, the lowest-bit carry input of the x-bit adder is 0. Thus, when each input of the adder tree has bit width x, the output of the "low bit width input-low bit width output" approximate adder tree also has bit width x. In this way the hardware scale can be further reduced.
The adder tree circuit configures the approximate operation of each layer of adders through the selector and thereby, in effect, makes the selected output position [x+y-1:y] of the adder tree adjustable; that is, the selected scaling factor 2^y can be configured by the selector configuration module. This approximate addition scheme effectively reduces the overall power consumption and area of the circuit while maintaining a certain precision. Through the selector configuration module, each layer of adders in each adder tree can be set to a different approximate operation, which configures the selected scaling factor 2^y and makes the precision adjustable, so that the approximate adder tree unit is more flexible and the best balance between power consumption and precision can be achieved for different application scenarios.

Take as an example the multiply-accumulate of 256 inputs and weights, each 8 bits wide after sign extension. The selected scaling factor is 2^y = 2^11, i.e. y = 11. Here $z = \mathrm{ceil}(\log_2 N_0 + (p + q)) = 24$, so the multiply-add result is 24 bits wide, and bits [18:11] of its bits [23:0] are the final output of the adder tree unit. The data flow of the reconfigurable approximate adder tree unit is shown in FIG. 3.

According to the method of the present invention, in this embodiment the 8-bit signed weight is radix-4 Booth encoded, i.e. the 8-bit weight is encoded into 4 Booth-coded values (each coded value is 0, ±1 or ±2). After encoding, the 256 data and the Booth-coded value at each weight position (4 Booth-coded values in total) are input, and the corresponding decoding operation is performed on the input data according to the input Booth-coded value signals. If the sign signal is asserted, the input data is inverted and incremented by one (two's-complement negation); if the ×1 signal is asserted, the input data value is unchanged; if the ×2 signal is asserted, the input data value is shifted left by one bit, i.e. doubled. After decoding, the partial product array is output; its size is 256 × 4, where 256 is the number of rows and 4 is the number of columns. Decoding computes and outputs the product of the Booth-coded value and the input data, and in this embodiment each product is a 10-bit signed number. Without Booth encoding, rearranging the partial products into a partial product array and adding them would take 8 cycles; with radix-4 Booth encoding, the number of cycles in a single calculation period is halved, from 8 cycles to 4, which reduces the computation of the adder tree.

In this embodiment, the data of each column of the partial product array (256 data in total) are summed by the designed "low bit width input-low bit width output" approximate adder tree. This is an 8-bit-input, 8-bit-output approximate adder tree, i.e. an 8-bit signed result is obtained after its computation. Because the scaling factor of each adder tree computation is different, block approximation processing is used. Bits [18:11] of the multiply-add result [23:0] correspond, within the $z_0 = \mathrm{ceil}(\log_2 N_0 + (p + 2)) = 18$-bit result of each adder tree, to bits [17:11] of the lowest-order adder tree result [17:0], bits [16:9] of the next-lowest adder tree result [17:0], bits [14:7] of the next-highest adder tree result [17:0], and bits [12:5] of the highest-order adder tree result [17:0]. Because the lowest-order adder tree result contributes only bits [17:11], its sign bit must be extended so that the 7-bit binary signed number becomes an 8-bit binary signed number before it is input to the accumulation summation circuit. The approximate operation of each layer of adders inside the adder tree can be configured by the selector configuration module. As shown in FIG. 4, the four adder tree summations use different approximations: for each computation, the choice of the first 8 bits or the last 8 bits of the previous-layer output as the input of each layer of adders is configured differently. The designs of the four adder trees, from lowest order to highest order, are shown in FIG. 5, FIG. 6, FIG. 7 and FIG. 8. Through this approximation, the area and power consumption of the adder tree can be reduced while a certain precision is maintained.
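The bit windows used by the four adder tree computations follow from the two-bit weight step between successive Booth-coded values. The sketch below is an illustration under an assumption: it takes the window for the k-th computation (k = 0 for the lowest order) to be bits [x+y-1-2k : y-2k] of the z₀-bit column sum, clipped to the top bit z₀-1. That reading matches the upper bounds 17, 16, 14 and 12 given for this embodiment, with the lower bounds inferred from the 8-bit output width. The function name `block_windows` is hypothetical.

```python
def block_windows(x: int, y: int, m: int, z0: int) -> list[tuple[int, int]]:
    """(hi, lo) bit window of each of the m column sums, lowest order first.

    Column k carries weight 4**k, so the output window [x+y-1 : y] of the
    full result moves down by 2 bits per column. A window that would extend
    past the top bit z0-1 is clipped and needs sign extension to x bits.
    """
    windows = []
    for k in range(m):
        hi = min(x + y - 1 - 2 * k, z0 - 1)
        lo = y - 2 * k
        windows.append((hi, lo))
    return windows


# 8-bit output, scaling factor 2**11, 4 Booth-coded values, 18-bit column sums
for k, (hi, lo) in enumerate(block_windows(8, 11, 4, 18)):
    print(f"adder tree {k}: bits [{hi}:{lo}] ({hi - lo + 1} bits)")
```

Only the lowest-order window is clipped to 7 bits, which is why that result alone is sign-extended to 8 bits before accumulation.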

In this embodiment, the weight has four Booth-coded values, so the results of the four approximate adder tree computations must be accumulated; as shown in FIG. 9, the 8-bit outputs of the four adder tree computations are summed. Finally, error correction is applied to the 10-bit result of the accumulation, and the low-order 8-bit result is output, as shown in FIG. 10. The error correction adds a correction value to the accumulated sum, and the correction value is chosen as follows: it is set directly to the base-2 logarithm of the scaling factor 2^y, i.e. the correction value is y = 11. After adding the accumulated sum and the correction value 11, the error correction circuit outputs the low-order 8-bit result, which is the output of the reconfigurable approximate adder tree unit.

……

In deep neural network applications, quantization is a common method for reducing model size: weight values and activation values originally represented with high bit widths are represented with lower bit widths. When the hardware platform is limited, for example by low computing power, memory, or power budget, model inference is slow and power consumption is high. Fixed-point instructions can process more data per unit time than floating-point instructions, and a quantized model also requires less storage space. Deploying the quantized model on an efficient customized computing platform can therefore achieve faster inference.

One quantization method uses 8-bit or 16-bit integers instead of floating-point numbers, replacing floating-point dot products with fixed-point dot products and thereby greatly reducing the computational overhead of the neural network on devices without hardware floating-point support. The method has an even clearer advantage on hardware that supports single instruction, multiple data (SIMD) execution: for example, with a 128-bit SSE register, a single instruction can operate simultaneously on 4 32-bit single-precision floating-point numbers, 8 16-bit integers, or 16 8-bit integers. Under SIMD, 8-bit integer operations therefore process four times as many elements per instruction as single-precision floating-point operations. In addition, the method reduces the memory footprint and the storage size of the model.

Two model quantization approaches are commonly used in industrial deployment: post-training quantization and quantization during training. Post-training quantization quantizes the weights and activation values after training of the whole (floating-point) model has finished. This embodiment is well suited to post-training quantization in Caffe. With the method that replaces floating-point numbers by 8-bit integers, the floating-point weights and activation values are converted to int8 data, the calculation is performed by the 8-bit input-8-bit output approximate adder tree, and the 8-bit multiply-add result is output and converted back to floating point. Precision is not significantly reduced and remains close to that of the floating-point model, while hardware resources and energy consumption are significantly reduced. The low bit width input-low bit width output adder tree unit is therefore well suited to deep neural network models: it improves computational efficiency and reduces the power consumption of hardware multiply-add operations without significantly reducing precision.
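The int8 flow described above (convert to int8, multiply-add in integers, convert back to floating point) can be sketched as follows. This is only an illustration under assumptions not stated in the source: a symmetric per-tensor scale is used, the function names are hypothetical, and an exact integer sum stands in for the approximate adder tree, whose output in this embodiment is further reduced to 8 bits.

```python
def quantize(values, scale):
    """Map floats to int8 with a symmetric scale, so value ~= q * scale."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def int8_dot(activations, weights, act_scale, w_scale):
    """Quantize both operands, multiply-add in integers, then dequantize."""
    qa = quantize(activations, act_scale)
    qw = quantize(weights, w_scale)
    acc = sum(a * w for a, w in zip(qa, qw))  # integer multiply-add
    return acc * act_scale * w_scale          # back to floating point
```

When the scales are powers of two and the inputs are exactly representable, the dequantized result matches the floating-point dot product; otherwise it differs by the quantization error.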

Claims

1. An adder tree unit with low bit width input and low bit width output is characterized by comprising an encoding and decoding circuit, an approximate adder tree circuit with low bit width input and low bit width output, an accumulation summation circuit and an error correction circuit;

2. The low bit width input-low bit width output adder tree unit according to claim 1, wherein said encoding and decoding circuit comprises a radix-4 Booth encoding circuit and a decoding circuit;

the radix-4 Booth encoding circuit encodes three consecutive bits {d₂ᵢ₊₁, d₂ᵢ, d₂ᵢ₋₁} of the weight using three signals, namely sign, ×1 and ×2, the sign signal indicating that the encoded value is a negative number, the ×1 signal indicating that the absolute value of the encoded value is 1, and the ×2 signal indicating that the absolute value of the encoded value is 2;

the decoding circuit is configured to perform, for n input data and m-bit Booth encoded values, the corresponding decoding operation on the input data according to the input Booth encoded value signals: if the sign signal is input, the input data is inverted and incremented by one (two's complement negation); if the ×1 signal is input, the input data value is unchanged; if the ×2 signal is input, the input data value is shifted right by one bit; after decoding is completed, the n × m partial product array is output.

3. The low bit width input-low bit width output adder tree unit according to claim 1, wherein said approximate adder tree circuit comprises a selector and an adder, the selector is configured to configure the input position of each layer of adders in the adder tree circuit, and for an input bit width of x, the selector determines whether the input of each layer of adders is the first x bits or the last x bits of the output result of the previous layer, thereby also determining the scaling factor 2^y.

4. A low bit width input-low bit width output adder tree unit according to claim 1, wherein said error correction circuit performs error correction by adding the result of the output of said summation circuit to the correction value.

5. An approximate multiply-add method for low bit width input-low bit width output, comprising:

step 2: each column of data of the reconstructed n × m partial product array is sequentially output to an approximate adder tree circuit with low bit width input-low bit width output, the bit width being x; the approximate adder tree circuit comprises a selector and an adder, the input of each layer of adders in the adder tree circuit is configured by the selector to be the first x bits or the last x bits of the output result of the previous layer, and a scaling factor 2^y is thereby configured; in the approximate adder tree circuit, each row of partial product data is approximately added, finally forming m accumulation results;

6. The method according to claim 5, wherein the radix-4 Booth encoding in step 1 encodes three consecutive bits of the weight using three signals, namely sign, ×1 and ×2, and when decoding, for the n input data and the m-bit Booth encoded values, the corresponding decoding operation is performed on the input data according to the input Booth encoded value signals: if the sign signal is input, the input data is inverted and incremented by one (two's complement negation); if the ×1 signal is input, the input data value is unchanged; if the ×2 signal is input, the input data value is shifted right by one bit.

7. The method according to claim 5, wherein in step 2, when the selector selects the first x bits of the output of the previous layer as the input of the adder in the current layer, the least significant bit input of the adder is the last and the next result of the two output results of the adder in the previous layer, and when the selector selects the last x bits of the output of the previous layer as the input of the adder in the current layer, the least significant bit input of the adder is 0.

8. The method according to claim 5, wherein the compensation in step 4 is performed by adding the multiply-accumulate result obtained in the previous step to a correction value, wherein the correction value is y.
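For illustration only, the radix-4 Booth encoding and decoding recited in claims 2 and 6 can be sketched in Python using the standard radix-4 Booth recoding, in which each digit equals -2·d₂ᵢ₊₁ + d₂ᵢ + d₂ᵢ₋₁ with d₋₁ = 0. The function names are hypothetical, an 8-bit two's-complement weight is assumed, and the decoding is modelled arithmetically as multiplication of the data by the signed digit rather than as the bit-level operations of the decoding circuit.

```python
def booth_radix4_digits(weight, bits=8):
    """Recode a two's-complement weight into bits//2 digits in {-2..2}.

    Digits are returned least significant first, so that
    weight == sum(d * 4**i for i, d in enumerate(digits)).
    """
    w = weight & ((1 << bits) - 1)
    digits, prev = [], 0                  # prev holds d_{2i-1}, d_{-1} = 0
    for i in range(bits // 2):
        d0 = (w >> (2 * i)) & 1           # d_{2i}
        d1 = (w >> (2 * i + 1)) & 1       # d_{2i+1}
        digits.append(-2 * d1 + d0 + prev)
        prev = d1
    return digits


def booth_signals(digit):
    """Return the (sign, x1, x2) signals for one Booth digit."""
    return digit < 0, abs(digit) == 1, abs(digit) == 2


def partial_product_array(data, weights, bits=8):
    """Row k holds data[k] times each Booth digit of weights[k] (n x m)."""
    return [[x * d for d in booth_radix4_digits(w, bits)]
            for x, w in zip(data, weights)]
```

Summing each column of the resulting array and weighting column i by 4^i reproduces the exact dot product of the data and the weights, which is the quantity the approximate adder trees and the accumulation summation circuit estimate.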