CN117170623A - Multi-bit wide reconstruction approximate tensor multiplication and addition method and system for neural network calculation

Info

Publication number
CN117170623A
CN117170623A
Authority
CN
China
Prior art keywords
data
bit
tensor
calculation
approximate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311453997.1A
Other languages
Chinese (zh)
Other versions
CN117170623B (en)
Inventor
张浩
汪粲星
谢钠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Magnichip Microelectronics Co ltd
Original Assignee
Nanjing Magnichip Microelectronics Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Magnichip Microelectronics Co ltd filed Critical Nanjing Magnichip Microelectronics Co ltd
Priority to CN202311453997.1A priority Critical patent/CN117170623B/en
Publication of CN117170623A publication Critical patent/CN117170623A/en
Application granted granted Critical
Publication of CN117170623B publication Critical patent/CN117170623B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a multi-bit-width reconfigurable approximate tensor multiply-add method and system for neural network computation that support multiplication operations with bit widths of 8×8, 8×4 and 4×4. The approximate tensor multiply-add unit comprises sub-tensor multiplication units and an accumulation circuit. Each sub-tensor multiplication unit comprises a sign extension module, a partial product generation module and a reconfigurable adder tree module. The sign extension module extends 4-bit data to 5-bit data and, in the approximate calculation mode, computes the truncated data and compensation shift values of 8-bit data; the reconfigurable adder tree module accumulates the results of the partial product generation module. The accumulation circuit computes the sum of the results of the corresponding sub-tensor multiplication units as the final result. By changing the bit width and structure of the multipliers and adders, the invention realizes different degrees of approximation, significantly reduces the hardware overhead of tensor multiply-add computation, and maintains high accuracy and flexibility.

Description

Multi-bit wide reconstruction approximate tensor multiplication and addition method and system for neural network calculation
Technical Field
The invention belongs to the field of reconfigurable computing, and particularly relates to a multi-bit-width reconfigurable approximate tensor multiply-add method and system for neural network computation.
Background
Deep neural networks have achieved state-of-the-art results in a range of machine learning applications, from image classification to speech recognition. There is a growing need to run such brain-inspired networks on mobile devices and edge Internet-of-Things sensors with limited hardware and power budgets, which calls for small-area, energy-efficient neural network accelerator designs.
Neural network computation refers to the process of performing various computing tasks with a neural network model; it offers parallelism, distribution, adaptivity, fault tolerance and other advantages, and is widely applied in many fields such as computer vision, radar image processing, speech recognition and image generation.
Approximate computing is a technique that reduces computational complexity and energy consumption at the cost of some accuracy. Quantization is one approximate-computing technique that improves data throughput and overall energy efficiency by reducing bit width. Neural network computation involves a large number of multiply-add operations, which consume considerable area resources and incur large power overheads. Moreover, different neural network applications, or different layers of the same network, have different error tolerances; a compute unit with a single bit width and mode therefore wastes energy efficiency, and the bit width and precision of the compute unit need to be adjusted dynamically according to the application requirements.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: neural network computation in the prior art involves a large number of multiply-add operations that consume substantial area resources and incur large power overheads. The invention therefore provides a multi-bit-width reconfigurable approximate tensor multiply-add method and system for neural network computation. Accuracy and efficiency can be balanced by adopting different bit widths, different degrees of approximation can be realized by changing the bit width and structure of the multipliers and adders, and the power and area overheads of tensor multiply-add computation are significantly reduced while high accuracy and flexibility are maintained.
In order to solve the above technical problem, the invention provides the following technical scheme: a multi-bit-width reconfigurable approximate tensor multiply-add method for neural network computation, supporting multiplication operations with bit widths of 8×8, 8×4 and 4×4. For the input data, the following steps S1 to S3 are executed to obtain the calculation result of each sub-tensor multiplication unit, which is then output to an accumulation circuit and accumulated to obtain the final calculation result of the tensor multiply-add unit.
S1, perform exact calculation or approximate calculation on the input 8-bit and 4-bit data to obtain 5-bit data, wherein the exact calculation for 8-bit data is: split the 8-bit data into two 4-bit parts and extend each part to 5 bits; the approximate calculation for 8-bit data is: truncate the 8-bit data to 5-bit data and calculate a compensation shift value; the exact and approximate calculations for 4-bit data are: extend the 4-bit data with a sign bit at the most significant position to obtain 5-bit data; then execute step S2.
S2, perform a 5×5 signed multiplication on the data extended to 5 bits to obtain a partial product result; for the data truncated to 5 bits, sum the calculated compensation shift values to obtain a total compensation shift value N, perform the 5×5 signed multiplication and shift the result left by N bits to obtain the partial product result; then execute step S3.
S3, add and accumulate the partial product results of step S2 stage by stage, extending the bit width by 1 bit at each stage, to obtain the calculation result of the sub-tensor multiplication unit.
Further, in the foregoing step S1, the exact calculation for 8-bit data is: the 8-bit data is split into a high 4-bit part and a low 4-bit part; the high 4-bit part is extended at the most significant position with the sign bit, and the low 4-bit part is extended at the most significant position with 0, yielding 5-bit data.
Further, in the foregoing step S1, the approximate calculation for 8-bit data comprises: traverse from bit 7 to bit 0 and find the first bit that differs from the preceding bit, its position being denoted n; the intercepted data is A[n+1:n-3], where A is the 8-bit data to be approximated and A[x:y] denotes the bit field from bit x down to bit y; when n=7, the intercepted data is A[7:3]; when n is less than or equal to 4, the intercepted data is A[4:0]. The compensation shift value is calculated as follows: when n is greater than or equal to 3, the compensation shift value is n-3; when n<3, the compensation shift value is 0.
Further, step S1 also comprises: when the approximate calculation is performed on 8-bit data, if the 8-bit data are an activation value and a weight in the neural network, pre-coding of the weight is performed first: the [6:2] bits of the pre-coded weight are taken as the 5-bit truncated value of the weight, and the decimal number represented by the [1:0] bits of the pre-coded weight is taken as the weight compensation shift value; the 5-bit truncated value and the compensation shift value of the activation value are then calculated as described above.
Further, in step S2, calculating the total compensation shift value N further comprises: adding the compensation shift value of the activation value and the compensation shift value of the weight to obtain the total compensation shift value.
Further, in the foregoing step S2, the 5×5 signed multiplications of the 5-bit data are performed by 16 5×5 multipliers.
The present invention also provides a multi-bit-width reconfigurable approximate tensor multiply-add system for neural network computation, comprising: at least one sub-tensor multiplication unit and an accumulation circuit correspondingly connected to the sub-tensor multiplication units, wherein the sub-tensor multiplication units support multiplication operations with bit widths of 8×8, 8×4 and 4×4, and the accumulation circuit accumulates and sums the results of the sub-tensor multiplication units, the sum being the final calculation result of the tensor multiply-add unit;
each sub-tensor multiplication unit comprises a sign extension module, a partial product generation module and a reconfigurable adder tree module;
the sign extension module performs exact or approximate calculation on the input 8-bit and 4-bit data to obtain 5-bit data, wherein the exact calculation for 8-bit data is: splitting the 8-bit data into two 4-bit parts and extending each part to 5 bits; the approximate calculation for 8-bit data is: truncating the 8-bit data to 5-bit data and calculating a compensation shift value; the exact and approximate calculations for 4-bit data are: extending the 4-bit data with a sign bit at the most significant position to obtain 5-bit data; the 5-bit data are then output to the partial product generation module;
the partial product generation module performs a 5×5 signed multiplication on the data extended to 5 bits to obtain a partial product result; for the data truncated to 5 bits, it sums the calculated compensation shift values to obtain a total compensation shift value N, performs the 5×5 signed multiplication and shifts the result left by N bits to obtain the partial product result, and outputs the partial product results to the reconfigurable adder tree module;
the reconfigurable adder tree module adds and accumulates the partial product results output by the partial product generation module stage by stage, extending the bit width by 1 bit at each stage, to obtain the calculation result of the sub-tensor multiplication unit.
Further, the number of the aforementioned sub-tensor multiplication units is 4.
Further, the reconfigurable adder tree module includes a 4-stage adder for accumulating the result output from the sub-tensor multiplication unit.
Further, the aforementioned partial product generation module includes 16 5×5 signed multipliers and a shift compensation circuit.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
(1) The tensor multiply-add unit supports multiplications of three bit widths and realizes a more flexible configuration with the same hardware. Because different layers of a neural network tolerate errors to different degrees, the multiplication bit width can be configured dynamically to minimize hardware overhead. For example, in a layer that is highly sensitive to errors, 8×8-bit multiplications are performed with 8-bit data; in a layer with low sensitivity to errors, 8×4- or 4×4-bit multiplications are performed with 4-bit data, improving computational energy efficiency while maintaining network accuracy.
(2) The proposed tensor multiply-add unit supports both an exact calculation mode and an approximate calculation mode. In the approximate mode, multiplications involving 8-bit data are converted into 5-bit multiplications by shifting, reducing circuit delay and power consumption; owing to the data distribution of neural networks, this conversion does not excessively affect overall network accuracy. The lower bits of the reconfigurable adder tree are also dynamically configured as OR gates in the approximate mode to replace full adders, further reducing the hardware overhead of the circuit.
(3) The weights used by the tensor multiply-add unit are fixed once network training is finished, so they can be pre-coded when 8×8-bit multiplications are calculated in the approximate mode, skipping the calculation of the shift compensation value and the 5-bit truncated data in the sign extension module and further improving computational energy efficiency.
Drawings
FIG. 1 is an overall schematic of an embodiment of the present invention.
Fig. 2 is a schematic diagram of a weight distribution of a neural network.
Fig. 3 is a schematic diagram of an expansion method and a calculation method of different bit width data in an embodiment of the invention.
FIG. 4 is a schematic diagram of an embodiment of the present invention calculating an 8×8bit wide multiplication in an approximate calculation mode.
FIG. 5 is a schematic diagram of the structure of a single 16bits adder in a reconfigurable adder tree in an embodiment of the invention.
Description of the embodiments
For a better understanding of the technical content of the present invention, specific examples are set forth below, along with the accompanying drawings.
Aspects of the invention are described herein with reference to the drawings, in which many illustrative embodiments are shown. The embodiments of the invention are not limited to those shown in the drawings. It should be understood that the invention can be implemented through any of the various concepts and embodiments described above and detailed below, since the disclosed concepts and embodiments are not limited to any particular implementation. In addition, some aspects of the disclosure may be used alone or in any suitable combination with other aspects of the disclosure.
Fig. 1 shows the overall structure of a preferred embodiment of the invention. The invention provides a multi-bit-width reconfigurable approximate tensor multiply-add method for neural network computation that supports multiplication operations with bit widths of 8×8, 8×4 and 4×4. For the input data, the following steps S1 to S3 are executed to obtain the calculation result of each sub-tensor multiplication unit, which is then output to the accumulation circuit and accumulated to obtain the final calculation result of the tensor multiply-add unit.
S1, perform exact calculation or approximate calculation on the input 8-bit and 4-bit data to obtain 5-bit data, wherein the exact calculation for 8-bit data is: split the 8-bit data into two 4-bit parts and extend each part to 5 bits; the approximate calculation for 8-bit data is: truncate the 8-bit data to 5-bit data and calculate a compensation shift value; the exact and approximate calculations for 4-bit data are: extend the 4-bit data with a sign bit at the most significant position to obtain 5-bit data, and then execute step S2.
The choice between exact and approximate calculation is made as follows: in scenarios with strong noise immunity, such as speech recognition, the approximate mode can be used; in scenarios where the neural network is less robust, such as image recognition with many classes, the exact calculation mode is required. The data are extended to 5 bits in order to maximize hardware reuse across the different calculation bit widths and modes while keeping the accuracy loss of the approximate calculation mode within an acceptable range.
Referring to fig. 3, the exact calculation for 8-bit data is: the 8-bit data is split into a high 4-bit part and a low 4-bit part; the high part is extended at the most significant position with the sign bit, and the low part is extended at the most significant position with 0, yielding 5-bit data.
When the multiplication bit width is 8×8 and the circuit operates in the exact mode, the invention decomposes the 8-bit data into high and low parts and computes the result with 4 5×5 signed multipliers. Assume the two 8-bit operands are A = {AH, AL} and B = {BH, BL}, where AH, BH are the high 4 bits and AL, BL the low 4 bits of A and B, respectively. Then A×B = AH×BH<<8 + AH×BL<<4 + AL×BH<<4 + AL×BL, so the product of A and B is obtained by shift-and-add using the 4 multipliers.
When the input tensors are the activation values and weights of a neural network, the weight data are available after off-chip training, so they can be pre-coded before being input to the tensor multiply-add unit: the 5-bit truncated data and the shift compensation value, which would otherwise be produced by truncation in the sign extension module, are calculated in advance. This avoids on-chip processing of the weight data during tensor multiply-add computation and further reduces hardware overhead.
When the approximate calculation is performed on 8-bit data, if the 8-bit data are an activation value and a weight in the neural network, pre-coding of the weight is performed first: the [6:2] bits of the pre-coded weight are taken as the 5-bit truncated value of the weight, and the decimal number represented by the [1:0] bits of the pre-coded weight is taken as the weight compensation shift value; the 5-bit truncated value and the compensation shift value of the activation value are then calculated on chip.
For example, assume the 8-bit activation value A is "1110_1001"; its calculated 5-bit truncated value is "10100" and its compensation shift value is 1. The 8-bit pre-coded weight W is "0001_1011", whose 5-bit truncated value is "00110" and whose shift compensation value is 3. The 10-bit product result is "11_1011_1000"; taking the total compensation shift value of 4 into account, the shift compensation circuit produces the final product "11_1011_1000_0000".
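This worked example can be checked with a short behavioral model. The Python sketch below is an illustration written for this description rather than the patented circuit; the variable names and the helper `to_signed` are ours. It unpacks the pre-coded weight, performs the single 5×5 signed multiplication, and applies the total compensation shift:

```python
def to_signed(value, bits):
    """Interpret `value` as a two's-complement number of the given width."""
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

# Activation A = 1110_1001, already truncated by the sign extension module
# to the 5-bit value 10100 with a compensation shift value of 1.
act_trunc, act_shift = 0b10100, 1

# Pre-coded weight W = 0001_1011: bits [6:2] hold the 5-bit truncated value
# (00110) and bits [1:0] hold the weight compensation shift value (3).
w = 0b0001_1011
wgt_trunc, wgt_shift = (w >> 2) & 0x1F, w & 0x3

# One 5x5 signed multiplication yields the 10-bit partial product ...
product10 = to_signed(act_trunc, 5) * to_signed(wgt_trunc, 5)   # -12 * 6 = -72
# ... and the shift compensation circuit shifts it left by the total shift.
total_shift = act_shift + wgt_shift                              # 1 + 3 = 4
final = product10 << total_shift                                 # -1152

print(format(product10 & 0x3FF, "010b"))   # 1110111000     ("11_1011_1000")
print(format(final & 0x3FFF, "014b"))      # 11101110000000 ("11_1011_1000_0000")
```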
Because the data in a neural network are signed, A and B are signed numbers, while the 4-bit parts AL and BL take part in the calculation as positive numbers without sign bits; computing with them directly would therefore produce erroneous results, so a one-bit sign extension is needed, extending the 4-bit values to 5 bits. AH and BH are each extended by one sign bit, to {SA, AH} and {SB, BH}, where SA and SB are the sign bits of A and B, respectively; AL and BL are each extended by one 0 bit, to {0, AL} and {0, BL}. The extended data are input to the 4 5×5 multipliers, and the results are shifted and added to obtain the final result.
When the multiplication bit width is 8×4 and the circuit operates in the exact mode, the input 8-bit data A is similarly extended to {SA, AH} and {0, AL}, the 4-bit data B is extended to {SB, B}, and the product of A and B is calculated by 2 5×5 signed multipliers and shift-added.
When the multiplication bit width is 4×4 and the circuit operates in the exact mode, data A and B are similarly extended to {SA, A} and {SB, B}, and their product is calculated by a single 5×5 signed multiplier.
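As a cross-check of the exact-mode decomposition above, the following Python sketch (again only a behavioral illustration; function names such as `split8` and `mul5x5` are ours) rebuilds an 8×8 product from four 5×5 signed multiplications of the sign-extended nibbles {SA, AH}, {0, AL}, {SB, BH}, {0, BL}, and an 8×4 product from two of them:

```python
def split8(x):
    """Split a signed 8-bit value into {Sx, XH} (sign-extended high nibble,
    a signed 5-bit value) and {0, XL} (zero-extended low nibble, 0..15)."""
    assert -128 <= x <= 127
    return x >> 4, x & 0xF          # arithmetic shift keeps the sign

def mul5x5(a, b):
    """Stand-in for one 5x5 signed multiplier (both operands fit in 5 signed bits)."""
    assert -16 <= a <= 15 and -16 <= b <= 15
    return a * b

def exact_8x8(a, b):
    """Exact 8x8 product: A*B = AH*BH<<8 + AH*BL<<4 + AL*BH<<4 + AL*BL."""
    ah, al = split8(a)
    bh, bl = split8(b)
    return (mul5x5(ah, bh) << 8) + (mul5x5(ah, bl) << 4) \
         + (mul5x5(al, bh) << 4) + mul5x5(al, bl)

def exact_8x4(a, b4):
    """Exact 8x4 product from two 5x5 partial products (B is a signed 4-bit value)."""
    ah, al = split8(a)
    return (mul5x5(ah, b4) << 4) + mul5x5(al, b4)

# Exhaustive check against native multiplication.
for a in range(-128, 128):
    for b in range(-128, 128):
        assert exact_8x8(a, b) == a * b
    for b4 in range(-8, 8):
        assert exact_8x4(a, b4) == a * b4
print("exact 8x8 and 8x4 decompositions verified")
```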
Fig. 2 shows the weight distribution of a neural network. Statistics show that in most neural networks the small weight values near 0 account for a large proportion of the weights, while the largest and smallest values account for a small proportion, so the distribution tends toward a normal distribution: the larger the magnitude of a weight, the lower its probability of appearing in the computation. The approximate calculation mode of the invention exploits this property and adopts a scheme in which the multipliers are smaller and more accurate for the frequent small values while larger values incur larger result errors, minimizing the hardware overhead of the circuit. The invention converts signed multiplications with a bit width of 8×8 or 8×4 into signed multiplications with a bit width of 5×5 by shifting. Specifically, the 8-bit data is traversed from the high bit to the low bit to find the first bit that differs from the sign bit; the corresponding 5 bits are then intercepted and the lower bits are discarded. Because the value represented by the discarded low bits is small, the error introduced by truncation stays within an acceptable range, while the multiplier bit width is effectively reduced, greatly lowering the hardware overhead. The approximate calculation for 8-bit data comprises: traverse from bit 7 to bit 0 and find the first bit that differs from the preceding bit, its position being denoted n; the intercepted data is A[n+1:n-3], where A is the 8-bit data to be approximated and A[x:y] denotes the bit field from bit x down to bit y; when n=7, the intercepted data is A[7:3]; when n is less than or equal to 4, the intercepted data is A[4:0]. The compensation shift value is calculated as follows: when n is greater than or equal to 3, the compensation shift value is n-3; when n<3, the compensation shift value is 0.
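A minimal behavioral sketch of this truncation rule is given below. It reflects our reading of the rule: the 5-bit window A[n+1:n-3] is clamped to A[7:3] at the top and, in this sketch, to A[4:0] when n < 3 so that the window never reaches below bit 0; with this choice the sketch reproduces the activation value of the earlier example ("1110_1001" → "10100", shift 1). The function name `truncate_8bit` is ours.

```python
def truncate_8bit(a):
    """Approximate a signed 8-bit value by a signed 5-bit value plus a
    compensation shift, i.e. a ~= trunc << shift."""
    bits = a & 0xFF
    n = 7                                  # default when no bit differs (a == 0 or a == -1)
    for i in range(6, -1, -1):             # scan from bit 7 downwards
        if ((bits >> i) & 1) != ((bits >> (i + 1)) & 1):
            n = i                          # first bit that differs from the one above it
            break
    if n >= 6:
        low, shift = 3, n - 3              # window clamped to A[7:3]
    elif n >= 3:
        low, shift = n - 3, n - 3          # general window A[n+1:n-3]
    else:
        low, shift = 0, 0                  # window clamped to A[4:0], shift 0
    trunc = (bits >> low) & 0x1F
    if trunc & 0x10:                       # reinterpret as a signed 5-bit value
        trunc -= 32
    return trunc, shift

t, s = truncate_8bit(-23)                  # -23 = 1110_1001
print(format(t & 0x1F, "05b"), s)          # 10100 1
assert (t << s) == -24                     # -24 approximates -23
```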
S2, perform a 5×5 signed multiplication on the data extended to 5 bits to obtain a partial product result; for the data truncated to 5 bits, sum the calculated compensation shift values to obtain a total compensation shift value N, perform the 5×5 signed multiplication and shift the result left by N bits to obtain the partial product result; then execute step S3.
S3, add and accumulate the partial product results of step S2 stage by stage, extending the bit width by 1 bit at each stage, to obtain the calculation result of the sub-tensor multiplication unit.
Fig. 4 shows the process of calculating an 8×8-bit-wide multiplication in the approximate calculation mode according to embodiment 1 of the invention. The 8-bit multiplier and multiplicand are first truncated to 5 bits by the sign extension module, and the required compensation shift values are calculated at the same time. The truncated 5-bit data are input to a 5×5 signed multiplier of the partial product generation module, and the final product is obtained through the shift compensation circuit of the partial product generation module.
Fig. 5 shows the structure of a single 16-bit adder in the reconfigurable adder tree of the invention. The addition result of the low n bits is generated by OR gates, the addition result of the remaining (16-n) bits is generated by exact full adders, and no carry signal is transferred between the n-th bit and the (n+1)-th bit. As a preferred scheme of this embodiment, n is set to 8, which achieves a good trade-off between accuracy and hardware overhead.
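A behavioral model of such a hybrid adder is short; the sketch below is our illustration and assumes the operands are handled as unsigned 16-bit words inside the adder tree. The low n bits are ORed, the high 16-n bits are added exactly, and no carry crosses the boundary:

```python
def approx_add16(x, y, n=8):
    """Approximate 16-bit addition as in Fig. 5: OR gates for the low n bits,
    exact full adders for the high 16-n bits, no carry across the boundary."""
    low = (x | y) & ((1 << n) - 1)                         # OR gates, no carries
    high = ((x >> n) + (y >> n)) & ((1 << (16 - n)) - 1)   # exact adder on the high part
    return (high << n) | low

a, b = 0x1234, 0x0567
print(hex(approx_add16(a, b)))        # 0x1777 (the exact sum is 0x179b)
print(hex(approx_add16(a, b, n=0)))   # 0x179b: n = 0 degenerates to exact addition
```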
In another aspect, the invention provides a multi-bit-width reconfigurable approximate tensor multiply-add system for neural network computation, comprising: at least one sub-tensor multiplication unit and an accumulation circuit correspondingly connected to the sub-tensor multiplication units, wherein the sub-tensor multiplication units support multiplication operations with bit widths of 8×8, 8×4 and 4×4, and the accumulation circuit accumulates and sums the results of the sub-tensor multiplication units, the sum being the final calculation result of the tensor multiply-add unit;
each sub-tensor multiplication unit comprises a sign extension module, a partial product generation module and a reconfigurable adder tree module;
the sign extension module performs exact or approximate calculation on the input 8-bit and 4-bit data to obtain 5-bit data, enabling hardware reuse, wherein the exact calculation for 8-bit data is: splitting the 8-bit data into two 4-bit parts and extending each part to 5 bits; the approximate calculation for 8-bit data is: truncating the 8-bit data to 5-bit data and calculating a compensation shift value; the exact and approximate calculations for 4-bit data are: extending the 4-bit data with a sign bit at the most significant position to obtain 5-bit data; the 5-bit data are then output to the partial product generation module;
the partial product generation module performs a 5×5 signed multiplication on the data extended to 5 bits to obtain a partial product result; for the data truncated to 5 bits, it sums the calculated compensation shift values to obtain a total compensation shift value N, performs the 5×5 signed multiplication and shifts the result left by N bits to obtain the partial product result, and outputs the partial product results to the reconfigurable adder tree module;
the reconfigurable adder tree module, whose calculation precision can be adjusted dynamically, adds and accumulates the partial product results output by the partial product generation module stage by stage, extending the bit width by 1 bit at each stage, to obtain the calculation result of the sub-tensor multiplication unit.
As a preferred scheme of the multi-bit-width reconfigurable approximate tensor multiply-add system for neural network computation, the number of sub-tensor multiplication units is 4, and the reconfigurable adder tree module comprises a 4-stage adder tree that adds the 16 exact 5×5 products, each addition stage extending the bit width by 1 bit to prevent overflow of the result; the reconfigurable adder tree can dynamically configure the low-order additions as OR gates or exact full adders according to the working mode of the sub-tensor multiplication unit, reducing computation power. The partial product generation module includes 16 5×5 signed multipliers and a shift compensation circuit; the inputs of this module come from the output of the sign extension module. The shift compensation circuit works only in the approximate calculation mode when 8-bit data are input; at other times it exchanges no data with other circuits, reducing switching power. The circuit works as follows: all compensation shift values associated with a given 5×5 signed multiplier are added to obtain a total compensation shift value N, and the result of that 5×5 signed multiplier is shifted left by N bits; this is done in parallel for the 16 5×5 signed multipliers.
Referring to fig. 1, with the method of the invention, the 256-bit multiplier and the 256-bit multiplicand are each divided equally into 4 groups of 64-bit multipliers and multiplicands and input into the 4 sub-tensor multiplication units. Inside each sub-tensor multiplication unit, the 64-bit multiplier and multiplicand are split again according to the configured multiplication bit width, and the sign extension module generates 16 exact 5×5 input values; in the approximate calculation mode, when the multiplication bit width contains 8-bit data, the compensation shift values are calculated as well. The 16 5×5 multipliers in the partial product generation module calculate 16 10-bit results. In the approximate calculation mode, when the multiplication bit width contains 8-bit data, these results are shifted left by the corresponding number of bits by the shift compensation circuit to obtain the 16 final partial products. The 16 final partial products are accumulated in the reconfigurable adder tree module through 4 levels of addition to obtain a partial sum. In the approximate calculation mode, the low-order additions of the reconfigurable adder tree are implemented by OR gates; in the exact calculation mode, they are implemented by full adders. The 4 partial sums of the 4 sub-tensor multiplication units are generated in parallel and sent to the accumulation circuit, which accumulates the four partial sums to obtain the final calculation result of the multi-bit-width reconfigurable approximate tensor multiply-add unit.
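To tie the pieces together, the end-to-end behavioral sketch below (our own model, not the circuit itself) assumes each sub-tensor multiplication unit processes 16 activation/weight pairs in the approximate 8×8 mode, one 5×5 multiplier per pair, and reuses the truncation rule sketched earlier. It truncates, multiplies, applies shift compensation, accumulates through a 4-stage tree, and finally sums the four partial sums in the accumulation circuit:

```python
import random

def truncate_8bit(a):
    """5-bit truncation with compensation shift (same behavioral reading as above)."""
    bits = a & 0xFF
    n = 7
    for i in range(6, -1, -1):
        if ((bits >> i) & 1) != ((bits >> (i + 1)) & 1):
            n = i
            break
    low = 3 if n >= 6 else (n - 3 if n >= 3 else 0)
    shift = n - 3 if n >= 3 else 0
    trunc = (bits >> low) & 0x1F
    return (trunc - 32 if trunc & 0x10 else trunc), shift

def adder_tree_16(partials):
    """4-stage pairwise adder tree over 16 partial products; in hardware each
    stage widens the datapath by 1 bit (modeled here with plain integers)."""
    while len(partials) > 1:
        partials = [partials[i] + partials[i + 1] for i in range(0, len(partials), 2)]
    return partials[0]

def sub_tensor_unit(acts, wgts):
    """One sub-tensor multiplication unit in approximate 8x8 mode, 16 pairs."""
    partials = []
    for a, w in zip(acts, wgts):
        ta, sa = truncate_8bit(a)                 # sign extension module
        tw, sw = truncate_8bit(w)                 # (weights could be pre-coded off-chip)
        partials.append((ta * tw) << (sa + sw))   # 5x5 multiply + shift compensation
    return adder_tree_16(partials)

def tensor_mac(acts, wgts):
    """Four sub-units in parallel; the accumulation circuit sums the partial sums."""
    return sum(sub_tensor_unit(acts[i:i + 16], wgts[i:i + 16]) for i in range(0, 64, 16))

random.seed(0)
acts = [random.randint(-128, 127) for _ in range(64)]
wgts = [random.randint(-128, 127) for _ in range(64)]
print(sum(a * w for a, w in zip(acts, wgts)), tensor_mac(acts, wgts))
# the approximate dot product tracks the exact one
```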
The invention has been applied to a general-purpose neural network accelerator chip containing a total of 4 rows and 16 columns of tensor multiply-add units; data and weights are broadcast to each multiply-add unit, and parallel computation of the units is realized with an output-stationary dataflow. For inference of the AlexNet network, layers 7 and 8 are run in the 4×4 mode, layers 5 and 6 in the 8×4 mode, and the remaining layers 1 to 4 in the approximate 8×8 mode. The final network computation energy efficiency is improved by 39.7%.
While the invention has been described in terms of preferred embodiments, it is not intended to be limited thereto. Those skilled in the art can make various modifications and adaptations without departing from the spirit and scope of the invention. Accordingly, the scope of protection of the invention is defined by the appended claims.

Claims (10)

1. A multi-bit width reconstruction approximate tensor multiplication and addition method for neural network calculation, used for supporting multiplication operations with bit widths of 8×8, 8×4 and 4×4, characterized in that, for input data, the following steps S1 to S3 are executed to obtain the calculation result of a sub-tensor multiplication unit, which is then output to an accumulation circuit and accumulated to obtain the final calculation result of the tensor multiply-add unit,
S1, perform exact calculation or approximate calculation on the input 8-bit data and 4-bit data to obtain 5-bit data, wherein the exact calculation for 8-bit data is: splitting the 8-bit data into two 4-bit parts and extending each part to 5 bits; the approximate calculation for 8-bit data is: truncating the 8-bit data to 5-bit data and calculating a compensation shift value; the exact and approximate calculations for 4-bit data are: extending the 4-bit data with a sign bit at the most significant position to obtain 5-bit data, and then executing step S2;
S2, perform a 5×5 signed multiplication on the data extended to 5 bits to obtain a partial product result; for the data truncated to 5 bits, sum the calculated compensation shift values to obtain a total compensation shift value N, perform the 5×5 signed multiplication and shift the result left by N bits to obtain the partial product result, and then execute step S3;
S3, add and accumulate the partial product results of step S2 stage by stage, extending the bit width by 1 bit at each stage, to obtain the calculation result of the sub-tensor multiplication unit.
2. The multi-bit width reconstruction approximate tensor multiplication and addition method for neural network calculation according to claim 1, wherein in step S1 the exact calculation for 8-bit data is: the 8-bit data is split into a high 4-bit part and a low 4-bit part; the high 4-bit part is extended at the most significant position with the sign bit, and the low 4-bit part is extended at the most significant position with 0, yielding 5-bit data.
3. The multi-bit width reconstruction approximate tensor multiplication and addition method for neural network calculation according to claim 1, wherein in step S1 the approximate calculation for 8-bit data comprises: traversing from bit 7 to bit 0 and finding the first bit that differs from the preceding bit, its position being denoted n; the intercepted data is A[n+1:n-3], where A is the 8-bit data to be approximated and A[x:y] denotes the bit field from bit x down to bit y; when n=7, the intercepted data is A[7:3]; when n is less than or equal to 4, the intercepted data is A[4:0]; the compensation shift value is calculated as follows: when n is greater than or equal to 3, the compensation shift value is n-3; when n<3, the compensation shift value is 0.
4. The multi-bit width reconstruction approximate tensor multiplication and addition method for neural network calculation according to claim 3, wherein step S1 further comprises: when the approximate calculation is performed on 8-bit data, if the 8-bit data are an activation value and a weight in the neural network, pre-coding of the weight is performed first: the [6:2] bits of the pre-coded weight are taken as the 5-bit truncated value of the weight, and the decimal number represented by the [1:0] bits of the pre-coded weight is taken as the weight compensation shift value; the compensation shift value of the activation value is then calculated.
5. The multi-bit width reconstruction approximate tensor multiplication and addition method for neural network calculation according to claim 4, wherein calculating the total compensation shift value N in step S2 further comprises: adding the compensation shift value of the activation value and the compensation shift value of the weight to obtain the total compensation shift value.
6. The multi-bit width reconstruction approximate tensor multiplication and addition method for neural network calculation according to claim 5, wherein in step S2 the 5×5 signed multiplications of the 5-bit data are performed by 16 5×5 multipliers.
7. A multi-bit width reconstruction approximate tensor multiplication and addition system for neural network calculation, characterized by comprising: at least one sub-tensor multiplication unit and an accumulation circuit correspondingly connected to the sub-tensor multiplication units, wherein the sub-tensor multiplication units are used for supporting multiplication operations with bit widths of 8×8, 8×4 and 4×4, and the accumulation circuit is used for accumulating and summing the results of the sub-tensor multiplication units, the sum being the final calculation result of the tensor multiply-add unit;
the sub-tensor multiplication unit comprises: a sign extension module, a partial product generation module and a reconfigurable adder tree module;
the sign extension module is used for performing exact calculation or approximate calculation on the input 8-bit data and 4-bit data to obtain 5-bit data, wherein the exact calculation for 8-bit data is: splitting the 8-bit data into two 4-bit parts and extending each part to 5 bits; the approximate calculation for 8-bit data is: truncating the 8-bit data to 5-bit data and calculating a compensation shift value; the exact and approximate calculations for 4-bit data are: extending the 4-bit data with a sign bit at the most significant position to obtain 5-bit data, which is then output to the partial product generation module;
the partial product generation module is used for performing a 5×5 signed multiplication on the data extended to 5 bits to obtain a partial product result, summing the calculated compensation shift values of the data truncated to 5 bits to obtain a total compensation shift value N, performing the 5×5 signed multiplication and shifting the result left by N bits to obtain the partial product result, and outputting the partial product results to the reconfigurable adder tree module;
and the reconfigurable adder tree module is used for adding and accumulating the partial product results output by the partial product generation module stage by stage, extending the bit width by 1 bit at each stage, to obtain the calculation result of the sub-tensor multiplication unit.
8. The multi-bit width reconstruction approximate tensor multiplication and addition system for neural network calculation according to claim 7, wherein the number of sub-tensor multiplication units is 4.
9. The multi-bit width reconstruction approximate tensor multiplication and addition system for neural network calculation according to claim 8, wherein the reconfigurable adder tree module comprises a 4-stage adder for accumulating the results output by the sub-tensor multiplication unit.
10. The multi-bit width reconstruction approximate tensor multiplication and addition system for neural network calculation according to claim 7, wherein the partial product generation module includes 16 5×5 signed multipliers and a shift compensation circuit.
CN202311453997.1A 2023-11-03 2023-11-03 Multi-bit wide reconstruction approximate tensor multiplication and addition method and system for neural network calculation Active CN117170623B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311453997.1A CN117170623B (en) 2023-11-03 2023-11-03 Multi-bit wide reconstruction approximate tensor multiplication and addition method and system for neural network calculation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311453997.1A CN117170623B (en) 2023-11-03 2023-11-03 Multi-bit wide reconstruction approximate tensor multiplication and addition method and system for neural network calculation

Publications (2)

Publication Number Publication Date
CN117170623A true CN117170623A (en) 2023-12-05
CN117170623B CN117170623B (en) 2024-01-30

Family

ID=88939937

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311453997.1A Active CN117170623B (en) 2023-11-03 2023-11-03 Multi-bit wide reconstruction approximate tensor multiplication and addition method and system for neural network calculation

Country Status (1)

Country Link
CN (1) CN117170623B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11531869B1 (en) * 2019-03-28 2022-12-20 Xilinx, Inc. Neural-network pooling
CN112416297A (en) * 2019-08-23 2021-02-26 辉达公司 Neural network accelerator based on logarithm algorithm
CN114730374A (en) * 2020-03-25 2022-07-08 西部数据技术公司 Flexible accelerator for sparse tensors in convolutional neural networks
CN112732224A (en) * 2021-01-12 2021-04-30 东南大学 Reconfigurable approximate tensor multiplication and addition unit and method for convolutional neural network
CN116543808A (en) * 2022-01-25 2023-08-04 上海交通大学 All-digital domain in-memory approximate calculation circuit based on SRAM unit
CN114647399A (en) * 2022-05-19 2022-06-21 南京航空航天大学 Low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device
CN115033204A (en) * 2022-05-23 2022-09-09 东南大学 High-energy-efficiency approximate multiplier with reconfigurable precision and bit width
CN115145536A (en) * 2022-06-29 2022-10-04 浙江大学 Adder tree unit with low bit width input and low bit width output and approximate multiply-add method
CN115840556A (en) * 2022-10-18 2023-03-24 东南大学 2 groups of signed tensor calculation circuit structure based on 6-bit approximate full adder
CN115982528A (en) * 2022-11-25 2023-04-18 上海交通大学 Booth algorithm-based approximate precoding convolution operation method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
梁泰琳: "Design of a Reduced Instruction Set Tensor Processor for Neural Network Applications", China Doctoral Dissertations Full-text Database (Information Science and Technology), pages 137-1 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117910421A (en) * 2024-03-15 2024-04-19 南京美辰微电子有限公司 Dynamic approximate circuit calculation deployment method and system based on neural network

Also Published As

Publication number Publication date
CN117170623B (en) 2024-01-30

Similar Documents

Publication Publication Date Title
CN106909970B (en) Approximate calculation-based binary weight convolution neural network hardware accelerator calculation device
CN110780845A (en) Configurable approximate multiplier for quantization convolutional neural network and implementation method thereof
US20210349692A1 (en) Multiplier and multiplication method
CN117170623B (en) Multi-bit wide reconstruction approximate tensor multiplication and addition method and system for neural network calculation
CN111832719A (en) Fixed point quantization convolution neural network accelerator calculation circuit
US6601077B1 (en) DSP unit for multi-level global accumulation
KR101603471B1 (en) System and method for signal processing in digital signal processors
Liu et al. A precision-scalable energy-efficient convolutional neural network accelerator
US20220283777A1 (en) Signed multiword multiplier
CN115982528A (en) Booth algorithm-based approximate precoding convolution operation method and system
CN114115803A (en) Approximate floating-point multiplier based on partial product probability analysis
CN111966327A (en) Mixed precision space-time multiplexing multiplier based on NAS (network attached storage) search and control method thereof
CN116205244A (en) Digital signal processing structure
US20220075598A1 (en) Systems and Methods for Numerical Precision in Digital Multiplier Circuitry
Baba et al. Design and implementation of advanced modified booth encoding multiplier
CN113672196B (en) Double multiplication calculating device and method based on single digital signal processing unit
CN115357214A (en) Operation unit compatible with asymmetric multi-precision mixed multiply-accumulate operation
CN113283591B (en) Efficient convolution implementation method and device based on Winograd algorithm and approximate multiplier
KR20230121151A (en) Numerical precision of digital multiplier networks
CN114691086A (en) High-performance approximate multiplier based on operand clipping and calculation method thereof
Saggese et al. CFPM: Run-time Configurable Floating-Point Multiplier
US20200125329A1 (en) Rank-based dot product circuitry
Spagnolo et al. Efficient implementation of signed multipliers on FPGAs
CN211577939U (en) Special calculation array for neural network
CN111401533B (en) Special computing array for neural network and computing method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant