CN117170623A - Multi-bit wide reconstruction approximate tensor multiplication and addition method and system for neural network calculation

Info

Publication number
CN117170623A
CN117170623A
Authority
CN
China
Prior art keywords
data
bit
tensor
calculation
approximate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311453997.1A
Other languages
Chinese (zh)
Other versions
CN117170623B (en)
Inventor
张浩
汪粲星
谢钠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Magnichip Microelectronics Co ltd
Original Assignee
Nanjing Magnichip Microelectronics Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Magnichip Microelectronics Co ltd filed Critical Nanjing Magnichip Microelectronics Co ltd
Priority to CN202311453997.1A priority Critical patent/CN117170623B/en
Publication of CN117170623A publication Critical patent/CN117170623A/en
Application granted granted Critical
Publication of CN117170623B publication Critical patent/CN117170623B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a multi-bit-width reconfigurable approximate tensor multiply-add method and system for neural network computation that support multiplication operations with bit widths of 8×8, 8×4 and 4×4. The approximate tensor multiply-add unit comprises sub-tensor multiplication units and an accumulation circuit. Each sub-tensor multiplication unit comprises a sign extension module, a partial product generation module and a reconfigurable adder tree module. The sign extension module extends 4-bit data to 5-bit data and, in the approximate calculation mode, computes the truncated data and compensation shift values of 8-bit data; the reconfigurable adder tree module accumulates the results of the partial product generation module. The accumulation circuit computes the sum of the results of the corresponding sub-tensor multiplication units as the final result. By changing the bit width and structure of the multipliers and adders, the invention realizes different degrees of approximation, significantly reduces the hardware overhead of tensor multiply-add computation, and maintains high accuracy and flexibility.

Description

Multi-bit wide reconstruction approximate tensor multiplication and addition method and system for neural network calculation
Technical Field
The invention belongs to the field of reconfigurable computing, and particularly relates to a multi-bit-width reconfigurable approximate tensor multiply-add method and system for neural network computation.
Background
Deep neural networks have achieved state-of-the-art results in a range of machine learning applications, from image classification to speech recognition. There is a growing need to run such brain-inspired networks on mobile devices and edge Internet-of-Things sensors with limited hardware and power budgets, which calls for small-area, energy-efficient neural network accelerator designs.
Neural network computation refers to the process of performing various computing tasks with a neural network model; it offers parallelism, distribution, adaptivity, fault tolerance and other advantages, and is widely applied in many fields such as computer vision, radar image processing, speech recognition and image generation.
Approximate computing is a technique that reduces computational complexity and energy consumption at the cost of some accuracy. Quantization is one approximate-computing technique that improves data throughput and overall energy efficiency by reducing bit width. Neural network computation involves a large number of multiply-add operations, which consume considerable area resources and incur large power overheads. Moreover, different neural network applications, or different layers of the same network, have different error tolerances; a compute unit with a single bit width and mode therefore wastes energy efficiency, and the bit width and precision of the compute unit need to be adjusted dynamically according to the application requirements.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: neural network computation in the prior art involves a large number of multiply-add operations that consume substantial area resources and incur large power overheads. The invention therefore provides a multi-bit-width reconfigurable approximate tensor multiply-add method and system for neural network computation. Accuracy and efficiency can be balanced by adopting different bit widths, different degrees of approximation can be realized by changing the bit width and structure of the multipliers and adders, and the power and area overheads of tensor multiply-add computation are significantly reduced while high accuracy and flexibility are maintained.
In order to solve the above technical problem, the invention provides the following technical scheme: a multi-bit-width reconfigurable approximate tensor multiply-add method for neural network computation, supporting multiplication operations with bit widths of 8×8, 8×4 and 4×4. For the input data, the following steps S1 to S3 are executed to obtain the calculation result of each sub-tensor multiplication unit, which is then output to an accumulation circuit and accumulated to obtain the final calculation result of the tensor multiply-add unit.
S1, perform exact calculation or approximate calculation on the input 8-bit and 4-bit data to obtain 5-bit data, wherein the exact calculation for 8-bit data is: split the 8-bit data into two 4-bit parts and extend each part to 5 bits; the approximate calculation for 8-bit data is: truncate the 8-bit data to 5-bit data and calculate a compensation shift value; the exact and approximate calculations for 4-bit data are: extend the 4-bit data with a sign bit at the most significant position to obtain 5-bit data; then execute step S2.
S2, perform a 5×5 signed multiplication on the data extended to 5 bits to obtain a partial product result; for the data truncated to 5 bits, sum the calculated compensation shift values to obtain a total compensation shift value N, perform the 5×5 signed multiplication and shift the result left by N bits to obtain the partial product result; then execute step S3.
S3, add and accumulate the partial product results of step S2 stage by stage, extending the bit width by 1 bit at each stage, to obtain the calculation result of the sub-tensor multiplication unit.
Further, in the foregoing step S1, the exact calculation for 8-bit data is: the 8-bit data is split into a high 4-bit part and a low 4-bit part; the high 4-bit part is extended at the most significant position with the sign bit, and the low 4-bit part is extended at the most significant position with 0, yielding 5-bit data.
Further, in the foregoing step S1, the approximate calculation for 8-bit data comprises: traverse from bit 7 to bit 0 and find the first bit that differs from the preceding bit, its position being denoted n; the intercepted data is A[n+1:n-3], where A is the 8-bit data to be approximated and A[x:y] denotes the bit field from bit x down to bit y; when n=7, the intercepted data is A[7:3]; when n is less than or equal to 4, the intercepted data is A[4:0]. The compensation shift value is calculated as follows: when n is greater than or equal to 3, the compensation shift value is n-3; when n<3, the compensation shift value is 0.
Further, step S1 also comprises: when the approximate calculation is performed on 8-bit data, if the 8-bit data are an activation value and a weight in the neural network, pre-coding of the weight is performed first: the [6:2] bits of the pre-coded weight are taken as the 5-bit truncated value of the weight, and the decimal number represented by the [1:0] bits of the pre-coded weight is taken as the weight compensation shift value; the 5-bit truncated value and the compensation shift value of the activation value are then calculated as described above.
Further, in step S2, calculating the total compensation shift value N further comprises: adding the compensation shift value of the activation value and the compensation shift value of the weight to obtain the total compensation shift value.
Further, in the foregoing step S2, the 5×5 signed multiplications of the 5-bit data are performed by 16 5×5 multipliers.
The present invention also provides a multi-bit-width reconfigurable approximate tensor multiply-add system for neural network computation, comprising: at least one sub-tensor multiplication unit and an accumulation circuit correspondingly connected to the sub-tensor multiplication units, wherein the sub-tensor multiplication units support multiplication operations with bit widths of 8×8, 8×4 and 4×4, and the accumulation circuit accumulates and sums the results of the sub-tensor multiplication units, the sum being the final calculation result of the tensor multiply-add unit;
each sub-tensor multiplication unit comprises a sign extension module, a partial product generation module and a reconfigurable adder tree module;
the sign extension module performs exact or approximate calculation on the input 8-bit and 4-bit data to obtain 5-bit data, wherein the exact calculation for 8-bit data is: splitting the 8-bit data into two 4-bit parts and extending each part to 5 bits; the approximate calculation for 8-bit data is: truncating the 8-bit data to 5-bit data and calculating a compensation shift value; the exact and approximate calculations for 4-bit data are: extending the 4-bit data with a sign bit at the most significant position to obtain 5-bit data; the 5-bit data are then output to the partial product generation module;
the partial product generation module performs a 5×5 signed multiplication on the data extended to 5 bits to obtain a partial product result; for the data truncated to 5 bits, it sums the calculated compensation shift values to obtain a total compensation shift value N, performs the 5×5 signed multiplication and shifts the result left by N bits to obtain the partial product result, and outputs the partial product results to the reconfigurable adder tree module;
the reconfigurable adder tree module adds and accumulates the partial product results output by the partial product generation module stage by stage, extending the bit width by 1 bit at each stage, to obtain the calculation result of the sub-tensor multiplication unit.
Further, the number of the aforementioned sub-tensor multiplication units is 4.
Further, the reconfigurable adder tree module includes a 4-stage adder for accumulating the result output from the sub-tensor multiplication unit.
Further, the aforementioned partial product generation module includes 16 5×5 signed multipliers and a shift compensation circuit.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
(1) The tensor multiply-add unit supports multiplications of three bit widths and realizes a more flexible configuration with the same hardware. Because different layers of a neural network tolerate errors to different degrees, the multiplication bit width can be configured dynamically to minimize hardware overhead. For example, in a layer that is highly sensitive to errors, 8×8-bit multiplications are performed with 8-bit data; in a layer with low sensitivity to errors, 8×4- or 4×4-bit multiplications are performed with 4-bit data, improving computational energy efficiency while maintaining network accuracy.
(2) The proposed tensor multiply-add unit supports both an exact calculation mode and an approximate calculation mode. In the approximate mode, multiplications involving 8-bit data are converted into 5-bit multiplications by shifting, reducing circuit delay and power consumption; owing to the data distribution of neural networks, this conversion does not excessively affect overall network accuracy. The lower bits of the reconfigurable adder tree are also dynamically configured as OR gates in the approximate mode to replace full adders, further reducing the hardware overhead of the circuit.
(3) The weights used by the tensor multiply-add unit are fixed once network training is finished, so they can be pre-coded when 8×8-bit multiplications are calculated in the approximate mode, skipping the calculation of the shift compensation value and the 5-bit truncated data in the sign extension module and further improving computational energy efficiency.
Drawings
FIG. 1 is an overall schematic of an embodiment of the present invention.
Fig. 2 is a schematic diagram of a weight distribution of a neural network.
Fig. 3 is a schematic diagram of an expansion method and a calculation method of different bit width data in an embodiment of the invention.
FIG. 4 is a schematic diagram of an embodiment of the present invention calculating an 8×8bit wide multiplication in an approximate calculation mode.
FIG. 5 is a schematic diagram of the structure of a single 16bits adder in a reconfigurable adder tree in an embodiment of the invention.
Description of the embodiments
For a better understanding of the technical content of the present invention, specific examples are set forth below, along with the accompanying drawings.
Aspects of the invention are described herein with reference to the drawings, in which many illustrative embodiments are shown. The embodiments of the invention are not limited to those shown in the drawings. It should be understood that the invention can be implemented through any of the various concepts and embodiments described above and detailed below, since the disclosed concepts and embodiments are not limited to any particular implementation. In addition, some aspects of the disclosure may be used alone or in any suitable combination with other aspects of the disclosure.
Fig. 1 shows the overall structure of a preferred embodiment of the invention. The invention provides a multi-bit-width reconfigurable approximate tensor multiply-add method for neural network computation that supports multiplication operations with bit widths of 8×8, 8×4 and 4×4. For the input data, the following steps S1 to S3 are executed to obtain the calculation result of each sub-tensor multiplication unit, which is then output to the accumulation circuit and accumulated to obtain the final calculation result of the tensor multiply-add unit.
S1, perform exact calculation or approximate calculation on the input 8-bit and 4-bit data to obtain 5-bit data, wherein the exact calculation for 8-bit data is: split the 8-bit data into two 4-bit parts and extend each part to 5 bits; the approximate calculation for 8-bit data is: truncate the 8-bit data to 5-bit data and calculate a compensation shift value; the exact and approximate calculations for 4-bit data are: extend the 4-bit data with a sign bit at the most significant position to obtain 5-bit data, and then execute step S2.
The choice between exact and approximate calculation is made as follows: in scenarios with strong noise immunity, such as speech recognition, the approximate mode can be used; in scenarios where the neural network is less robust, such as image recognition with many classes, the exact calculation mode is required. The data are extended to 5 bits in order to maximize hardware reuse across the different calculation bit widths and modes while keeping the accuracy loss of the approximate calculation mode within an acceptable range.
Referring to fig. 3, the exact calculation for 8-bit data is: the 8-bit data is split into a high 4-bit part and a low 4-bit part; the high part is extended at the most significant position with the sign bit, and the low part is extended at the most significant position with 0, yielding 5-bit data.
When the multiplication bit width is 8×8 and the circuit operates in the exact mode, the invention decomposes the 8-bit data into high and low parts and computes the result with 4 5×5 signed multipliers. Assume the two 8-bit operands are A = {AH, AL} and B = {BH, BL}, where AH, BH are the high 4 bits and AL, BL the low 4 bits of A and B, respectively. Then A×B = AH×BH<<8 + AH×BL<<4 + AL×BH<<4 + AL×BL, so the product of A and B is obtained by shift-and-add using the 4 multipliers.
When the input tensors are the activation values and weights of a neural network, the weight data are available after off-chip training, so they can be pre-coded before being input to the tensor multiply-add unit: the 5-bit truncated data and the shift compensation value, which would otherwise be produced by truncation in the sign extension module, are calculated in advance. This avoids on-chip processing of the weight data during tensor multiply-add computation and further reduces hardware overhead.
When the approximate calculation is performed on 8-bit data, if the 8-bit data are an activation value and a weight in the neural network, pre-coding of the weight is performed first: the [6:2] bits of the pre-coded weight are taken as the 5-bit truncated value of the weight, and the decimal number represented by the [1:0] bits of the pre-coded weight is taken as the weight compensation shift value; the 5-bit truncated value and the compensation shift value of the activation value are then calculated on chip.
For example, assume the 8-bit activation value A is "1110_1001"; its calculated 5-bit truncated value is "10100" and its compensation shift value is 1. The 8-bit pre-coded weight W is "0001_1011", whose 5-bit truncated value is "00110" and whose shift compensation value is 3. The 10-bit product result is "11_1011_1000"; taking the total compensation shift value of 4 into account, the shift compensation circuit produces the final product "11_1011_1000_0000".
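This worked example can be checked with a short behavioral model. The Python sketch below is an illustration written for this description rather than the patented circuit; the variable names and the helper `to_signed` are ours. It unpacks the pre-coded weight, performs the single 5×5 signed multiplication, and applies the total compensation shift:

```python
def to_signed(value, bits):
    """Interpret `value` as a two's-complement number of the given width."""
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

# Activation A = 1110_1001, already truncated by the sign extension module
# to the 5-bit value 10100 with a compensation shift value of 1.
act_trunc, act_shift = 0b10100, 1

# Pre-coded weight W = 0001_1011: bits [6:2] hold the 5-bit truncated value
# (00110) and bits [1:0] hold the weight compensation shift value (3).
w = 0b0001_1011
wgt_trunc, wgt_shift = (w >> 2) & 0x1F, w & 0x3

# One 5x5 signed multiplication yields the 10-bit partial product ...
product10 = to_signed(act_trunc, 5) * to_signed(wgt_trunc, 5)   # -12 * 6 = -72
# ... and the shift compensation circuit shifts it left by the total shift.
total_shift = act_shift + wgt_shift                              # 1 + 3 = 4
final = product10 << total_shift                                 # -1152

print(format(product10 & 0x3FF, "010b"))   # 1110111000     ("11_1011_1000")
print(format(final & 0x3FFF, "014b"))      # 11101110000000 ("11_1011_1000_0000")
```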
Because the data in a neural network are signed, A and B are signed numbers, while the 4-bit parts AL and BL take part in the calculation as positive numbers without sign bits; computing with them directly would therefore produce erroneous results, so a one-bit sign extension is needed, extending the 4-bit values to 5 bits. AH and BH are each extended by one sign bit, to {SA, AH} and {SB, BH}, where SA and SB are the sign bits of A and B, respectively; AL and BL are each extended by one 0 bit, to {0, AL} and {0, BL}. The extended data are input to the 4 5×5 multipliers, and the results are shifted and added to obtain the final result.
When the multiplication bit width is 8×4 and the circuit operates in the exact mode, the input 8-bit data A is similarly extended to {SA, AH} and {0, AL}, the 4-bit data B is extended to {SB, B}, and the product of A and B is calculated by 2 5×5 signed multipliers and shift-added.
When the multiplication bit width is 4×4 and the circuit operates in the exact mode, data A and B are similarly extended to {SA, A} and {SB, B}, and their product is calculated by a single 5×5 signed multiplier.
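As a cross-check of the exact-mode decomposition above, the following Python sketch (again only a behavioral illustration; function names such as `split8` and `mul5x5` are ours) rebuilds an 8×8 product from four 5×5 signed multiplications of the sign-extended nibbles {SA, AH}, {0, AL}, {SB, BH}, {0, BL}, and an 8×4 product from two of them:

```python
def split8(x):
    """Split a signed 8-bit value into {Sx, XH} (sign-extended high nibble,
    a signed 5-bit value) and {0, XL} (zero-extended low nibble, 0..15)."""
    assert -128 <= x <= 127
    return x >> 4, x & 0xF          # arithmetic shift keeps the sign

def mul5x5(a, b):
    """Stand-in for one 5x5 signed multiplier (both operands fit in 5 signed bits)."""
    assert -16 <= a <= 15 and -16 <= b <= 15
    return a * b

def exact_8x8(a, b):
    """Exact 8x8 product: A*B = AH*BH<<8 + AH*BL<<4 + AL*BH<<4 + AL*BL."""
    ah, al = split8(a)
    bh, bl = split8(b)
    return (mul5x5(ah, bh) << 8) + (mul5x5(ah, bl) << 4) \
         + (mul5x5(al, bh) << 4) + mul5x5(al, bl)

def exact_8x4(a, b4):
    """Exact 8x4 product from two 5x5 partial products (B is a signed 4-bit value)."""
    ah, al = split8(a)
    return (mul5x5(ah, b4) << 4) + mul5x5(al, b4)

# Exhaustive check against native multiplication.
for a in range(-128, 128):
    for b in range(-128, 128):
        assert exact_8x8(a, b) == a * b
    for b4 in range(-8, 8):
        assert exact_8x4(a, b4) == a * b4
print("exact 8x8 and 8x4 decompositions verified")
```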
Fig. 2 shows the weight distribution of a neural network. Statistics show that in most neural networks the small weight values near 0 account for a large proportion of the weights, while the largest and smallest values account for a small proportion, so the distribution tends toward a normal distribution: the larger the magnitude of a weight, the lower its probability of appearing in the computation. The approximate calculation mode of the invention exploits this property and adopts a scheme in which the multipliers are smaller and more accurate for the frequent small values while larger values incur larger result errors, minimizing the hardware overhead of the circuit. The invention converts signed multiplications with a bit width of 8×8 or 8×4 into signed multiplications with a bit width of 5×5 by shifting. Specifically, the 8-bit data is traversed from the high bit to the low bit to find the first bit that differs from the sign bit; the corresponding 5 bits are then intercepted and the lower bits are discarded. Because the value represented by the discarded low bits is small, the error introduced by truncation stays within an acceptable range, while the multiplier bit width is effectively reduced, greatly lowering the hardware overhead. The approximate calculation for 8-bit data comprises: traverse from bit 7 to bit 0 and find the first bit that differs from the preceding bit, its position being denoted n; the intercepted data is A[n+1:n-3], where A is the 8-bit data to be approximated and A[x:y] denotes the bit field from bit x down to bit y; when n=7, the intercepted data is A[7:3]; when n is less than or equal to 4, the intercepted data is A[4:0]. The compensation shift value is calculated as follows: when n is greater than or equal to 3, the compensation shift value is n-3; when n<3, the compensation shift value is 0.
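A minimal behavioral sketch of this truncation rule is given below. It reflects our reading of the rule: the 5-bit window A[n+1:n-3] is clamped to A[7:3] at the top and, in this sketch, to A[4:0] when n < 3 so that the window never reaches below bit 0; with this choice the sketch reproduces the activation value of the earlier example ("1110_1001" → "10100", shift 1). The function name `truncate_8bit` is ours.

```python
def truncate_8bit(a):
    """Approximate a signed 8-bit value by a signed 5-bit value plus a
    compensation shift, i.e. a ~= trunc << shift."""
    bits = a & 0xFF
    n = 7                                  # default when no bit differs (a == 0 or a == -1)
    for i in range(6, -1, -1):             # scan from bit 7 downwards
        if ((bits >> i) & 1) != ((bits >> (i + 1)) & 1):
            n = i                          # first bit that differs from the one above it
            break
    if n >= 6:
        low, shift = 3, n - 3              # window clamped to A[7:3]
    elif n >= 3:
        low, shift = n - 3, n - 3          # general window A[n+1:n-3]
    else:
        low, shift = 0, 0                  # window clamped to A[4:0], shift 0
    trunc = (bits >> low) & 0x1F
    if trunc & 0x10:                       # reinterpret as a signed 5-bit value
        trunc -= 32
    return trunc, shift

t, s = truncate_8bit(-23)                  # -23 = 1110_1001
print(format(t & 0x1F, "05b"), s)          # 10100 1
assert (t << s) == -24                     # -24 approximates -23
```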
S2, perform a 5×5 signed multiplication on the data extended to 5 bits to obtain a partial product result; for the data truncated to 5 bits, sum the calculated compensation shift values to obtain a total compensation shift value N, perform the 5×5 signed multiplication and shift the result left by N bits to obtain the partial product result; then execute step S3.
S3, add and accumulate the partial product results of step S2 stage by stage, extending the bit width by 1 bit at each stage, to obtain the calculation result of the sub-tensor multiplication unit.
Fig. 4 shows the process of calculating an 8×8-bit-wide multiplication in the approximate calculation mode according to embodiment 1 of the invention. The 8-bit multiplier and multiplicand are first truncated to 5 bits by the sign extension module, and the required compensation shift values are calculated at the same time. The truncated 5-bit data are input to a 5×5 signed multiplier of the partial product generation module, and the final product is obtained through the shift compensation circuit of the partial product generation module.
Fig. 5 shows the structure of a single 16-bit adder in the reconfigurable adder tree of the invention. The addition result of the low n bits is generated by OR gates, the addition result of the remaining (16-n) bits is generated by exact full adders, and no carry signal is transferred between the n-th bit and the (n+1)-th bit. As a preferred scheme of this embodiment, n is set to 8, which achieves a good trade-off between accuracy and hardware overhead.
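A behavioral model of such a hybrid adder is short; the sketch below is our illustration and assumes the operands are handled as unsigned 16-bit words inside the adder tree. The low n bits are ORed, the high 16-n bits are added exactly, and no carry crosses the boundary:

```python
def approx_add16(x, y, n=8):
    """Approximate 16-bit addition as in Fig. 5: OR gates for the low n bits,
    exact full adders for the high 16-n bits, no carry across the boundary."""
    low = (x | y) & ((1 << n) - 1)                         # OR gates, no carries
    high = ((x >> n) + (y >> n)) & ((1 << (16 - n)) - 1)   # exact adder on the high part
    return (high << n) | low

a, b = 0x1234, 0x0567
print(hex(approx_add16(a, b)))        # 0x1777 (the exact sum is 0x179b)
print(hex(approx_add16(a, b, n=0)))   # 0x179b: n = 0 degenerates to exact addition
```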
In another aspect, the invention provides a multi-bit-width reconfigurable approximate tensor multiply-add system for neural network computation, comprising: at least one sub-tensor multiplication unit and an accumulation circuit correspondingly connected to the sub-tensor multiplication units, wherein the sub-tensor multiplication units support multiplication operations with bit widths of 8×8, 8×4 and 4×4, and the accumulation circuit accumulates and sums the results of the sub-tensor multiplication units, the sum being the final calculation result of the tensor multiply-add unit;
each sub-tensor multiplication unit comprises a sign extension module, a partial product generation module and a reconfigurable adder tree module;
the sign extension module performs exact or approximate calculation on the input 8-bit and 4-bit data to obtain 5-bit data, enabling hardware reuse, wherein the exact calculation for 8-bit data is: splitting the 8-bit data into two 4-bit parts and extending each part to 5 bits; the approximate calculation for 8-bit data is: truncating the 8-bit data to 5-bit data and calculating a compensation shift value; the exact and approximate calculations for 4-bit data are: extending the 4-bit data with a sign bit at the most significant position to obtain 5-bit data; the 5-bit data are then output to the partial product generation module;
the partial product generation module performs a 5×5 signed multiplication on the data extended to 5 bits to obtain a partial product result; for the data truncated to 5 bits, it sums the calculated compensation shift values to obtain a total compensation shift value N, performs the 5×5 signed multiplication and shifts the result left by N bits to obtain the partial product result, and outputs the partial product results to the reconfigurable adder tree module;
the reconfigurable adder tree module, whose calculation precision can be adjusted dynamically, adds and accumulates the partial product results output by the partial product generation module stage by stage, extending the bit width by 1 bit at each stage, to obtain the calculation result of the sub-tensor multiplication unit.
As a preferred scheme of the multi-bit-width reconfigurable approximate tensor multiply-add system for neural network computation, the number of sub-tensor multiplication units is 4, and the reconfigurable adder tree module comprises a 4-stage adder tree that adds the 16 exact 5×5 products, each addition stage extending the bit width by 1 bit to prevent overflow of the result; the reconfigurable adder tree can dynamically configure the low-order additions as OR gates or exact full adders according to the working mode of the sub-tensor multiplication unit, reducing computation power. The partial product generation module includes 16 5×5 signed multipliers and a shift compensation circuit; the inputs of this module come from the output of the sign extension module. The shift compensation circuit works only in the approximate calculation mode when 8-bit data are input; at other times it exchanges no data with other circuits, reducing switching power. The circuit works as follows: all compensation shift values associated with a given 5×5 signed multiplier are added to obtain a total compensation shift value N, and the result of that 5×5 signed multiplier is shifted left by N bits; this is done in parallel for the 16 5×5 signed multipliers.
Referring to fig. 1, with the method of the invention, the 256-bit multiplier and the 256-bit multiplicand are each divided equally into 4 groups of 64-bit multipliers and multiplicands and input into the 4 sub-tensor multiplication units. Inside each sub-tensor multiplication unit, the 64-bit multiplier and multiplicand are split again according to the configured multiplication bit width, and the sign extension module generates 16 exact 5×5 input values; in the approximate calculation mode, when the multiplication bit width contains 8-bit data, the compensation shift values are calculated as well. The 16 5×5 multipliers in the partial product generation module calculate 16 10-bit results. In the approximate calculation mode, when the multiplication bit width contains 8-bit data, these results are shifted left by the corresponding number of bits by the shift compensation circuit to obtain the 16 final partial products. The 16 final partial products are accumulated in the reconfigurable adder tree module through 4 levels of addition to obtain a partial sum. In the approximate calculation mode, the low-order additions of the reconfigurable adder tree are implemented by OR gates; in the exact calculation mode, they are implemented by full adders. The 4 partial sums of the 4 sub-tensor multiplication units are generated in parallel and sent to the accumulation circuit, which accumulates the four partial sums to obtain the final calculation result of the multi-bit-width reconfigurable approximate tensor multiply-add unit.
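To tie the pieces together, the end-to-end behavioral sketch below (our own model, not the circuit itself) assumes each sub-tensor multiplication unit processes 16 activation/weight pairs in the approximate 8×8 mode, one 5×5 multiplier per pair, and reuses the truncation rule sketched earlier. It truncates, multiplies, applies shift compensation, accumulates through a 4-stage tree, and finally sums the four partial sums in the accumulation circuit:

```python
import random

def truncate_8bit(a):
    """5-bit truncation with compensation shift (same behavioral reading as above)."""
    bits = a & 0xFF
    n = 7
    for i in range(6, -1, -1):
        if ((bits >> i) & 1) != ((bits >> (i + 1)) & 1):
            n = i
            break
    low = 3 if n >= 6 else (n - 3 if n >= 3 else 0)
    shift = n - 3 if n >= 3 else 0
    trunc = (bits >> low) & 0x1F
    return (trunc - 32 if trunc & 0x10 else trunc), shift

def adder_tree_16(partials):
    """4-stage pairwise adder tree over 16 partial products; in hardware each
    stage widens the datapath by 1 bit (modeled here with plain integers)."""
    while len(partials) > 1:
        partials = [partials[i] + partials[i + 1] for i in range(0, len(partials), 2)]
    return partials[0]

def sub_tensor_unit(acts, wgts):
    """One sub-tensor multiplication unit in approximate 8x8 mode, 16 pairs."""
    partials = []
    for a, w in zip(acts, wgts):
        ta, sa = truncate_8bit(a)                 # sign extension module
        tw, sw = truncate_8bit(w)                 # (weights could be pre-coded off-chip)
        partials.append((ta * tw) << (sa + sw))   # 5x5 multiply + shift compensation
    return adder_tree_16(partials)

def tensor_mac(acts, wgts):
    """Four sub-units in parallel; the accumulation circuit sums the partial sums."""
    return sum(sub_tensor_unit(acts[i:i + 16], wgts[i:i + 16]) for i in range(0, 64, 16))

random.seed(0)
acts = [random.randint(-128, 127) for _ in range(64)]
wgts = [random.randint(-128, 127) for _ in range(64)]
print(sum(a * w for a, w in zip(acts, wgts)), tensor_mac(acts, wgts))
# the approximate dot product tracks the exact one
```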
The invention has been applied to a general-purpose neural network accelerator chip containing a total of 4 rows and 16 columns of tensor multiply-add units; data and weights are broadcast to each multiply-add unit, and parallel computation of the units is realized with an output-stationary dataflow. For inference of the AlexNet network, layers 7 and 8 are run in the 4×4 mode, layers 5 and 6 in the 8×4 mode, and the remaining layers 1 to 4 in the approximate 8×8 mode. The final network computation energy efficiency is improved by 39.7%.
While the invention has been described in terms of preferred embodiments, it is not intended to be limited thereto. Those skilled in the art can make various modifications and adaptations without departing from the spirit and scope of the invention. Accordingly, the scope of protection of the invention is defined by the appended claims.

Claims (10)

1. A multi-bit width reconstruction approximate tensor multiplication and addition method for neural network calculation, used for supporting multiplication operations with bit widths of 8×8, 8×4 and 4×4, characterized in that, for input data, the following steps S1 to S3 are executed to obtain the calculation result of a sub-tensor multiplication unit, which is then output to an accumulation circuit and accumulated to obtain the final calculation result of the tensor multiply-add unit,
S1, perform exact calculation or approximate calculation on the input 8-bit data and 4-bit data to obtain 5-bit data, wherein the exact calculation for 8-bit data is: splitting the 8-bit data into two 4-bit parts and extending each part to 5 bits; the approximate calculation for 8-bit data is: truncating the 8-bit data to 5-bit data and calculating a compensation shift value; the exact and approximate calculations for 4-bit data are: extending the 4-bit data with a sign bit at the most significant position to obtain 5-bit data, and then executing step S2;
S2, perform a 5×5 signed multiplication on the data extended to 5 bits to obtain a partial product result; for the data truncated to 5 bits, sum the calculated compensation shift values to obtain a total compensation shift value N, perform the 5×5 signed multiplication and shift the result left by N bits to obtain the partial product result, and then execute step S3;
S3, add and accumulate the partial product results of step S2 stage by stage, extending the bit width by 1 bit at each stage, to obtain the calculation result of the sub-tensor multiplication unit.
2. The multi-bit width reconstruction approximate tensor multiplication and addition method for neural network calculation according to claim 1, wherein in step S1 the exact calculation for 8-bit data is: the 8-bit data is split into a high 4-bit part and a low 4-bit part; the high 4-bit part is extended at the most significant position with the sign bit, and the low 4-bit part is extended at the most significant position with 0, yielding 5-bit data.
3. The multi-bit width reconstruction approximate tensor multiplication and addition method for neural network calculation according to claim 1, wherein in step S1 the approximate calculation for 8-bit data comprises: traversing from bit 7 to bit 0 and finding the first bit that differs from the preceding bit, its position being denoted n; the intercepted data is A[n+1:n-3], where A is the 8-bit data to be approximated and A[x:y] denotes the bit field from bit x down to bit y; when n=7, the intercepted data is A[7:3]; when n is less than or equal to 4, the intercepted data is A[4:0]; the compensation shift value is calculated as follows: when n is greater than or equal to 3, the compensation shift value is n-3; when n<3, the compensation shift value is 0.
4. The multi-bit width reconstruction approximate tensor multiplication and addition method for neural network calculation according to claim 3, wherein step S1 further comprises: when the approximate calculation is performed on 8-bit data, if the 8-bit data are an activation value and a weight in the neural network, pre-coding of the weight is performed first: the [6:2] bits of the pre-coded weight are taken as the 5-bit truncated value of the weight, and the decimal number represented by the [1:0] bits of the pre-coded weight is taken as the weight compensation shift value; the compensation shift value of the activation value is then calculated.
5. The multi-bit width reconstruction approximate tensor multiplication and addition method for neural network calculation according to claim 4, wherein calculating the total compensation shift value N in step S2 further comprises: adding the compensation shift value of the activation value and the compensation shift value of the weight to obtain the total compensation shift value.
6. The multi-bit width reconstruction approximate tensor multiplication and addition method for neural network calculation according to claim 5, wherein in step S2 the 5×5 signed multiplications of the 5-bit data are performed by 16 5×5 multipliers.
7. A multi-bit width reconstruction approximate tensor multiplication and addition system for neural network calculation, characterized by comprising: at least one sub-tensor multiplication unit and an accumulation circuit correspondingly connected to the sub-tensor multiplication units, wherein the sub-tensor multiplication units are used for supporting multiplication operations with bit widths of 8×8, 8×4 and 4×4, and the accumulation circuit is used for accumulating and summing the results of the sub-tensor multiplication units, the sum being the final calculation result of the tensor multiply-add unit;
the sub-tensor multiplication unit comprises: a sign extension module, a partial product generation module and a reconfigurable adder tree module;
the sign extension module is used for performing exact calculation or approximate calculation on the input 8-bit data and 4-bit data to obtain 5-bit data, wherein the exact calculation for 8-bit data is: splitting the 8-bit data into two 4-bit parts and extending each part to 5 bits; the approximate calculation for 8-bit data is: truncating the 8-bit data to 5-bit data and calculating a compensation shift value; the exact and approximate calculations for 4-bit data are: extending the 4-bit data with a sign bit at the most significant position to obtain 5-bit data, which is then output to the partial product generation module;
the partial product generation module is used for performing a 5×5 signed multiplication on the data extended to 5 bits to obtain a partial product result, summing the calculated compensation shift values of the data truncated to 5 bits to obtain a total compensation shift value N, performing the 5×5 signed multiplication and shifting the result left by N bits to obtain the partial product result, and outputting the partial product results to the reconfigurable adder tree module;
and the reconfigurable adder tree module is used for adding and accumulating the partial product results output by the partial product generation module stage by stage, extending the bit width by 1 bit at each stage, to obtain the calculation result of the sub-tensor multiplication unit.
8. The multi-bit width reconstruction approximate tensor multiplication and addition system for neural network calculation according to claim 7, wherein the number of sub-tensor multiplication units is 4.
9. The multi-bit width reconstruction approximate tensor multiplication and addition system for neural network calculation according to claim 8, wherein the reconfigurable adder tree module comprises a 4-stage adder for accumulating the results output by the sub-tensor multiplication unit.
10. The multi-bit width reconstruction approximate tensor multiplication and addition system for neural network calculation according to claim 7, wherein the partial product generation module includes 16 5×5 signed multipliers and a shift compensation circuit.
CN202311453997.1A 2023-11-03 2023-11-03 Multi-bit wide reconstruction approximate tensor multiplication and addition method and system for neural network calculation Active CN117170623B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311453997.1A CN117170623B (en) 2023-11-03 2023-11-03 Multi-bit wide reconstruction approximate tensor multiplication and addition method and system for neural network calculation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311453997.1A CN117170623B (en) 2023-11-03 2023-11-03 Multi-bit wide reconstruction approximate tensor multiplication and addition method and system for neural network calculation

Publications (2)

Publication Number Publication Date
CN117170623A true CN117170623A (en) 2023-12-05
CN117170623B CN117170623B (en) 2024-01-30

Family

ID=88939937

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311453997.1A Active CN117170623B (en) 2023-11-03 2023-11-03 Multi-bit wide reconstruction approximate tensor multiplication and addition method and system for neural network calculation

Country Status (1)

Country Link
CN (1) CN117170623B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11531869B1 (en) * 2019-03-28 2022-12-20 Xilinx, Inc. Neural-network pooling
CN112416297A (en) * 2019-08-23 2021-02-26 辉达公司 Neural network accelerator based on logarithm algorithm
CN114730374A (en) * 2020-03-25 2022-07-08 西部数据技术公司 Flexible accelerator for sparse tensors in convolutional neural networks
CN112732224A (en) * 2021-01-12 2021-04-30 东南大学 Reconfigurable approximate tensor multiplication and addition unit and method for convolutional neural network
CN116543808A (en) * 2022-01-25 2023-08-04 上海交通大学 All-digital domain in-memory approximate calculation circuit based on SRAM unit
CN114647399A (en) * 2022-05-19 2022-06-21 南京航空航天大学 Low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device
CN115033204A (en) * 2022-05-23 2022-09-09 东南大学 High-energy-efficiency approximate multiplier with reconfigurable precision and bit width
CN115145536A (en) * 2022-06-29 2022-10-04 浙江大学 Adder tree unit with low bit width input and low bit width output and approximate multiply-add method
CN115840556A (en) * 2022-10-18 2023-03-24 东南大学 2 groups of signed tensor calculation circuit structure based on 6-bit approximate full adder
CN115982528A (en) * 2022-11-25 2023-04-18 上海交通大学 Booth algorithm-based approximate precoding convolution operation method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
梁泰琳: "Design of a Reduced Instruction Set Tensor Processor for Neural Network Applications", China Doctoral Dissertations Full-text Database (Information Science and Technology), pages 137-1 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117910421A (en) * 2024-03-15 2024-04-19 南京美辰微电子有限公司 Dynamic approximate circuit calculation deployment method and system based on neural network

Also Published As

Publication number Publication date
CN117170623B (en) 2024-01-30

Similar Documents

Publication Publication Date Title
CN106909970B (en) Approximate calculation-based binary weight convolution neural network hardware accelerator calculation device
CN110780845A (en) Configurable approximate multiplier for quantization convolutional neural network and implementation method thereof
US20210349692A1 (en) Multiplier and multiplication method
CN117170623B (en) Multi-bit wide reconstruction approximate tensor multiplication and addition method and system for neural network calculation
CN111832719A (en) Fixed point quantization convolution neural network accelerator calculation circuit
US6601077B1 (en) DSP unit for multi-level global accumulation
KR101603471B1 (en) System and method for signal processing in digital signal processors
Liu et al. A precision-scalable energy-efficient convolutional neural network accelerator
US20220283777A1 (en) Signed multiword multiplier
CN115982528A (en) Booth algorithm-based approximate precoding convolution operation method and system
CN114115803A (en) Approximate floating-point multiplier based on partial product probability analysis
CN111966327A (en) Mixed precision space-time multiplexing multiplier based on NAS (network attached storage) search and control method thereof
CN116205244A (en) Digital signal processing structure
US20220075598A1 (en) Systems and Methods for Numerical Precision in Digital Multiplier Circuitry
Baba et al. Design and implementation of advanced modified booth encoding multiplier
CN113672196B (en) Double multiplication calculating device and method based on single digital signal processing unit
CN115357214A (en) Operation unit compatible with asymmetric multi-precision mixed multiply-accumulate operation
CN113283591B (en) Efficient convolution implementation method and device based on Winograd algorithm and approximate multiplier
KR20230121151A (en) Numerical precision of digital multiplier networks
CN114691086A (en) High-performance approximate multiplier based on operand clipping and calculation method thereof
Saggese et al. CFPM: Run-time Configurable Floating-Point Multiplier
US20200125329A1 (en) Rank-based dot product circuitry
Spagnolo et al. Efficient implementation of signed multipliers on FPGAs
CN211577939U (en) Special calculation array for neural network
CN111401533B (en) Special computing array for neural network and computing method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant