CN114647399B - Low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device - Google Patents
- Publication number
- CN114647399B (application CN202210541757.6A)
- Authority
- CN
- China
- Prior art keywords
- partial product
- bit
- approximate
- adder
- output
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/544—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
- G06F7/5443—Sum of products
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a low-energy-consumption, high-precision approximate parallel fixed-width multiply-accumulate device comprising an input truncation compensation circuit, a radix-8 Booth encoder and decoder circuit, a first-stage partial product compression circuit, a second-stage partial product compression circuit and a carry look-ahead adder circuit. In the first-stage partial product compression circuit, each Wallace tree truncates a number of low-order bits that depends on its weight, compresses the next-lowest 2 bits with approximate 4_2 compressors, and compresses the remaining high-order bits with exact compressors. The second-stage partial product compression circuit uses exact compressors and includes a probabilistic constant compensation section that compensates, respectively, for the first-stage partial product truncation, for the errors introduced by the approximate 4_2 compressors, and for the second-stage partial product truncation. By using truncation and approximation the invention reduces power consumption and hardware cost, and by applying probabilistic constant compensation to the resulting errors it maintains high precision.
Description
Technical Field
The invention relates to the technical field of approximate arithmetic operation circuit design, in particular to a low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device.
Background
Since 2007, semiconductor scaling laws such as Moore's law and Dennard scaling have gradually broken down, and it has become very difficult to keep improving chip performance at constant power consumption. At the same time, big data processing and artificial intelligence are becoming ever more important; these applications require massive data and complex computation, which places higher demands on energy-efficient, high-performance general-purpose computing engines and application-specific integrated circuits. Applications such as pattern recognition, video processing and data mining are inherently error-tolerant. For such applications, approximate computing introduces computational precision into the design space as a new dimension, reducing hardware overhead and power consumption while still meeting application requirements, and it has been adopted as a new energy-efficient design methodology to alleviate these problems.
As an important computing unit of digital signal processors, the multiply-accumulate unit is widely used in applications such as convolutional neural networks. Serial multiply-accumulate units are favored for their small hardware overhead, but they cannot satisfy applications with strict latency requirements. Parallel multiply-accumulate units can meet such requirements, but they have been studied less, and because they can be implemented simply by replicating a single multiplier and adder, their hardware overhead is excessive. The paper "A High-Performance and Energy-Efficient FIR Adaptive Filter Using Approximate Distributed Arithmetic Circuits", published in IEEE Transactions on Circuits and Systems, discloses a design method for an adaptive filter based on distributed arithmetic in which the error calculation module follows the same design idea as a parallel multiply-accumulate unit; however, its approximation techniques are coarse and do not achieve an effective balance between precision and hardware overhead.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a low-energy-consumption, high-precision approximate parallel fixed-width multiply-accumulate device that refines the approximation and truncation strategies of the original design, reducing power consumption and hardware overhead while maintaining high precision.
In order to achieve the purpose, the invention adopts the following technical scheme:
a low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device comprises an input truncation compensation circuit, a radix-8 Booth encoder and decoder circuit, a primary partial product compression circuit, a secondary partial product compression circuit and a carry look-ahead adder circuit;
the input truncation compensation circuit receives two groups of input data of length […], each element being […], and processes each element as follows: the low k bits are truncated, where k is determined according to the specific precision requirement and lies in the range […]; a 1 is appended at bit […]; and the resulting […]-bit value is output to the radix-8 Booth encoder and decoder circuit;
the radix-8 Booth encoder and decoder circuit comprises N groups of radix-8 Booth encoders, approximate decoding adders and a conventional decoder; the output of the input truncation compensation circuit for one group of inputs is divided into groups of three bits and encoded by the radix-8 Booth encoder, and the encoding result is output to the conventional decoder; the approximate decoding adder operates on the output of the input truncation compensation circuit for the other group of inputs; the conventional decoder processes the results of the radix-8 Booth encoder and the approximate decoding adder to generate partial products, which it outputs to the first-stage partial product compression circuit;
the first-stage partial product compression circuit comprises […] first-stage Wallace trees, each of size […] and each a regular rectangle; each first-stage Wallace tree is divided into three sections for approximate processing: a first-stage Wallace tree of weight […] truncates its low […] bits, compresses the next-lowest 2 bits with approximate 4_2 compressors, and compresses the remaining high-order bits into two rows with exact adders; the compression results of the exact adders of all first-stage Wallace trees are output to the second-stage partial product compression circuit;
the second-stage partial product compression circuit comprises a second-stage Wallace tree and a probabilistic constant compensation module; the probabilistic constant compensation module compensates for the first-stage partial product truncation, for the errors introduced by the approximate 4_2 compressors, and for the second-stage partial product truncation, yielding the constant compensation data for truncation and approximation; the second-stage Wallace tree compresses the received input data and the constant compensation data into two rows using exact adders, and outputs […] bits to the carry look-ahead adder circuit;
the carry look-ahead adder circuit adds the two output rows of the second-stage partial product compression circuit and retains […] to produce the final output of the multiply-accumulate device.
In order to optimize the technical scheme, the specific measures adopted further comprise:
further, the radix-8 Booth encoder includes five output signals, using,,,Respectively representing the most significant bit, the second least significant bit and the least significant bit of the input signal, respectivelyComprises the following steps:;
the traditional decoder only generates partial products which are accurately compressed and approximately compressed in a one-stage partial product compression circuit, and the expression is as follows:whereinTo approximately decode the adder inputThe number of bits is,to approximately decode the adder outputA bit.
Further, the approximate decoding adder is low on input dataThe bits are accumulated approximately by one set of two bits, p is rootNon-negative integers determined according to the precision requirement are represented by the formula:,,whereinRepresents the second of input yThe number of bits is,in order to input the carry bit, the carry bit is input,in order to output the carry bit,to the final sumA bit; to input dataAndthe low-order additional error recovery circuit has the formula:,,whereinTo the final sumThe error of the bits is recovered as a signal,andrespectively after recoveryThe final sum of the bits and the output carry; and accumulating the high order of the input data by using a ripple carry adder.
Further, the approximate 4_2 compressor formula used in the one-stage partial product compression circuit is:,whereinFour inputs for the ith column of the Wallace tree; the precise adder comprises a precise full adder and a precise half adder; the sign compensation bit of each row partial product is not included in the Wallace tree, and the error is reduced by a constant compensation method.
Further, when the input signs are determined, the second-stage partial product compression circuit handles the sign bits uniformly: with only the value bits retained, "111" is prepended to the high-order end of either output of the lowest-order compression tree of the first-stage partial product compressor, and "110" is prepended to the high-order end of either output of the second-lowest-order and second-highest-order compression trees.
Further, the process of compensating the first-stage partial product truncation includes:
assuming that the inputs are uniformly distributed:
whereinIs the mth bit of the input signal x. The probability of each bit after truncation compensation is:
performing radix-8 Booth encoding on one group of operands, and filling 0 at the lowest bit according to an encoding rule, wherein the encoding result probability is as follows:
whereinIs the result of Booth encodingA bit. When the Booth code value isThe output of the approximate decoding adder is needed in the decoding processTherefore, it is necessary to calculateThe probability of each bit is obtained according to the characteristics of the approximate decoding adder:
whereinIs composed ofTo (1) aA bit. The expectation of calculating the partial product is as follows:
where the index n denotes the result of the operation of the nth element of the two sets of input vectors, the index i or j denotes the ith or jth binary bit of the number,is weighted asOf the compression treeGo to the firstColumn partial product; the expectation of the sign-corrected bit is constantly 0.5,。
further, the compensating for the use of the approximate 4_2 compressor includes:
let δ denote the error between the actual output and the exact output; the expected error is calculated as […], where […] denotes the probability that input pattern […] occurs and […] denotes the corresponding error; the specific compensation value is the sum of the expected values of the individual errors.
Further, the constant compensation for the second-stage partial product truncation is derived from the expected output values of the approximate 4_2 compressors: […].
Further, the input of the carry look-ahead adder is […] bits wide and is divided into four groups; a ripple-carry adder is used within each group, and carry look-ahead is used between the groups.
Further, the first-stage partial product compression circuit and the second-stage partial product compression circuit adopt a sign extension elimination method. This method exploits the fact that a binary digit is either 0 or 1 to handle the sign bits at the input of each partial product compression circuit uniformly: the negative contribution is moved into the highest-order position, so that all subsequent compression operates on positive values only.
The invention has the beneficial effects that:
The low-energy-consumption, high-precision approximate parallel fixed-width multiply-accumulate device adopts distributed arithmetic to increase the parallelism of the multiply-accumulate unit and thereby improve circuit performance, uses truncation and approximation to reduce circuit complexity and power consumption, and uses a constant compensation strategy to recover precision at minimal hardware cost. Truncating partial products not only saves compressors and shortens the critical path of the carry look-ahead adder, but also eliminates the conventional Booth decoders that would have generated the truncated partial products, which greatly reduces hardware overhead.
Drawings
FIG. 1 is a schematic diagram of the low-power consumption high-precision approximate parallel fixed-width multiply-accumulate device of the present invention.
FIG. 2a is a schematic structural diagram of first-stage Wallace compression tree-0, taking […], […] as an example.
FIG. 2b is a schematic structural diagram of first-stage Wallace compression tree-1, taking […], […] as an example.
FIG. 2c is a schematic structural diagram of first-stage Wallace compression tree-2, taking […], […] as an example.
FIG. 2d is a schematic structural diagram of first-stage Wallace compression tree-3, taking […], […] as an example.
Detailed Description
The present invention will now be described in further detail with reference to the accompanying drawings.
It should be noted that terms such as "upper", "lower", "left", "right", "front" and "back" are used in the present invention only for clarity of description and are not intended to limit the implementable scope of the invention; changes or adjustments of the relative relationships they describe, without substantive change to the technical content, are also regarded as falling within the implementable scope of the invention.
FIG. 1 is a schematic diagram of the low-power consumption high-precision approximate parallel fixed-width multiply-accumulate device of the present invention. Referring to fig. 1, the multiply-accumulate apparatus includes an input truncation compensation circuit, a radix-8 booth encoder and decoder circuit, a first-stage partial product compression circuit, a second-stage partial product compression circuit, and a carry-look-ahead adder circuit.
The input truncation compensation circuit takes the two groups of inputs of the parallel multiply-accumulate unit, each of length […] with each element being […], truncates the low […] bits of each input, appends a 1 at bit […], and outputs the resulting […]-bit value to the radix-8 Booth encoder and decoder circuit.
The radix-8 Booth encoder and decoder circuit divides one group of inputs into groups of three bits for encoding and feeds the other group of inputs to the approximate decoding adder; the encoding result and the generated 3x are output to the conventional decoder, which generates the partial products and outputs them to the first-stage partial product compression circuit.
The first-stage partial product compression circuit comprises […] Wallace trees, each of size […] and each a regular rectangle. Each first-stage Wallace tree is divided into three sections for approximate processing: a Wallace tree of weight […] truncates its low […] bits, compresses the next-lowest 2 bits with approximate 4_2 compressors, and compresses the remaining high-order bits into two rows with exact adders. Only the compression results of the exact adders of all Wallace trees are output to the second-stage partial product compression circuit.
The second-stage partial product compression circuit comprises one Wallace tree, which uses exact adders to compress its inputs, together with the constant compensation terms for truncation and approximation, into two rows, and outputs […] bits to the carry look-ahead adder circuit.
The carry look-ahead adder circuit adds the two output rows of the second-stage partial product compression circuit and retains […] to produce the final result of the multiply-accumulate unit.
(I) Radix-8 Booth encoder and decoder circuit
The radix-8 booth encoder and decoder circuitry includes a radix-8 booth encoder, an approximate decoding adder, and a legacy decoder.
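As a value-level illustration of the three-bit grouping performed by the encoder, the following Python sketch recodes a two's-complement operand into radix-8 Booth digits. It uses the textbook recoding rule (each digit is formed from three new bits plus one overlap bit, with a 0 appended below the least significant bit). The function name `booth_radix8_digits` and the 12-bit default width are assumptions made for illustration, and the sketch does not reproduce the five gate-level encoder output signals of the invention.

```python
def booth_radix8_digits(y, width=12):
    """Recode a signed `width`-bit integer into radix-8 Booth digits.

    Returns digits d[0 .. width/3 - 1], each in -4..4, such that
    y == sum(d[i] * 8**i). `width` must be a multiple of 3.
    """
    if width % 3 != 0:
        raise ValueError("width must be a multiple of 3")
    if not -(1 << (width - 1)) <= y < (1 << (width - 1)):
        raise ValueError("y does not fit in the given width")

    # Two's-complement bit pattern, shifted left by one so that position 0
    # holds the appended 0 and position j + 1 holds bit j of y.
    bits = (y & ((1 << width) - 1)) << 1

    digits = []
    for i in range(width // 3):
        group = (bits >> (3 * i)) & 0b1111  # y[3i+2], y[3i+1], y[3i], y[3i-1]
        b_m1 = group & 1          # overlap bit shared with the previous group
        b0 = (group >> 1) & 1
        b1 = (group >> 2) & 1
        b2 = (group >> 3) & 1     # carries the negative weight
        digits.append(-4 * b2 + 2 * b1 + b0 + b_m1)
    return digits
```

For example, `booth_radix8_digits(63)` returns `[-1, 0, 1, 0]`, since 63 = -1 + 0·8 + 1·64. Digits of magnitude 3 are the ones that require the 3x multiple produced by the decoding adder.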
The decoder consists of an approximate decoding adder, which generates 3x, and a conventional decoder. The conventional decoder generates only those partial products that are exactly or approximately compressed in the first-stage partial product compression circuit, according to the expression: […].
The approximate decoding adder accumulates the low […] bits approximately in groups of two bits, according to the formulas: […]. In addition, an error recovery circuit is added for the […] and […] low-order bits, according to the formulas: […]. The high-order bits are accumulated with a ripple-carry adder.
(II) First-stage partial product compression circuit
The approximate 4_2 compressors used in the first-stage partial product compression circuit are defined by the formulas: […]. The exact adders comprise exact full adders and exact half adders. The sign compensation bit of each partial product row is not included in the Wallace tree; the resulting error is reduced by constant compensation.
(III) Second-stage partial product compression circuit
Constant probability compensation in a two-level partial product compression circuit includes three parts, namely compensation for truncation of the first-level partial product, compensation for errors generated by using an approximate 4_2 compressor, and compensation for truncation of the second-level partial product.
(1) First part: compensation for the first-stage partial product truncation.
Assuming that the inputs are uniformly distributed, i.e. (here the subscripts are omitted, since the probabilities of all inputs are the same):
the probability of each bit after the truncation compensation is (for convenience all inputs are divided by):
One group of operands is subjected to radix-8 Booth encoding, and 0 is required to be complemented at the lowest bit according to an encoding rule. Since the lower two bits of the lowest bit encoding input are always (1,0), the probability of the encoding result is different from that of other bits and needs to be considered separately. The encoding result probabilities are as follows (positive and negative probabilities are the same):
To obtain the probabilities of the partial products, the probability of each bit of 3x must also be calculated; from the properties of the approximate decoding adder these are:
the expectation of the partial product can be calculated from the above equation as follows:
whereinIs weighted asOf the compression treeLine, firstColumn partial product; wherein the expectation of the sign correction bit is always 0.5, i.e.。
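The Booth digit probabilities used in this first part can be cross-checked by brute-force enumeration. The sketch below is an illustration under stated assumptions: input bits are independent and uniformly distributed, the textbook radix-8 Booth rule (digit = -4·b2 + 2·b1 + b0 + b_m1) is used, and the function name `digit_distribution` is hypothetical. For the lowest digit, the two low input bits are fixed to (1, 0), as stated above.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def digit_distribution(lowest=False):
    """Probability of each radix-8 Booth digit under uniform input bits.

    lowest=True fixes the two low bits of the encoder input to (1, 0):
    the compensation bit appended after truncation and the padded 0.
    """
    low_choices = [(1, 0)] if lowest else list(product((0, 1), repeat=2))
    counts = Counter()
    for b2, b1 in product((0, 1), repeat=2):
        for b0, b_m1 in low_choices:
            counts[-4 * b2 + 2 * b1 + b0 + b_m1] += 1
    total = sum(counts.values())
    return {d: Fraction(c, total) for d, c in sorted(counts.items())}
```

Under these assumptions an interior digit takes each of the values 0, ±1, ±2 and ±3 with probability 1/8 and each of ±4 with probability 1/16, while the lowest digit takes each of ±1 and ±3 with probability 1/4. In both cases positive and negative digits of equal magnitude are equally likely.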
(2) Second part: compensation for the use of approximate 4_2 compressors.
Let […] denote the different input patterns, with […]; let […] denote the error, with […]; and let […] denote the probability that […] occurs. The expected error is then […]. The specific compensation value is the sum of the expected values of the individual errors.
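The expected-error calculation described here (probability of each input pattern multiplied by its error, summed over all patterns) can be sketched as follows. The approximate compressor in this sketch is a hypothetical stand-in chosen only for illustration; it is not the compressor defined by the invention. The names `approx_compressor` and `expected_error` and the assumption of independent inputs are likewise illustrative.

```python
from itertools import product


def exact_value(a, b, c, d):
    """Exact sum of the four input bits of one column."""
    return a + b + c + d


def approx_compressor(a, b, c, d):
    """Hypothetical approximate 4_2 compressor (NOT the patented design)."""
    carry = (a & b) | (c & d)
    total = (a ^ b) | (c ^ d)
    return 2 * carry + total


def expected_error(p_one):
    """Expected value of (approximate - exact) for one compressor.

    p_one[i] is the probability that input i equals 1; the inputs are
    assumed to be independent.
    """
    expectation = 0.0
    for pattern in product((0, 1), repeat=4):
        prob = 1.0
        for bit, p in zip(pattern, p_one):
            prob *= p if bit else 1.0 - p
        delta = approx_compressor(*pattern) - exact_value(*pattern)
        expectation += prob * delta
    return expectation
```

With all four inputs at probability 0.5, this stand-in compressor has an expected error of -0.375 in units of its column weight, so a constant of +0.375 times that weight would be accumulated per compressor. In the device, the per-bit probabilities would instead come from the partial product expectations of the first part.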
(3) Third part: compensation for the second-stage partial product truncation.
This error arises because the second-stage partial product compression circuit truncates the outputs of the approximate 4_2 compressors; constant compensation is applied according to the expected values of those outputs: […].
(IV) Carry look-ahead adder
The input of the carry look-ahead adder is […] bits wide and is divided into four groups; a ripple-carry adder is used within each group, and carry look-ahead is used between the groups.
The multiply-accumulate device of the embodiment of the invention is further described below with reference to the drawings, taking […], […] as an example.
The low-power approximate parallel fixed-width multiply-accumulate unit with […], […] comprises an input truncation compensation circuit, a radix-8 Booth encoder and decoder circuit, a first-stage partial product compression circuit, a second-stage partial product compression circuit and a carry look-ahead adder circuit.
The input truncation compensation circuit truncates the lower 5 bits of the two groups of inputs of length 16 of the parallel multiply-accumulate unit, appends a 1 at bit 4, and outputs the resulting 12-bit value to the radix-8 Booth encoder and decoder circuit. Only one group of inputs undergoes radix-8 Booth encoding, in groups of three bits; the other group of inputs is simultaneously fed to the approximate decoding adder to compute 3x. These are then fed to the conventional decoder, which generates only the partial products that are compressed by the first-stage partial product compression circuit, and the results are routed, according to their encoding weight, to the first-stage Wallace compression tree of the corresponding weight, as shown in FIG. 1. In FIG. 1, first-stage Wallace tree-0 denotes the lowest-order Wallace tree in the first-stage partial product compression circuit, first-stage Wallace tree-1 denotes the second-lowest-order Wallace tree, and so on. The first-stage partial product compression circuit has 4 Wallace trees, each a 14 × 8 rectangle, as shown in FIGS. 2a to 2d. The results of the first-stage partial product compression circuit are shifted according to their weights and input to the second-stage partial product compression circuit, where they are compressed into two rows by exact adders; finally, the lower 16 bits are fed to the carry look-ahead adder, and the lower 16 bits are taken as the final fixed-width output.
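The input truncation step of this embodiment can be modeled at the value level as below. This is a minimal sketch, assuming the operands are 16-bit two's-complement integers and that the appended 1 occupies the position of the highest truncated bit (bit 4), which is what makes the result 12 bits wide. The function names `truncate_with_compensation` and `reconstructed_value` are hypothetical.

```python
def truncate_with_compensation(x, k=5):
    """Drop the low k bits of a signed integer and append a constant 1.

    The returned value is expressed in units of 2**(k - 1): its least
    significant bit is the compensation bit, and the remaining bits are
    the retained high-order bits of x.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return ((x >> k) << 1) | 1  # Python's >> is an arithmetic shift


def reconstructed_value(t, k=5):
    """Value that the truncated operand represents at the original scale."""
    return t << (k - 1)
```

For example, `truncate_with_compensation(1000)` returns 63, which represents 1008 at the original scale. Replacing the 5 discarded bits by the constant 10000 places the represented value at the midpoint of the discarded range, so the absolute error never exceeds 16.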
The radix-8 booth encoder has five output signals, the expression is:
The decoder consists of an approximate decoding adder, which generates 3x, and a conventional decoder. The conventional decoder generates only those partial products that are exactly or approximately compressed in the first-stage partial product compression circuit, according to the expression: […].
The approximate decoding adder accumulates the lower 4 bits approximately in groups of two bits, according to the formulas: […]. In addition, an error recovery circuit is added for the lowest 3 bits and the lowest 4 bits, according to the formulas: […]. The high-order bits are accumulated with a ripple-carry adder.
As shown in FIGS. 2a to 2d, sign extension processing is applied to all four Wallace trees of the first-stage partial product compression circuit, so that the signs of the two outputs are finally fixed (positive and negative, respectively). The lowest-order compression tree truncates its lower 8 bits, the second-lowest-order tree its lower 5 bits, the second-highest-order tree its lower 2 bits, and the highest-order tree truncates nothing. In every tree except the highest-order one, the 2 bits immediately above the truncated bits use approximate 4_2 compressors; in the highest-order tree, the lowest 1 bit uses an approximate 4_2 compressor. The remaining high-order bits are compressed into two rows by exact adders, and only the compression results of the exact adders of all compression trees are input to the second-stage partial product compression circuit.
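The per-tree numbers in this embodiment are consistent with cutting all four trees at one common absolute column, given that tree j carries weight 8^j and is therefore offset by 3j columns. The sketch below reproduces the numbers under that reading. The common cut column (8), the function name `tree_plan` and its parameters are assumptions inferred from the embodiment, not a formula quoted from the source.

```python
def tree_plan(j, cut=8, approx_cols=2, digit_bits=3, width=14):
    """Split the columns of first-stage tree j into three sections.

    Returns (truncated, approximate, exact) column counts, assuming every
    tree is cut at the same absolute column `cut` and the next
    `approx_cols` absolute columns use approximate compressors.
    """
    offset = digit_bits * j                       # absolute position of the tree's column 0
    truncated = min(width, max(0, cut - offset))  # columns below the cut
    approx_lo = max(cut, offset)
    approx_hi = min(cut + approx_cols, offset + width)
    approximate = max(0, approx_hi - approx_lo)
    exact = width - truncated - approximate
    return truncated, approximate, exact
```

Evaluating `[tree_plan(j) for j in range(4)]` gives `[(8, 2, 4), (5, 2, 7), (2, 2, 10), (0, 1, 13)]`, which matches the truncation of 8, 5, 2 and 0 bits and the single approximate column of the highest-order tree.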
The approximate 4_2 compressors used in the first-stage partial product compression circuit are defined by the formulas: […]. The exact adders comprise exact full adders and exact half adders. The sign compensation bit of each partial product row is not added into the compression tree; the resulting error is reduced by constant compensation.
As a further optimization scheme of this embodiment, the two-stage partial product compression circuit performs unified processing on the sign bits under the condition that the input sign is determined: on the premise that only the numerical value bit is reserved, the high order of any output of the compression tree at the lowest order of the first-order partial product compressor is added with '111', and the high order of any output of the compression tree at the second lowest order and the second highest order is added with '110', as shown in fig. 3.
As a further optimization scheme of this embodiment, the two-stage partial product compressor includes a constant compensation part for truncation and approximation, and the constant compensation is derived from theoretical probability: the 2 nd and 4 th bits of the two-level partial product compression tree are complemented by 1, as shown in FIG. 3. The final 16-bit result is fed into the carry look ahead adder.
The input of the carry look-ahead adder is 16 bits wide and is divided into four-bit groups; a ripple-carry adder is used within each group and carry look-ahead is used between the groups. The lower 16 bits are taken as the fixed-width result of the final approximate multiply-accumulate unit.
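The grouped adder structure can be illustrated with a bit-level Python model: ripple carry inside each 4-bit group, and group generate/propagate signals combined by look-ahead between groups. This is a behavioral sketch only; the function names are hypothetical, and the look-ahead recurrence is evaluated in a loop here, whereas hardware would expand it into parallel logic.

```python
def ripple_add(a, b, carry_in, width=4):
    """Ripple-carry addition of two `width`-bit values."""
    total, carry = 0, carry_in
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        total |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | ((ai ^ bi) & carry)
    return total, carry


def group_generate_propagate(a, b, width=4):
    """Group generate G and propagate P signals of one group."""
    generate, propagate = 0, 1
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        generate = (ai & bi) | ((ai ^ bi) & generate)
        propagate &= ai ^ bi
    return generate, propagate


def grouped_cla_add(a, b, carry_in=0, groups=4, group_width=4):
    """Add two (groups * group_width)-bit values; the carry-out is dropped."""
    mask = (1 << group_width) - 1
    slices = [((a >> (g * group_width)) & mask, (b >> (g * group_width)) & mask)
              for g in range(groups)]

    # Look-ahead between groups: c[g + 1] = G[g] | (P[g] & c[g]).
    carries = [carry_in]
    for ag, bg in slices:
        gen, prop = group_generate_propagate(ag, bg, group_width)
        carries.append(gen | (prop & carries[-1]))

    # Ripple carry inside each group, seeded with the look-ahead carry.
    result = 0
    for g, (ag, bg) in enumerate(slices):
        group_sum, _ = ripple_add(ag, bg, carries[g], group_width)
        result |= group_sum << (g * group_width)
    return result
```

With the default parameters, `grouped_cla_add(a, b)` equals `(a + b) & 0xFFFF` for any 16-bit `a` and `b`. The design choice is a trade-off: ripple carry keeps each group small, while look-ahead between groups limits the length of the carry chain.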
Compared with the original design, the improved design reduces the power-delay product by 25%; compared with the full-precision version, it reduces the power-delay product by 80%; and compared with the version without compensation, it reduces the mean error distance by 58%.
While preferred embodiments of the embodiments of this specification have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the true scope of the embodiments of the specification.
It will be apparent to those skilled in the art that various changes and modifications may be made in the embodiments described herein without departing from the spirit and scope of the embodiments described herein. Thus, if such modifications and variations of the embodiments of the present specification fall within the scope of the claims of the embodiments of the present specification and their equivalents, the embodiments of the present specification are intended to include such modifications and variations.
Claims (9)
1. A low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device is characterized by comprising an input truncation compensation circuit, a radix-8 Booth encoder and decoder circuit, a primary partial product compression circuit, a secondary partial product compression circuit and a carry look-ahead adder circuit;
the input truncation compensation circuit receives two groups of input data of length […], each element being […], and processes each element as follows: the low k bits are truncated, where k is determined according to the specific precision requirement and lies in the range […]; a 1 is appended at bit […]; and the resulting […]-bit value is output to the radix-8 Booth encoder and decoder circuit;
the radix-8 Booth encoder and decoder circuit comprises N groups of radix-8 Booth encoders, approximate decoding adders and a conventional decoder; the output of the input truncation compensation circuit for one group of inputs is divided into groups of three bits and encoded by the radix-8 Booth encoder, and the encoding result is output to the conventional decoder; the approximate decoding adder operates on the output of the input truncation compensation circuit for the other group of inputs; the conventional decoder processes the results of the radix-8 Booth encoder and the approximate decoding adder to generate partial products, which it outputs to the first-stage partial product compression circuit;
the first-stage partial product compression circuit comprises […] first-stage Wallace trees, each of size […] and each a regular rectangle; each first-stage Wallace tree is divided into three sections for approximate processing: a first-stage Wallace tree of weight […] truncates its low […] bits, compresses the next-lowest 2 bits with approximate 4_2 compressors, and compresses the remaining high-order bits into two rows with exact adders; the compression results of the exact adders of all first-stage Wallace trees are output to the second-stage partial product compression circuit;
the second-stage partial product compression circuit comprises a second-stage Wallace tree and a probabilistic constant compensation module; the probabilistic constant compensation module compensates for the first-stage partial product truncation, for the errors introduced by the approximate 4_2 compressors, and for the second-stage partial product truncation, yielding the constant compensation data for truncation and approximation; the second-stage Wallace tree compresses the received input data and the constant compensation data into two rows using exact adders, and outputs […] bits to the carry look-ahead adder circuit;
2. The low energy consumption high precision approximate parallel fixed width multiply accumulate device of claim 1, wherein said radix-8 booth encoder includes five output signals, using,,,Respectively representing the most significant bit, the second least significant bit and the least significant bit of the input signal, and the expressions are respectively:
the traditional decoder only generates partial products which are accurately compressed and approximately compressed in a one-stage partial product compression circuit, and the expression is as follows:in whichTo approximately decode the adder inputThe number of bits is,for approximately decoding the output of the adderA bit.
3. The low energy consumption high precision approximate parallel fixed width multiply accumulate device of claim 1, wherein the approximate decoding adder is low for input dataThe bits are approximately accumulated in a group of two bits, p is a non-negative integer determined according to the precision requirement, and the formula is as follows:,,whereinRepresents the first of input yThe number of bits is,in order to input the carry bit, the carry bit is input,in order to output the carry bit,to the final sumA bit; to input dataAndthe low-order additional error recovery circuit has the formula:,,whereinTo the final sumThe error of the bits is recovered as a signal,andrespectively after recoveryThe final sum of the bits and the output carry; and accumulating the high order of the input data by using a ripple carry adder.
4. The low-energy-consumption high-precision approximate parallel fixed-width multiply-accumulate device of claim 3, wherein the output signals of the approximate 4_2 compressor used in the first-stage partial product compression circuit are given by: [formula not preserved], in which the operands are the four inputs of one column of the first-stage Wallace tree; the exact adders comprise exact full adders and exact half adders; the sign compensation bit of each partial product row is not included in the Wallace tree, and the resulting error is reduced by constant compensation.
5. The low-energy-consumption high-precision approximate parallel fixed-width multiply-accumulate device of claim 1, wherein the second-stage partial product compression circuit handles the sign bits uniformly once the input signs are determined: with only the magnitude bits retained, 111 is added to the high-order bits of any one output of the lowest-order compression tree of the first-stage partial product compressor, and 110 is added to the high-order bits of any one output of the second-lowest-order and second-highest-order compression trees.
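Handling sign bits through fixed constants is a standard technique, and a minimal sketch may help show why it works. The example below assumes unshifted rows of a hypothetical width `W_ROW` accumulated into `W_SUM` bits; it shows that sign-extending every row is equivalent, modulo 2^`W_SUM`, to inverting each row's sign bit and adding one precomputed constant. It does not derive the specific 111 and 110 patterns named in the claim.

```python
W_ROW = 6    # width of each two's-complement row (hypothetical)
W_SUM = 12   # width of the accumulated result (hypothetical)

def sign_extended(row: int) -> int:
    """Bit pattern of a W_ROW-bit two's-complement row sign-extended to W_SUM bits."""
    sign = (row >> (W_ROW - 1)) & 1
    extension = ((1 << W_SUM) - (1 << W_ROW)) if sign else 0
    return row | extension

def sign_flipped(row: int) -> int:
    """The same row with only its sign bit inverted and no extension bits."""
    return row ^ (1 << (W_ROW - 1))

# Ones in every position from the sign bit up to the top of the accumulator.
ROW_CONSTANT = (1 << W_SUM) - (1 << (W_ROW - 1))

def sum_with_extension(rows):
    """Reference: add the fully sign-extended rows."""
    return sum(sign_extended(r) for r in rows) % (1 << W_SUM)

def sum_with_constant(rows):
    """Add sign-flipped rows plus one constant per row (mergeable into a single constant)."""
    return (sum(sign_flipped(r) for r in rows) + len(rows) * ROW_CONSTANT) % (1 << W_SUM)

if __name__ == "__main__":
    rows = [0b100000, 0b011111, 0b111111, 0b000001]
    print(sum_with_extension(rows) == sum_with_constant(rows))
```

Because the per-row constants do not depend on the data, they can be summed at design time, so the compression tree only ever sees the magnitude bits and a few fixed ones.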
6. The low-energy-consumption high-precision approximate parallel fixed-width multiply-accumulate device of claim 1, wherein the procedure for compensating the first-stage partial product truncation comprises:
assuming that the inputs are uniformly distributed: [formula not preserved]
wherein the symbol denotes the m-th bit of the input signal x; the probability of each bit after the truncation compensation is: [formula not preserved]
performing radix-8 Booth encoding on one group of operands, with a 0 appended below the least significant bit as the encoding rule requires, the probabilities of the encoding results being: [formula not preserved]
wherein the symbol denotes a bit of the Booth encoding result; when the Booth code takes the value that requires it, decoding uses the output of the approximate decoding adder, and the probability of each bit of that output follows from the characteristics of the approximate decoding adder: [formula not preserved]
wherein the symbol denotes a bit of that output; the expectation of the partial product is calculated as follows: [formula not preserved]
where the index n denotes the operation on the n-th elements of the two input vectors, the index i or j denotes the i-th or j-th bit of the number in binary, and the remaining symbol denotes the partial product at a given row and column of the compression tree with a given weight; the expectation of the sign-correction bit is always 0.5.
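The probability analysis in claim 6 starts from the distribution of the Booth encoding results under uniformly distributed inputs. As a rough sketch under the assumptions of textbook radix-8 recoding and independent, equally likely input bits, the digit probabilities can be obtained by enumerating the four bits of a group; the helper name `digit_distribution` is hypothetical, and the lowest group is treated separately because its bottom bit is the appended 0.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def digit_distribution(lowest_group: bool) -> dict[int, Fraction]:
    """Probability of each radix-8 Booth digit when the free input bits are uniform.

    For the lowest group the bottom bit is the appended constant 0.
    """
    counts = Counter()
    free_bits = 3 if lowest_group else 4
    for bits in product((0, 1), repeat=free_bits):
        if lowest_group:
            b3, b2, b1 = bits
            b0 = 0
        else:
            b3, b2, b1, b0 = bits
        counts[-4 * b3 + 2 * b2 + b1 + b0] += 1
    total = 2 ** free_bits
    return {d: Fraction(c, total) for d, c in sorted(counts.items())}

if __name__ == "__main__":
    for digit, prob in digit_distribution(lowest_group=False).items():
        print(digit, prob)
```

In this model the digits ±3 together occur with probability 1/4 in every group, which gives a sense of how often an adder-produced multiple enters the partial product array.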
7. The low-energy-consumption high-precision approximate parallel fixed-width multiply-accumulate device of claim 4, wherein the compensation for using the approximate 4_2 compressor comprises:
denoting by an error term the difference between the actual output and the exact output, the expected error is calculated as: [formula not preserved], in which the symbols denote the probability that a given input pattern occurs and the error produced for that pattern; the specific compensation value is the sum of the expected values of the individual errors.
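The expected-error calculation described in claim 7 can be sketched as a probability-weighted sum over input patterns. The compressor logic below is a hypothetical approximate design chosen only for illustration (the claimed compressor's equations are not available here), the exact reference is simply the count of ones among the four inputs with no carry-in, and the inputs are assumed independent.

```python
from fractions import Fraction
from itertools import product

def approx_compressor(x1, x2, x3, x4):
    """Hypothetical approximate 4-2 compressor; returns (carry, sum)."""
    carry = (x1 & x2) | (x3 & x4)
    s = (x1 ^ x2) | (x3 ^ x4)
    return carry, s

def expected_error(p_one):
    """Expected value of (approximate - exact) given P(input_i = 1) for each input."""
    total = Fraction(0)
    for bits in product((0, 1), repeat=4):
        prob = Fraction(1)
        for bit, p in zip(bits, p_one):
            prob *= p if bit else 1 - p
        carry, s = approx_compressor(*bits)
        error = 2 * carry + s - sum(bits)   # output value minus exact count of ones
        total += prob * error
    return total

if __name__ == "__main__":
    # Inputs that are 1 with probability 1/4, as for AND-type partial-product bits.
    print(expected_error([Fraction(1, 4)] * 4))
    # Inputs that are 1 with probability 1/2.
    print(expected_error([Fraction(1, 2)] * 4))
```

Summing such expectations over every approximate compressor in the tree, each scaled by its column weight, and negating the result would give a compensation constant in the spirit of the claim.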
9. The low-energy-consumption high-precision approximate parallel fixed-width multiply-accumulate device of claim 1, wherein the input bits of the carry look-ahead adder are divided into four groups, a ripple-carry adder is used within each group, and carry look-ahead is used between the groups.
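The grouping described in claim 9 can be modeled in software as follows. This is a behavioral sketch only: the total width of 16 bits is an assumption (the claimed bit count is not available here), and the look-ahead carries are evaluated in a loop, whereas hardware would flatten them into two-level logic.

```python
WIDTH = 16                     # total adder width (hypothetical)
GROUPS = 4                     # number of groups, as in the claim
GROUP_BITS = WIDTH // GROUPS   # bits per group

def group_generate_propagate(a: int, b: int):
    """Group generate and propagate signals of one GROUP_BITS-wide slice."""
    g, p = 0, 1
    for i in range(GROUP_BITS):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        gi, pi = ai & bi, ai ^ bi
        g = gi | (pi & g)
        p &= pi
    return g, p

def ripple_add(a: int, b: int, cin: int) -> int:
    """Ripple-carry addition inside one group; returns the sum bits only."""
    s, c = 0, cin
    for i in range(GROUP_BITS):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s |= (ai ^ bi ^ c) << i
        c = (ai & bi) | (c & (ai ^ bi))
    return s

def grouped_add(a: int, b: int, cin: int = 0):
    """Add two WIDTH-bit values; returns (sum modulo 2**WIDTH, carry out)."""
    mask = (1 << GROUP_BITS) - 1
    slices = [((a >> (k * GROUP_BITS)) & mask, (b >> (k * GROUP_BITS)) & mask)
              for k in range(GROUPS)]
    # Look-ahead between groups: c[k+1] = G[k] | (P[k] & c[k]).
    carries = [cin]
    for x, y in slices:
        g, p = group_generate_propagate(x, y)
        carries.append(g | (p & carries[-1]))
    total = 0
    for k, (x, y) in enumerate(slices):
        total |= ripple_add(x, y, carries[k]) << (k * GROUP_BITS)
    return total, carries[-1]

if __name__ == "__main__":
    print(grouped_add(0xFFFF, 0x0001))
```

Keeping ripple carry inside the groups saves logic compared with full look-ahead, while the inter-group look-ahead bounds the longest carry path to roughly one group plus the look-ahead stage.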
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210541757.6A CN114647399B (en) | 2022-05-19 | 2022-05-19 | Low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210541757.6A CN114647399B (en) | 2022-05-19 | 2022-05-19 | Low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114647399A (en) | 2022-06-21 |
CN114647399B (en) | 2022-08-16 |
Family
ID=81997301
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210541757.6A Active CN114647399B (en) | 2022-05-19 | 2022-05-19 | Low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114647399B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115407965B (en) * | 2022-11-01 | 2023-03-24 | 南京航空航天大学 | High-performance approximate divider based on Taylor expansion and error compensation method |
CN116048455B (en) * | 2023-03-07 | 2023-06-02 | 南京航空航天大学 | Insertion type approximate multiplication accumulator |
CN117170623B (en) * | 2023-11-03 | 2024-01-30 | 南京美辰微电子有限公司 | Multi-bit wide reconstruction approximate tensor multiplication and addition method and system for neural network calculation |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7346643B1 (en) * | 1999-07-30 | 2008-03-18 | Mips Technologies, Inc. | Processor with improved accuracy for multiply-add operations |
CN110673823A (en) * | 2019-09-30 | 2020-01-10 | 上海寒武纪信息科技有限公司 | Multiplier, data processing method and chip |
CN114115803A (en) * | 2022-01-24 | 2022-03-01 | 南京航空航天大学 | Approximate floating-point multiplier based on partial product probability analysis |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7315879B2 (en) * | 2001-02-16 | 2008-01-01 | Texas Instruments Incorporated | Multiply-accumulate modules and parallel multipliers and methods of designing multiply-accumulate modules and parallel multipliers |
US6978426B2 (en) * | 2002-04-10 | 2005-12-20 | Broadcom Corporation | Low-error fixed-width modified booth multiplier |
CN111258633B (en) * | 2018-11-30 | 2022-08-09 | 上海寒武纪信息科技有限公司 | Multiplier, data processing method, chip and electronic equipment |
Non-Patent Citations (2)
Title |
---|
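Design of a low-power constant-coefficient multiplier; Li Jing et al.; Computer Engineering and Applications (《计算机工程与应用》); 2007-06-01 (No. 30); full text * |
Research on the implementation of fixed-width truncated parallel multipliers; Sun Ling et al.; China Integrated Circuit (《中国集成电路》); 2007-12-15 (No. 12); full text * |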
Also Published As
Publication number | Publication date |
---|---|
CN114647399A (en) | 2022-06-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | GR01 | Patent grant | |