CN114647399B - Low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device - Google Patents

Publication number
CN114647399B
Authority
CN
China
Prior art keywords
partial product
bit
approximate
adder
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210541757.6A
Other languages
Chinese (zh)
Other versions
CN114647399A (en
Inventor
崔子英
陈珂
刘伟强
崔益军
王成华
吴比
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202210541757.6A priority Critical patent/CN114647399B/en
Publication of CN114647399A publication Critical patent/CN114647399A/en
Application granted granted Critical
Publication of CN114647399B publication Critical patent/CN114647399B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443 Sum of products
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a low-energy, high-precision approximate parallel fixed-width multiply-accumulate device comprising an input truncation compensation circuit, a radix-8 Booth encoder and decoder circuit, a first-stage partial product compression circuit, a second-stage partial product compression circuit, and a carry look-ahead adder circuit. In the first-stage partial product compression circuit, the Wallace tree whose weight is
Figure DEST_PATH_IMAGE001
truncates its low-order bits, compresses the next-lowest 2 bits with approximate 4_2 compressors, and compresses the high-order bits with exact compressors. The second-stage partial product compression circuit uses exact compressors and includes a probabilistic constant compensation section that compensates, respectively, for the truncation of the first-stage partial products, for the errors introduced by the approximate 4_2 compressors, and for the truncation of the second-stage partial products. The invention reduces power consumption and hardware cost through truncation and approximation, and maintains high precision by applying a probabilistic constant compensation strategy to the resulting errors.

Description

Low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device
Technical Field
The invention relates to the technical field of approximate arithmetic operation circuit design, in particular to a low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device.
Background
Since 2007, a series of semiconductor scaling laws such as Moore's law and Dennard scaling have gradually broken down, and it has become very difficult to keep improving chip performance at constant power consumption. At the same time, big data processing and artificial intelligence keep growing in importance; these applications require massive data and complex computation, which places higher demands on energy-efficient, high-performance general-purpose computing engines and application-specific integrated circuits. Many existing applications, such as pattern recognition, video processing, and data mining, are inherently error-tolerant. Building on this tolerance, approximate computing introduces computational precision into the design space as a new dimension and reduces hardware overhead and power consumption while still meeting application requirements; it has been adopted as a new energy-efficient design method to alleviate the problems above.
The multiply-accumulate unit is an important computing unit of digital signal processors and is widely used in applications such as convolutional neural networks. Serial multiply-accumulate units are favored for their small hardware overhead, but they perform poorly in applications with strict latency requirements. Parallel multiply-accumulate units address this need, but they have received little study, and because one can be built simply by replicating a single multiplier and adder, the hardware overhead is excessive. The paper "A High-Performance and Energy-Efficient FIR Adaptive Filter Using Approximate Distributed Arithmetic Circuits", published in IEEE Transactions on Circuits and Systems, discloses a method for designing an adaptive filter based on distributed arithmetic. Its error calculation module follows the same design idea as a parallel multiply-accumulate unit, but its approximation is coarse and does not achieve an effective balance between precision and hardware overhead.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a low-energy, high-precision approximate parallel fixed-width multiply-accumulate device that improves the approximation and truncation strategies of the original design so as to reduce power consumption and hardware overhead while maintaining high precision.
In order to achieve the purpose, the invention adopts the following technical scheme:
a low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device comprises an input truncation compensation circuit, a radix-8 Booth encoder and decoder circuit, a primary partial product compression circuit, a secondary partial product compression circuit and a carry look-ahead adder circuit;
the input truncation compensation circuit receives two groups of inputs of length
Figure 323812DEST_PATH_IMAGE001
each element of each group being
Figure 105954DEST_PATH_IMAGE002
and processes each element as follows: the low
Figure 526571DEST_PATH_IMAGE003
bits are truncated, where the value of k is chosen according to the specific precision requirement and lies in the range
Figure 327037DEST_PATH_IMAGE004
a 1 is placed at bit
Figure 174907DEST_PATH_IMAGE005
and finally the
Figure 834559DEST_PATH_IMAGE006
-bit result is output to the radix-8 Booth encoder and decoder circuit;
the radix-8 Booth encoder and decoder circuit includes N groups of radix-8 Booth encoders, approximate decoding adders, and conventional decoders; the output of one input truncation compensation circuit is split into groups of three bits and encoded by the radix-8 Booth encoder, and the encoding result is output to the conventional decoder; the approximate decoding adder operates on the output of the other input truncation compensation circuit; the conventional decoder processes the results of the radix-8 Booth encoder and the approximate decoding adder to generate the partial products, and outputs them to the first-stage partial product compression circuit;
the first-stage partial product compression circuit comprises
Figure 901609DEST_PATH_IMAGE007
first-level Wallace trees, each of size
Figure 963106DEST_PATH_IMAGE008
and each a regular rectangle; every first-level Wallace tree is divided into three sections for approximate processing: the first-level Wallace tree of weight
Figure 716299DEST_PATH_IMAGE009
truncates its low
Figure 722301DEST_PATH_IMAGE010
bits, compresses the next-lowest 2 bits with approximate 4_2 compressors, and compresses the remaining high-order bits into two rows with exact adders; the exact-adder compression results of all first-level Wallace trees are output to the second-stage partial product compression circuit;
the second-stage partial product compression circuit comprises a second-level Wallace tree and a probabilistic constant compensation module; the probabilistic constant compensation module compensates for the truncation of the first-stage partial products, for the errors introduced by the approximate 4_2 compressors, and for the truncation of the second-stage partial products, and thereby produces the constant compensation data for truncation and approximation; the second-level Wallace tree uses exact adders to compress the received input data together with the constant compensation data into two rows, and takes
Figure 484720DEST_PATH_IMAGE011
bits as the output to the carry look-ahead adder circuit;
the carry look-ahead adder circuit adds the two output rows of the second-stage partial product compression circuit and retains
Figure 400724DEST_PATH_IMAGE011
bits to produce the final output of the multiply-accumulate device.
In order to optimize the technical scheme, the specific measures adopted further comprise:
further, the radix-8 Booth encoder has five output signals; with
Figure 200184DEST_PATH_IMAGE012
Figure 568848DEST_PATH_IMAGE013
Figure 869380DEST_PATH_IMAGE014
Figure 905469DEST_PATH_IMAGE015
respectively denoting the most significant bit, the second least significant bit and the least significant bit of the input signal, the output signals are:
Figure 125097DEST_PATH_IMAGE016
Figure 981058DEST_PATH_IMAGE017
Figure 85280DEST_PATH_IMAGE018
Figure 349777DEST_PATH_IMAGE019
Figure 615673DEST_PATH_IMAGE020
the conventional decoder generates only those partial products that are exactly or approximately compressed in the first-stage partial product compression circuit; its expression is:
Figure 958930DEST_PATH_IMAGE021
where
Figure 725898DEST_PATH_IMAGE022
is bit
Figure DEST_PATH_IMAGE023
of the approximate decoding adder input, and
Figure 205421DEST_PATH_IMAGE024
is bit
Figure 517584DEST_PATH_IMAGE023
of the approximate decoding adder output.
Further, the approximate decoding adder approximately accumulates the low
Figure 348137DEST_PATH_IMAGE025
bits of the input data in groups of two bits, where p is a non-negative integer determined by the precision requirement; the formulas are:
Figure 794162DEST_PATH_IMAGE026
Figure 252825DEST_PATH_IMAGE027
Figure 126103DEST_PATH_IMAGE028
where
Figure 443952DEST_PATH_IMAGE029
denotes bit
Figure 693668DEST_PATH_IMAGE030
of input y,
Figure DEST_PATH_IMAGE031
is the input carry,
Figure 990526DEST_PATH_IMAGE032
is the output carry, and
Figure 34705DEST_PATH_IMAGE033
is bit
Figure 964484DEST_PATH_IMAGE030
of the final sum; an error recovery circuit is added at the low-order positions
Figure 752311DEST_PATH_IMAGE034
and
Figure 60933DEST_PATH_IMAGE035
of the input data, with the formulas:
Figure 151380DEST_PATH_IMAGE036
Figure 443821DEST_PATH_IMAGE037
Figure 769760DEST_PATH_IMAGE038
where
Figure 57522DEST_PATH_IMAGE039
is the error recovery signal for bit
Figure 709083DEST_PATH_IMAGE030
of the final sum, and
Figure 223241DEST_PATH_IMAGE040
and
Figure 352871DEST_PATH_IMAGE041
are, respectively, the final sum and the output carry of bit
Figure 33424DEST_PATH_IMAGE042
after recovery; the high-order bits of the input data are accumulated with a ripple carry adder.
Further, the formulas of the approximate 4_2 compressor used in the first-stage partial product compression circuit are:
Figure 590307DEST_PATH_IMAGE043
Figure 591761DEST_PATH_IMAGE044
where
Figure 649716DEST_PATH_IMAGE045
are the four inputs of the i-th column of the Wallace tree; the exact adders comprise exact full adders and exact half adders; the sign compensation bit of each partial product row is not fed into the Wallace tree, and the resulting error is reduced by constant compensation.
Further, because the signs of its inputs are known, the second-stage partial product compression circuit handles the sign bits uniformly: keeping only the magnitude bits, it appends 111 to the high-order end of either output of the lowest-weight compression tree of the first-stage partial product compression circuit, and appends 110 to the high-order end of either output of the second-lowest-weight and second-highest-weight compression trees.
Further, the process of compensating the first-stage partial product truncation includes:
assuming that the inputs are uniformly distributed:
Figure 53015DEST_PATH_IMAGE046
where
Figure 515221DEST_PATH_IMAGE047
is the m-th bit of the input signal x. After truncation compensation, the probability of each bit is:
Figure 879337DEST_PATH_IMAGE048
One group of operands is radix-8 Booth encoded, with a 0 appended below the least significant bit as the encoding rule requires; the probabilities of the encoding results are:
Figure 350770DEST_PATH_IMAGE049
Figure 342997DEST_PATH_IMAGE050
Figure 631895DEST_PATH_IMAGE051
where
Figure 607942DEST_PATH_IMAGE052
is digit
Figure 351907DEST_PATH_IMAGE030
of the Booth encoding result. When the Booth code value is
Figure 103700DEST_PATH_IMAGE053
the decoding process requires the approximate decoding adder output
Figure 173287DEST_PATH_IMAGE054
so the probability of each bit of
Figure 636629DEST_PATH_IMAGE054
must be calculated; from the characteristics of the approximate decoding adder:
Figure 308919DEST_PATH_IMAGE055
where
Figure 541317DEST_PATH_IMAGE056
is bit
Figure 47385DEST_PATH_IMAGE054
Figure 732444DEST_PATH_IMAGE023
of the adder output named above. The expectation of the partial product is then:
Figure 959157DEST_PATH_IMAGE057
where the index n refers to the result for the n-th elements of the two input vectors, the index i or j refers to the i-th or j-th binary bit of a number, and
Figure 46062DEST_PATH_IMAGE058
is the partial product, in the compression tree of weight
Figure 723031DEST_PATH_IMAGE009
at row
Figure DEST_PATH_IMAGE059
and column
Figure 754441DEST_PATH_IMAGE023
; the expectation of the sign correction bit is always 0.5:
Figure 643900DEST_PATH_IMAGE060
Further, the compensation for the use of the approximate 4_2 compressor includes:
letting δ denote the error between the actual output and the exact output, the expected error is
Figure 224791DEST_PATH_IMAGE061
where
Figure 72662DEST_PATH_IMAGE062
denotes the probability that input pattern
Figure 732313DEST_PATH_IMAGE063
occurs, with
Figure 550097DEST_PATH_IMAGE064
and
Figure 346014DEST_PATH_IMAGE065
denotes the error, with
Figure 364786DEST_PATH_IMAGE066
The specific compensation value is the sum of the individual expected errors.
Further, the constant compensation for the truncation of the second-stage partial products is derived from the expected output value of the approximate 4_2 compressor:
Figure 511733DEST_PATH_IMAGE067
Figure 883940DEST_PATH_IMAGE068
Further, the input of the carry look-ahead adder is
Figure 799943DEST_PATH_IMAGE011
bits wide and is divided into four groups; a ripple carry adder is used within each group, and carry look-ahead is used between the groups.
Further, the first-stage partial product compression circuit and the second-stage partial product compression circuit adopt a sign extension elimination method. This method exploits the fact that a binary digit is either 0 or 1 to handle the sign bits at the inputs of the partial product compression circuits uniformly, converting negative values into a correction at the most significant position so that all subsequent compression operates on positive values.
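The general idea behind sign extension elimination can be illustrated with a minimal Python sketch. It shows the standard textbook form of the trick (invert the sign bit, then add a fixed constant of ones), not the patent's specific constants such as 111 or 110; the function names and the widths `w` and `W` are hypothetical.

```python
def sign_extend(v: int, w: int, W: int) -> int:
    """Reference: sign-extend a w-bit two's-complement pattern to W bits."""
    v &= (1 << w) - 1
    if v >> (w - 1):                      # sign bit set: fill upper bits with ones
        v |= ((1 << W) - 1) ^ ((1 << w) - 1)
    return v


def sign_extension_eliminated(v: int, w: int, W: int) -> int:
    """Same result without replicating the sign bit.

    The sign bit is inverted and a constant with ones in bits w-1 .. W-1 is
    added. The constant does not depend on the data, so in hardware it can be
    merged with other constants instead of extending every row.
    """
    v &= (1 << w) - 1
    flipped = v ^ (1 << (w - 1))          # invert only the sign bit
    constant = (1 << W) - (1 << (w - 1))  # ones from bit w-1 up to bit W-1
    return (flipped + constant) & ((1 << W) - 1)


if __name__ == "__main__":
    # Exhaustive check for 6-bit values extended to 10 bits (illustrative widths).
    assert all(sign_extend(v, 6, 10) == sign_extension_eliminated(v, 6, 10)
               for v in range(1 << 6))
```

Because the added constant is the same for every row, the constants of all rows can be summed once ahead of time, which is what makes the sign handling "uniform".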
The invention has the beneficial effects that:
the low-energy, high-precision approximate parallel fixed-width multiply-accumulate device uses distributed arithmetic to increase the parallelism of the multiply-accumulate unit and thereby improve circuit performance, uses truncation and approximation to reduce circuit complexity and power consumption, and uses a constant compensation strategy to recover precision at minimal hardware cost. Truncating partial products not only saves compressors and shortens the critical path of the carry look-ahead adder, but also removes the conventional Booth decoders that would have generated the truncated partial products, which greatly reduces hardware overhead.
Drawings
FIG. 1 is a schematic diagram of the low-power consumption high-precision approximate parallel fixed-width multiply-accumulate device of the present invention.
FIG. 2a is a schematic structural diagram of first-level Wallace compression tree-0, taking
Figure 724037DEST_PATH_IMAGE069
Figure 482915DEST_PATH_IMAGE070
as an example.
FIG. 2b is a schematic structural diagram of first-level Wallace compression tree-1, taking
Figure 517867DEST_PATH_IMAGE069
Figure 553956DEST_PATH_IMAGE070
as an example.
FIG. 2c is a schematic structural diagram of first-level Wallace compression tree-2, taking
Figure 22852DEST_PATH_IMAGE069
Figure 144392DEST_PATH_IMAGE070
as an example.
FIG. 2d is a schematic structural diagram of first-level Wallace compression tree-3, taking
Figure 983035DEST_PATH_IMAGE069
Figure 732685DEST_PATH_IMAGE070
as an example.
FIG. 3 is a schematic diagram of the second-level partial product compression tree, taking
Figure 264161DEST_PATH_IMAGE069
Figure 607417DEST_PATH_IMAGE070
as an example.
Detailed Description
The present invention will now be described in further detail with reference to the accompanying drawings.
It should be noted that terms such as "upper", "lower", "left", "right", "front", and "back" are used in the present invention only for clarity of description and are not intended to limit the scope in which the invention can be implemented; changes or adjustments of the relative relationships they describe, without substantive change to the technical content, are also regarded as within the implementable scope of the invention.
FIG. 1 is a schematic diagram of the low-power consumption high-precision approximate parallel fixed-width multiply-accumulate device of the present invention. Referring to fig. 1, the multiply-accumulate apparatus includes an input truncation compensation circuit, a radix-8 booth encoder and decoder circuit, a first-stage partial product compression circuit, a second-stage partial product compression circuit, and a carry-look-ahead adder circuit.
The input truncation compensation circuit takes the two groups of inputs of the parallel multiply-accumulate unit, each of length
Figure 125117DEST_PATH_IMAGE011
with each element being
Figure 870219DEST_PATH_IMAGE071
truncates the low
Figure 572596DEST_PATH_IMAGE003
bits of each input, places a 1 at bit
Figure 262204DEST_PATH_IMAGE005
and finally outputs the
Figure 708228DEST_PATH_IMAGE006
-bit result to the radix-8 Booth encoder and decoder circuit.
The radix-8 Booth encoder and decoder circuit splits one group of inputs into groups of three bits for encoding and feeds the other group of inputs to the approximate decoding adder; the encoding result and the generated
Figure 307837DEST_PATH_IMAGE054
are passed to the conventional decoder, which generates the partial products and outputs them to the first-stage partial product compression circuit.
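As a reference point for the encoding step, the following Python sketch shows exact radix-8 Booth recoding and the multiplication it implies. It is a behavioral model only: it produces the digit values in {-4, ..., 4} rather than the encoder's five output signals (whose expressions are not reproduced in this text), and it uses an exact 3y where the device uses an approximate decoding adder. All function names and the default 12-bit width are illustrative assumptions.

```python
def to_signed(value: int, width: int) -> int:
    """Interpret the low `width` bits of `value` as a two's-complement number."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def booth8_digits(x: int, width: int = 12) -> list:
    """Radix-8 Booth digits of a `width`-bit two's-complement operand.

    A 0 is appended below the LSB, then overlapping 4-bit windows
    (three new bits plus the top bit of the previous group) are recoded as
    d = -4*b3 + 2*b2 + b1 + b0, giving digits in -4..4, least significant first.
    """
    assert width % 3 == 0, "this sketch assumes the width is a multiple of 3"
    bits = (x & ((1 << width) - 1)) << 1          # appended 0 below the LSB
    digits = []
    for i in range(width // 3):
        g = (bits >> (3 * i)) & 0xF
        digits.append(-4 * ((g >> 3) & 1) + 2 * ((g >> 2) & 1)
                      + ((g >> 1) & 1) + (g & 1))
    return digits


def booth8_multiply(x: int, y: int, width: int = 12) -> int:
    """Exact signed product built from radix-8 partial products d_i * y * 8**i."""
    y_signed = to_signed(y, width)
    total = 0
    for i, d in enumerate(booth8_digits(x, width)):
        # |d| == 3 is the "hard multiple" that needs an adder (3y = y + 2y);
        # the device approximates this addition, the sketch keeps it exact.
        total += (d * y_signed) << (3 * i)
    return total


if __name__ == "__main__":
    print(booth8_digits(0b000000000001))   # lowest group sees (1, 0) in its low two bits
```

With 12-bit operands this yields four digits per multiplication, which is consistent with the four weight classes of first-level Wallace trees described for the embodiment.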
The first-stage partial product compression circuit comprises
Figure 181115DEST_PATH_IMAGE007
Wallace trees, each of size
Figure 607286DEST_PATH_IMAGE008
and each a regular rectangle. Every first-level Wallace tree is divided into three sections for approximate processing: the Wallace tree of weight
Figure 591423DEST_PATH_IMAGE009
truncates its low
Figure 311117DEST_PATH_IMAGE010
bits, compresses the next-lowest 2 bits with approximate 4_2 compressors, and compresses the remaining high-order bits into two rows with exact adders. Only the exact-adder compression results of all Wallace trees are output to the second-stage partial product compression circuit.
The second-stage partial product compression circuit comprises one Wallace tree, which uses exact adders to compress its inputs together with the constant compensation terms for truncation and approximation into two rows, and takes
Figure 479930DEST_PATH_IMAGE011
bits as the output to the carry look-ahead adder circuit.
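Compressing many rows into two with exact adders is ordinary carry-save reduction. The Python sketch below models it at word level using only full adders (3:2 compressors); it does not reproduce the layout of the patent's trees, its half adders, or its compensation constants, and the function name and 16-bit width are assumptions chosen to match the embodiment's output width.

```python
def carry_save_reduce(rows, width: int = 16):
    """Reduce a list of integers to two rows whose sum (mod 2**width) is unchanged.

    Each pass replaces every group of three rows by a bitwise sum row and a
    carry row shifted left by one, which is what a column of full adders does.
    """
    mask = (1 << width) - 1
    rows = [r & mask for r in rows]
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3       # rows consumed by full adders this pass
        next_rows = []
        for i in range(0, full, 3):
            a, b, c = rows[i:i + 3]
            next_rows.append(a ^ b ^ c)                                   # sum bits
            next_rows.append((((a & b) | (a & c) | (b & c)) << 1) & mask)  # carry bits
        next_rows.extend(rows[full:])          # leftover rows pass through
        rows = next_rows
    while len(rows) < 2:
        rows.append(0)
    return rows[0], rows[1]


if __name__ == "__main__":
    partial_rows = [1000 + 777 * i for i in range(11)]
    s, c = carry_save_reduce(partial_rows)
    assert (s + c) & 0xFFFF == sum(partial_rows) & 0xFFFF
```

No carry propagates across bit positions during reduction; the single carry-propagating addition is deferred to the final adder, which is why the two rows are handed to the carry look-ahead adder.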
The carry look-ahead adder circuit adds the two output rows of the second-stage partial product compression circuit and retains
Figure 19496DEST_PATH_IMAGE011
bits to produce the final result of the multiply-accumulate unit.
(I) Radix-8 Booth encoder and decoder circuit
The radix-8 Booth encoder and decoder circuit includes a radix-8 Booth encoder, an approximate decoding adder, and a conventional decoder.
The radix-8 booth encoder has five output signals, the expression:
Figure 807323DEST_PATH_IMAGE016
Figure 256890DEST_PATH_IMAGE072
Figure 471971DEST_PATH_IMAGE073
Figure 233254DEST_PATH_IMAGE019
Figure 949406DEST_PATH_IMAGE074
The decoder consists of the approximate decoding adder, which generates
Figure 378113DEST_PATH_IMAGE054
and the conventional decoder, which generates only those partial products that are exactly or approximately compressed in the first-stage partial product compression circuit; its expression is:
Figure 498516DEST_PATH_IMAGE021
The approximate decoding adder approximately accumulates the low
Figure 652154DEST_PATH_IMAGE025
bits in groups of two bits, with the formulas:
Figure 781784DEST_PATH_IMAGE026
Figure 64998DEST_PATH_IMAGE027
Figure 746515DEST_PATH_IMAGE028
(ii) a At the same time to
Figure 747969DEST_PATH_IMAGE034
And
Figure 681290DEST_PATH_IMAGE035
the low-order additional error recovery circuit has the formula:
Figure 428797DEST_PATH_IMAGE036
Figure 422161DEST_PATH_IMAGE037
Figure 910911DEST_PATH_IMAGE038
The high-order bits are accumulated with a ripple carry adder.
(II) First-stage partial product compression circuit
The approximate 4_2 compressor formula used in the first stage partial product compression circuit is:
Figure 241399DEST_PATH_IMAGE043
Figure 499205DEST_PATH_IMAGE044
(ii) a The precise adder comprises a precise full adder and a precise half adder; the sign compensation bit of each row partial product is not included in the Wallace tree, and the error is reduced by a constant compensation method.
(III) two-stage partial product compression circuit
Constant probability compensation in a two-level partial product compression circuit includes three parts, namely compensation for truncation of the first-level partial product, compensation for errors generated by using an approximate 4_2 compressor, and compensation for truncation of the second-level partial product.
(1) A first part: and (5) compensating for the truncation of the first-order partial product.
Assuming that the inputs are uniformly distributed, i.e. (here the subscripts are omitted, since the probabilities of all inputs are the same):
Figure 663470DEST_PATH_IMAGE046
After truncation compensation, the probability of each bit is (for convenience, all inputs are divided by
Figure 373937DEST_PATH_IMAGE075
):
Figure 757382DEST_PATH_IMAGE048
One group of operands is radix-8 Booth encoded, and the encoding rule requires a 0 to be appended below the least significant bit. Because the low two bits of the lowest encoding group are therefore always (1,0), the probabilities of its encoding results differ from those of the other groups and must be treated separately. The encoding result probabilities are as follows (positive and negative results are equally likely):
Figure 135274DEST_PATH_IMAGE049
Figure 63916DEST_PATH_IMAGE050
Figure 527258DEST_PATH_IMAGE051
To obtain the probabilities of the partial products, the probability of each bit of
Figure 340493DEST_PATH_IMAGE054
must also be calculated; from the characteristics of the approximate decoding adder:
Figure 572892DEST_PATH_IMAGE055
the expectation of the partial product can be calculated from the above equation as follows:
Figure 423167DEST_PATH_IMAGE076
where
Figure 373806DEST_PATH_IMAGE058
is the partial product, in the compression tree of weight
Figure 990732DEST_PATH_IMAGE009
at row
Figure 936691DEST_PATH_IMAGE059
and column
Figure 613660DEST_PATH_IMAGE023
; the expectation of the sign correction bit is always 0.5, i.e.
Figure 786015DEST_PATH_IMAGE077
(2) Second part: compensation for the use of approximate 4_2 compressors.
Let
Figure 314955DEST_PATH_IMAGE063
denote the different input patterns, with
Figure 256366DEST_PATH_IMAGE064
let
Figure 838657DEST_PATH_IMAGE065
denote the error, with
Figure 622942DEST_PATH_IMAGE066
and let
Figure 316092DEST_PATH_IMAGE062
denote the probability that
Figure 377588DEST_PATH_IMAGE063
occurs; the expected error is then
Figure 271726DEST_PATH_IMAGE061
The specific compensation value is the sum of the individual expected errors.
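The expected-error calculation, a sum over input patterns of probability times error, can be sketched in Python as follows. The compressor logic used here is a hypothetical stand-in chosen only so the example runs, since the patent's approximate 4_2 compressor expressions are not reproduced in this text; the independence of the four input bits and the names `approx_compressor` and `expected_error` are likewise assumptions.

```python
from fractions import Fraction
from itertools import product


def approx_compressor(x1: int, x2: int, x3: int, x4: int):
    """HYPOTHETICAL approximate 4:2 compressor (not the patent's design).

    Returns (sum, carry); the encoded value is sum + 2*carry, so it can only
    represent 0..3 while the exact count of ones ranges over 0..4.
    """
    s = (x1 ^ x2) | (x3 ^ x4)
    c = (x1 & x2) | (x3 & x4)
    return s, c


def expected_error(p_one: Fraction) -> Fraction:
    """E[delta] = sum over all 16 input patterns of P(pattern) * delta(pattern).

    Assumes the four inputs are independent and each is 1 with probability p_one.
    """
    total = Fraction(0)
    for bits in product((0, 1), repeat=4):
        prob = Fraction(1)
        for b in bits:
            prob *= p_one if b else 1 - p_one
        s, c = approx_compressor(*bits)
        delta = (s + 2 * c) - sum(bits)     # actual output minus exact output
        total += prob * delta
    return total


if __name__ == "__main__":
    print(expected_error(Fraction(1, 2)))   # -3/8 for this stand-in compressor
```

In the device the per-bit probabilities are not all 1/2 (they follow from the Booth encoding and decoding statistics derived in the first part), so each compressor gets its own expected error, and the compensation constant is the weighted sum of these values with the sign reversed.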
(3) Third part: compensation for the truncation of the second-stage partial products.
The error in this part arises because the second-stage partial product compression circuit truncates output results of the approximate 4_2 compressors; it is compensated by a constant derived from the expected value of those outputs:
Figure 418674DEST_PATH_IMAGE067
Figure 649935DEST_PATH_IMAGE068
(IV) Carry look-ahead adder
The input of the carry look-ahead adder is
Figure 690572DEST_PATH_IMAGE011
bits wide and is divided into four groups; a ripple carry adder is used within each group, and carry look-ahead is used between the groups.
Taking
Figure 880245DEST_PATH_IMAGE078
Figure 763756DEST_PATH_IMAGE079
as an example, the multiply-accumulate device of the embodiment of the present invention is further described below with reference to the accompanying drawings.
The
Figure 172610DEST_PATH_IMAGE078
Figure 943120DEST_PATH_IMAGE079
low-power approximate parallel fixed-width multiply-accumulate unit comprises an input truncation compensation circuit, a radix-8 Booth encoder and decoder circuit, a first-stage partial product compression circuit, a second-stage partial product compression circuit, and a carry look-ahead adder circuit.
The input truncation compensation circuit truncates the lower 5 bits of the two groups of length-16 inputs of the parallel multiply-accumulate unit, places a 1 at bit 4, and outputs the resulting 12-bit values to the radix-8 Booth encoder and decoder circuit. Only one group of inputs needs radix-8 encoding; it is split into groups of three bits. The other group of inputs is simultaneously sent to the approximate decoding adder to compute 3x. The three sets of signals are then sent to the conventional decoder, which generates only the partial products that are compressed by the first-stage partial product compression circuit, and the results are routed according to their encoding weights to the first-level Wallace compression trees of the corresponding weights, as shown in FIG. 1. In FIG. 1, first-level Wallace tree-0 denotes the lowest-weight Wallace tree in the first-stage partial product compression circuit, first-level Wallace tree-1 the second-lowest, and so on. The first-stage partial product compression circuit has 4 Wallace trees, each a 14 × 8 rectangle, as shown in FIGS. 2a to 2d. The results of the first-stage partial product compression circuit are shifted according to their weights and input to the second-stage partial product compression circuit, where exact adders compress them into two rows; finally, the lower 16 bits are sent to the carry look-ahead adder, and the lower 16 bits of its result are taken as the final fixed-width output.
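The input truncation compensation step of this embodiment can be modelled with a short Python sketch. It assumes that "length 16" refers to the bit width of each input element, which is the reading consistent with a 12-bit result (11 kept bits plus one appended constant 1 of weight 2^4); the function name and parameters are hypothetical.

```python
def truncate_and_compensate(x: int, width: int = 16, cut: int = 5) -> int:
    """Drop the low `cut` bits of a `width`-bit word and append a constant 1.

    The kept bits (positions cut .. width-1) are followed by a 1 at position
    cut-1, so the result has width - cut + 1 bits and its LSB has weight
    2**(cut-1) in the original scale.
    """
    x &= (1 << width) - 1
    kept = x >> cut                 # 11 bits for the default parameters
    return (kept << 1) | 1          # 12 bits, least significant bit always 1


def mean_compensated_error(width: int = 16, cut: int = 5) -> float:
    """Average of (compensated value - original value) over all inputs."""
    scale = cut - 1
    total = sum((truncate_and_compensate(x, width, cut) << scale) - x
                for x in range(1 << width))
    return total / (1 << width)


if __name__ == "__main__":
    print(truncate_and_compensate(96))      # kept bits 0b11, appended 1 -> 7
    print(mean_compensated_error())         # 0.5 under uniformly distributed inputs
```

Under uniformly distributed inputs the discarded bits average 15.5, so replacing them with the constant 16 leaves a mean error of 0.5, whereas plain truncation would leave -15.5. The appended 1 is also why the lowest Booth encoding group always sees (1,0) in its low two bits.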
The radix-8 booth encoder has five output signals, the expression is:
Figure 303694DEST_PATH_IMAGE016
Figure 284288DEST_PATH_IMAGE017
Figure 122931DEST_PATH_IMAGE018
Figure 13527DEST_PATH_IMAGE019
Figure 545002DEST_PATH_IMAGE020
The decoder consists of the approximate decoding adder, which generates
Figure 498046DEST_PATH_IMAGE054
and the conventional decoder, which generates only those partial products that are exactly or approximately compressed in the first-stage partial product compression circuit; its expression is:
Figure 405959DEST_PATH_IMAGE021
The approximate decoding adder approximately accumulates the lower 4 bits in groups of two bits, with the formulas:
Figure 151061DEST_PATH_IMAGE026
Figure 978072DEST_PATH_IMAGE027
Figure 543045DEST_PATH_IMAGE028
(ii) a Meanwhile, an error recovery circuit is added to the lowest 3 bits and the lowest 4 bits, and the formula is as follows:
Figure 723491DEST_PATH_IMAGE036
Figure 962580DEST_PATH_IMAGE037
Figure 835858DEST_PATH_IMAGE038
(ii) a The high order is accumulated using a ripple carry adder.
As shown in fig. 2 a-2 d, the sign expansion process is performed on all four wallace trees of the first-stage partial-product compression circuit, and finally two output signs are controlled to be positive, negative. The lowest order compression tree truncates the lower 8 bits, the next lowest order compression tree truncates the lower 5 bits, the next highest order compression tree truncates the lower 2 bits, the highest order compression tree does not truncate, except the highest order compression tree, the next highest 2 bits of all compression trees use an approximate 4_2 compressor, the lowest 1 bit of the highest order compression tree uses an approximate 4_2 compressor, the rest high bits are compressed into two rows by an accurate adder, and only the compression results of the accurate adders of all compression trees are input to a second-level partial product compression circuit.
The approximate 4_2 compressor formula used in the first stage partial product compression circuit is:
Figure 888128DEST_PATH_IMAGE043
Figure 996898DEST_PATH_IMAGE044
(ii) a The precise adder includes a precise full adder and a precise half adder. The sign compensation bit of each row partial product is not added into the compression tree, and the error is reduced by a constant compensation method.
As a further optimization of this embodiment, the second-stage partial product compression circuit handles the sign bits uniformly once the input signs have been determined: with only the value bits retained, '111' is appended to the high-order end of one of the two outputs of the lowest-order compression tree of the first-stage partial product compression circuit, and '110' is appended to the high-order end of one of the two outputs of each of the second-lowest-order and second-highest-order compression trees, as shown in FIG. 3.
As a further optimization of this embodiment, the second-stage partial product compression circuit includes a constant compensation part for truncation and approximation. The compensation constant is derived from theoretical probability: a 1 is added at the 2nd and 4th bits of the second-stage partial product compression tree, as shown in FIG. 3. The final 16-bit result is fed into the carry-lookahead adder.
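How a compensation constant can be derived from probability is sketched below for a single approximate compressor. The patent's compressor formula is only available as a figure, so the sketch uses a hypothetical stand-in, `saturating_compressor`, which encodes min(x1+x2+x3+x4, 3) on its two outputs and is therefore wrong only when all four inputs are 1. The assumption that the four inputs are independent with the same probability of being 1 is also made here for illustration.

```python
from fractions import Fraction
from itertools import product


def saturating_compressor(x1: int, x2: int, x3: int, x4: int) -> tuple[int, int]:
    """Hypothetical approximate 4-2 compressor: (carry, sum) encoding min(count, 3)."""
    count = min(x1 + x2 + x3 + x4, 3)
    return count >> 1, count & 1


def expected_error(p_one: Fraction) -> Fraction:
    """Expected (approximate - exact) value, inputs independent with P(bit = 1) = p_one."""
    total = Fraction(0)
    for bits in product((0, 1), repeat=4):
        carry, s = saturating_compressor(*bits)
        err = (2 * carry + s) - sum(bits)
        prob = Fraction(1)
        for b in bits:
            prob *= p_one if b else 1 - p_one
        total += prob * err  # probability of the input pattern times its error
    return total


if __name__ == "__main__":
    print(expected_error(Fraction(1, 2)))  # expected error for unbiased input bits
    print(expected_error(Fraction(1, 4)))  # expected error for sparser input bits
```

Summing such expectations over all approximate compressors and truncated columns, each scaled by its column weight, gives a constant whose negation can be added as compensation. Which bit positions that constant maps to in the patent (the 2nd and 4th bits) depends on formulas not reproduced here.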
The carry-lookahead adder takes 16-bit inputs divided into four groups of four bits, with a ripple-carry adder within each group and carry lookahead between the groups. The lower 16 bits of the sum are taken as the fixed-width result of the final approximate multiply-accumulate unit.
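The grouped adder structure can be modelled bit by bit as follows. This is a behavioural sketch under the assumption of four 4-bit groups; the function names are hypothetical, and the group carries are computed sequentially here although hardware would expand them into parallel lookahead logic.

```python
def ripple_add(a: int, b: int, carry_in: int, width: int) -> tuple[int, int]:
    """Bit-serial ripple-carry addition; returns (sum, carry_out)."""
    total, carry = 0, carry_in
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        total |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return total, carry


def grouped_cla_add(a: int, b: int, width: int = 16, group: int = 4) -> int:
    """Fixed-width add: ripple carry inside each group, generate/propagate between groups."""
    mask = (1 << group) - 1
    n_groups = width // group
    pairs = [((a >> (g * group)) & mask, (b >> (g * group)) & mask) for g in range(n_groups)]
    gen = [ripple_add(ga, gb, 0, group)[1] for ga, gb in pairs]  # group generates a carry
    prop = [int((ga ^ gb) == mask) for ga, gb in pairs]          # group propagates a carry
    carries = [0]
    for g in range(n_groups):
        carries.append(gen[g] | (prop[g] & carries[g]))          # lookahead recurrence
    result = 0
    for g, (ga, gb) in enumerate(pairs):
        s, _ = ripple_add(ga, gb, carries[g], group)
        result |= s << (g * group)
    return result  # the carry out of the top group is discarded (fixed width)


if __name__ == "__main__":
    print(hex(grouped_cla_add(0x1234, 0x0FCC)))
    print(hex(grouped_cla_add(0xFFFF, 0x0001)))  # wraps around to zero
```

Discarding the carry out of the top group corresponds to keeping only the lower 16 bits of the sum.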
Finally, the improved design reduces the power-delay product by 25% relative to the original design and by 80% relative to the full-precision version, and reduces the mean error distance by 58% relative to the version without compensation.
While preferred embodiments of this specification have been described, those skilled in the art may make additional variations and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all variations and modifications that fall within the scope of the embodiments of this specification.
It will be apparent to those skilled in the art that various changes and modifications may be made to the embodiments of this specification without departing from their spirit and scope. Thus, if such modifications and variations fall within the scope of the claims of this specification and their equivalents, this specification is intended to include them as well.

Claims (9)

1. A low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device is characterized by comprising an input truncation compensation circuit, a radix-8 Booth encoder and decoder circuit, a primary partial product compression circuit, a secondary partial product compression circuit and a carry look-ahead adder circuit;
the input truncation compensation circuit receives two groups of inputs of length
Figure DEST_PATH_IMAGE002
, each element of each group being
Figure DEST_PATH_IMAGE004
data, on which the following processing is performed: the lower
Figure DEST_PATH_IMAGE006
bits are truncated, where the value of k is determined by the specific precision requirement and lies in the range
Figure DEST_PATH_IMAGE008
; a 1 is added at bit position
Figure DEST_PATH_IMAGE010
; and finally the
Figure DEST_PATH_IMAGE012
-bit result is output to the radix-8 Booth encoder and decoder circuit;
the radix-8 Booth encoder and decoder circuit comprises N groups of radix-8 Booth encoders, approximate decoding adders and conventional decoders; the output of one of the input truncation compensation circuits is divided into groups of three bits and encoded by the radix-8 Booth encoder, and the encoding result is output to the conventional decoder; the approximate decoding adder operates on the output of the other input truncation compensation circuit; the conventional decoder processes the results of the radix-8 Booth encoder and the approximate decoding adder to generate partial products and outputs them to the first-stage partial product compression circuit;
the first-stage partial product compression circuit comprises
Figure DEST_PATH_IMAGE014
first-stage Wallace trees, each of size
Figure DEST_PATH_IMAGE016
and each a regular rectangle; each first-stage Wallace tree is divided into three sections for approximate processing: the first-stage Wallace tree of weight
Figure DEST_PATH_IMAGE018
truncates its lower
Figure DEST_PATH_IMAGE020
bits, compresses the next-lowest 2 bits using approximate 4_2 compressors, and compresses the remaining higher bits into two rows using exact adders; the exact-adder compression results of all first-stage Wallace trees are output to the second-stage partial product compression circuit;
the second-stage partial product compression circuit comprises a second-stage Wallace tree and a probability constant compensation module; the probability constant compensation module compensates for the errors produced by the first-stage partial product truncation, by the use of the approximate 4_2 compressors and by the second-stage partial product truncation, yielding the constant compensation data for truncation and approximation; the second-stage Wallace tree compresses the received input data and the constant compensation data into two rows using exact adders, and takes
Figure DEST_PATH_IMAGE022
bits, which are output to the carry-lookahead adder circuit;
the carry-lookahead adder circuit adds the outputs of the second-stage partial product compression circuit and retains
Figure 709973DEST_PATH_IMAGE022
to produce the final output of the multiply-accumulate device.
2. The low energy consumption high precision approximate parallel fixed width multiply accumulate device of claim 1, wherein said radix-8 booth encoder includes five output signals, using
Figure DEST_PATH_IMAGE024
Figure DEST_PATH_IMAGE026
Figure DEST_PATH_IMAGE028
Figure DEST_PATH_IMAGE030
Respectively representing the most significant bit, the second least significant bit and the least significant bit of the input signal, and the expressions are respectively:
Figure DEST_PATH_IMAGE032
Figure DEST_PATH_IMAGE034
Figure DEST_PATH_IMAGE036
Figure DEST_PATH_IMAGE038
Figure DEST_PATH_IMAGE040
the conventional decoder generates only those partial products that are exactly or approximately compressed in the first-stage partial product compression circuit, according to the expression:
Figure DEST_PATH_IMAGE042
where
Figure DEST_PATH_IMAGE044
is the
Figure DEST_PATH_IMAGE046
-th bit of the approximate decoding adder input, and
Figure DEST_PATH_IMAGE048
is the
Figure 486168DEST_PATH_IMAGE046
-th bit of the approximate decoding adder output.
3. The low-energy-consumption high-precision approximate parallel fixed-width multiply-accumulate device of claim 1, wherein the approximate decoding adder approximately accumulates the lower
Figure DEST_PATH_IMAGE050
bits of the input data in groups of two bits, where p is a non-negative integer determined by the precision requirement, according to the formulas:
Figure DEST_PATH_IMAGE052
Figure DEST_PATH_IMAGE054
Figure DEST_PATH_IMAGE056
where
Figure DEST_PATH_IMAGE058
denotes the
Figure DEST_PATH_IMAGE060
-th bit of input y,
Figure DEST_PATH_IMAGE062
is the input carry,
Figure DEST_PATH_IMAGE064
is the output carry, and
Figure DEST_PATH_IMAGE066
is the
Figure 233281DEST_PATH_IMAGE060
-th bit of the final sum; an error recovery circuit is added at the
Figure DEST_PATH_IMAGE068
and
Figure DEST_PATH_IMAGE070
low-order bits of the input data, according to the formulas:
Figure DEST_PATH_IMAGE072
Figure DEST_PATH_IMAGE074
Figure DEST_PATH_IMAGE076
where
Figure DEST_PATH_IMAGE078
is the error recovery signal for the
Figure 018704DEST_PATH_IMAGE060
-th bit of the final sum, and
Figure DEST_PATH_IMAGE080
and
Figure DEST_PATH_IMAGE082
are, respectively, the final sum and the output carry after recovery of the
Figure DEST_PATH_IMAGE084
bits; the higher-order bits of the input data are accumulated using a ripple-carry adder.
4. The low-energy-consumption high-precision approximate parallel fixed-width multiply-accumulate device of claim 3, wherein the output signals of the approximate 4_2 compressor used in the first-stage partial product compression circuit are given by:
Figure DEST_PATH_IMAGE086
Figure DEST_PATH_IMAGE088
where
Figure DEST_PATH_IMAGE090
are the four inputs of column
Figure 239601DEST_PATH_IMAGE060
of the first-stage Wallace tree; the exact adders comprise exact full adders and exact half adders; the sign compensation bit of each partial product row is not included in the Wallace tree, and the resulting error is reduced by constant compensation.
5. The low-energy-consumption high-precision approximate parallel fixed-width multiply-accumulate device of claim 1, wherein the second-stage partial product compression circuit handles the sign bits uniformly once the input signs have been determined: with only the value bits retained, '111' is appended to the high-order end of one of the two outputs of the lowest-order compression tree of the first-stage partial product compression circuit, and '110' is appended to the high-order end of one of the two outputs of each of the second-lowest-order and second-highest-order compression trees.
6. The low-energy-consumption high-precision approximate parallel fixed-width multiply-accumulate device of claim 1, wherein the process of compensating for the first-stage partial product truncation comprises:
assuming that the inputs are uniformly distributed:
Figure DEST_PATH_IMAGE092
where
Figure DEST_PATH_IMAGE094
is the m-th bit of the input signal x; the probability of each bit after truncation compensation is:
Figure DEST_PATH_IMAGE096
performing radix-8 Booth encoding on one group of operands, and filling 0 at the lowest bit according to an encoding rule, wherein the encoding result probability is as follows:
Figure DEST_PATH_IMAGE098
Figure DEST_PATH_IMAGE100
Figure DEST_PATH_IMAGE102
where
Figure DEST_PATH_IMAGE104
is the
Figure 763861DEST_PATH_IMAGE060
-th bit of the Booth encoding result; when the Booth encoded value is
Figure DEST_PATH_IMAGE106
, the output
Figure DEST_PATH_IMAGE108
of the approximate decoding adder is used for decoding, and the probability of each bit of
Figure 578364DEST_PATH_IMAGE108
is obtained from the characteristics of the approximate decoding adder:
Figure DEST_PATH_IMAGE110
where
Figure DEST_PATH_IMAGE112
is, within
Figure 839581DEST_PATH_IMAGE108
, the
Figure 288886DEST_PATH_IMAGE046
-th bit; the expectation of the partial products is calculated as follows:
Figure DEST_PATH_IMAGE114
where the subscript n denotes the result of operating on the n-th elements of the two groups of input vectors, the subscript i or j denotes the i-th or j-th binary bit of the number, and
Figure DEST_PATH_IMAGE116
is, for the compression tree of weight
Figure 328255DEST_PATH_IMAGE018
, the partial product in row
Figure DEST_PATH_IMAGE118
and column
Figure 613743DEST_PATH_IMAGE046
; the expectation of the sign correction bit is always 0.5,
Figure DEST_PATH_IMAGE120
7. The low-energy-consumption high-precision approximate parallel fixed-width multiply-accumulate device of claim 4, wherein the compensation for the use of the approximate 4_2 compressor comprises:
using
Figure DEST_PATH_IMAGE122
to denote the error between the actual output and the exact output, the expected error is calculated as
Figure DEST_PATH_IMAGE124
Figure DEST_PATH_IMAGE126
denotes the probability that pattern
Figure DEST_PATH_IMAGE128
occurs,
Figure DEST_PATH_IMAGE130
Figure DEST_PATH_IMAGE132
denotes the error,
Figure DEST_PATH_IMAGE134
; the specific compensation value is the sum of the expectations of the individual errors.
8. The low-energy-consumption high-precision approximate parallel fixed-width multiply-accumulate device of claim 7, wherein constant compensation for the second-stage partial product truncation is performed according to the expectation of the output values of the approximate 4_2 compressor:
Figure DEST_PATH_IMAGE136
Figure DEST_PATH_IMAGE138
9. The low-energy-consumption high-precision approximate parallel fixed-width multiply-accumulate device of claim 1, wherein the input of the carry-lookahead adder is
Figure 334443DEST_PATH_IMAGE022
bits, divided into four groups, with a ripple-carry adder within each group and carry lookahead between the groups.
CN202210541757.6A 2022-05-19 2022-05-19 Low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device Active CN114647399B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210541757.6A CN114647399B (en) 2022-05-19 2022-05-19 Low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device

Publications (2)

Publication Number Publication Date
CN114647399A CN114647399A (en) 2022-06-21
CN114647399B true CN114647399B (en) 2022-08-16

Family

ID=81997301

Country Status (1)

Country Link
CN (1) CN114647399B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115407965B (en) * 2022-11-01 2023-03-24 Nanjing University of Aeronautics and Astronautics High-performance approximate divider based on Taylor expansion and error compensation method
CN116048455B (en) * 2023-03-07 2023-06-02 Nanjing University of Aeronautics and Astronautics Insertion type approximate multiplication accumulator
CN117170623B (en) * 2023-11-03 2024-01-30 Nanjing Meichen Microelectronics Co., Ltd. Multi-bit wide reconstruction approximate tensor multiplication and addition method and system for neural network calculation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7346643B1 (en) * 1999-07-30 2008-03-18 Mips Technologies, Inc. Processor with improved accuracy for multiply-add operations
CN110673823A (en) * 2019-09-30 2020-01-10 Shanghai Cambricon Information Technology Co., Ltd. Multiplier, data processing method and chip
CN114115803A (en) * 2022-01-24 2022-03-01 Nanjing University of Aeronautics and Astronautics Approximate floating-point multiplier based on partial product probability analysis

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7315879B2 (en) * 2001-02-16 2008-01-01 Texas Instruments Incorporated Multiply-accumulate modules and parallel multipliers and methods of designing multiply-accumulate modules and parallel multipliers
US6978426B2 (en) * 2002-04-10 2005-12-20 Broadcom Corporation Low-error fixed-width modified booth multiplier
CN111258633B (en) * 2018-11-30 2022-08-09 Shanghai Cambricon Information Technology Co., Ltd. Multiplier, data processing method, chip and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
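Design of a low-power constant-coefficient multiplier; Li Jing et al.; Computer Engineering and Applications (《计算机工程与应用》); 2007-06-01 (No. 30); full text *
Research on the implementation of fixed-width truncated parallel multipliers; Sun Ling et al.; China Integrated Circuit (《中国集成电路》); 2007-12-15 (No. 12); full text *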

Also Published As

Publication number Publication date
CN114647399A (en) 2022-06-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant