CN114647399B

CN114647399B - Low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device

Info

Publication number: CN114647399B
Application number: CN202210541757.6A
Authority: CN
Inventors: 崔子英; 陈珂; 刘伟强; 崔益军; 王成华; 吴比
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2022-05-19
Filing date: 2022-05-19
Publication date: 2022-08-16
Anticipated expiration: 2042-05-19
Also published as: CN114647399A

Abstract

The invention discloses a low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device, which comprises an input truncation compensation circuit, a radix-8 Booth encoder and decoder circuit, a first-stage partial product compression circuit, a second-stage partial product compression circuit and a carry look-ahead adder circuit. In the first-stage partial product compression circuit, each Wallace tree truncates a number of low-order bits that depends on the weight of the tree, compresses the next two lower bits with approximate 4_2 compressors, and compresses the remaining high-order bits with exact compressors. The second-stage partial product compression circuit uses exact compressors and includes a probability constant compensation section that compensates, respectively, for the truncation of the first-stage partial products, for the errors generated by the approximate 4_2 compressors, and for the truncation of the second-stage partial products. The invention reduces power consumption and hardware cost through truncation and approximation, and maintains high precision by compensating the resulting errors with probability-derived constants.

Description

Low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device

Technical Field

The invention relates to the technical field of approximate arithmetic operation circuit design, in particular to a low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device.

Background

Since 2007, semiconductor scaling laws such as Moore's law and Dennard scaling have gradually broken down, and it has become very difficult to keep improving chip performance within the same power budget. At the same time, big data processing and artificial intelligence continue to grow in importance; these applications involve massive data and complex computation, placing higher demands on energy-efficient, high-performance general-purpose compute engines and application-specific integrated circuits. Applications such as pattern recognition, video processing and data mining are inherently error-tolerant. For such applications, approximate computing introduces computational precision into the design space as a new dimension, reducing hardware overhead and power consumption while still meeting application requirements, and it has been adopted as a new energy-efficient design method to alleviate these problems.

As an important computing unit of digital signal processors, the multiply-accumulate unit is widely used in applications such as convolutional neural networks. Serial multiply-accumulate units are favored for their small hardware overhead, but they perform poorly in applications with strict latency requirements. Parallel multiply-accumulate units address this, but they have received less research attention, and their hardware overhead is excessive because they are typically built by replicating a single multiplier and adder. The paper "A High-Performance and Energy-Efficient FIR Adaptive Filter Using Approximate Distributed Arithmetic Circuits", published in IEEE Transactions on Circuits and Systems, discloses an adaptive filter design based on distributed arithmetic in which the error calculation module follows the same design idea as a parallel multiply-accumulate unit; however, its approximation is coarse, and it does not achieve an effective balance between precision and hardware overhead.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device, aiming at improving the approximation and truncation strategies on the original design, reducing the power consumption, reducing the hardware expense and maintaining higher precision.

In order to achieve the purpose, the invention adopts the following technical scheme:

a low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device comprises an input truncation compensation circuit, a radix-8 Booth encoder and decoder circuit, a primary partial product compression circuit, a secondary partial product compression circuit and a carry look-ahead adder circuit;

the input truncation compensation circuit receives two groups of input data, each group containing the same number of fixed-width elements, and processes every element as follows: the low k bits are truncated, where the value of k is chosen within a permitted range according to the specific precision requirement; a 1 is filled in at the bit position immediately below the retained bits; and the resulting bits are output to the radix-8 Booth encoder and decoder circuit;

the radix-8 Booth encoder and decoder circuit includes N sets, each comprising a radix-8 Booth encoder, an approximate decoding adder and a conventional decoder; the output of one group of input truncation compensation circuits is split into groups of three bits and encoded by the radix-8 Booth encoder, and the encoding result is output to the conventional decoder; the approximate decoding adder operates on the output of the other group of input truncation compensation circuits; the conventional decoder combines the results of the radix-8 Booth encoder and the approximate decoding adder to generate partial products, and outputs the partial products to the first-stage partial product compression circuit;

the first-stage partial product compression circuit comprises a plurality of first-level Wallace trees, one for each partial product weight, all of the same size and each a regular rectangle; each first-level Wallace tree is divided into three sections for approximate processing: a weight-dependent number of low-order bits is truncated, the next two lower bits are compressed using approximate 4_2 compressors, and the remaining high-order bits are compressed into two rows using exact adders; the compression results of the exact adders of all first-level Wallace trees are output to the second-stage partial product compression circuit;

the second-stage partial product compression circuit comprises a second-level Wallace tree and a probability constant compensation module, wherein the probability constant compensation module compensates for the truncation of the first-stage partial products, for the errors generated by the approximate 4_2 compressors, and for the truncation of the second-stage partial products, yielding constant compensation data for truncation and approximation; the second-level Wallace tree compresses the received input data and the constant compensation data into two rows using exact adders, and outputs a fixed number of bits of the result to the carry look-ahead adder circuit;

the carry look-ahead adder circuit adds the two output rows of the second-stage partial product compression circuit and retains a fixed number of result bits as the final output of the multiply-accumulate device.

In order to optimize the technical scheme, the specific measures adopted further comprise:

further, the radix-8 Booth encoder has five output signals, each a logic function of the four bits of one encoding group, namely the most significant bit, the second most significant bit, the second least significant bit and the least significant bit of the encoder input;

the conventional decoder generates only those partial product bits that are exactly or approximately compressed in the first-stage partial product compression circuit, as a function of the encoder outputs, the corresponding bit of the approximate decoding adder input, and the corresponding bit of the approximate decoding adder output.

Further, the approximate decoding adder approximately accumulates the low-order p bits of its input data in groups of two bits, where p is a non-negative integer determined according to the precision requirement; each two-bit group produces its sum bits and an output carry from the corresponding bits of the input y and the input carry of the group; an error recovery circuit is additionally applied to two designated low-order bit positions of the input data, generating an error recovery signal for the corresponding sum bits together with the recovered sum bits and the recovered output carry; and the high-order bits of the input data are accumulated using a ripple-carry adder.

Further, the approximate 4_2 compressor used in the first-stage partial product compression circuit computes its output signals from the four inputs of the i-th column of the Wallace tree; the exact adders comprise exact full adders and exact half adders; the sign compensation bit of each partial product row is not included in the Wallace tree, and the resulting error is reduced by constant compensation.

Further, the second-stage partial product compression circuit handles the sign bits uniformly once the signs of its inputs have been fixed: with only the magnitude bits retained, "111" is appended to the high-order end of one of the outputs of the lowest-weight compression tree of the first-stage partial product compression circuit, and "110" is appended to the high-order end of one of the outputs of the second-lowest-weight and second-highest-weight compression trees.

Further, the process of compensating for the first-stage partial product truncation includes:

assuming that the inputs are uniformly distributed, every bit of the input signal x, including its m-th bit, is 0 or 1 with equal probability; the probability of each bit after input truncation compensation follows from this assumption;

one group of operands is radix-8 Booth encoded, with a 0 appended below the least significant bit as required by the encoding rule, and the probability of each encoding result is computed from the bit probabilities;

when the Booth-encoded value requires the output of the approximate decoding adder during decoding, the probability of each bit of that output is also needed, and it is obtained from the characteristics of the approximate decoding adder;

the expectation of each partial product is then calculated from these probabilities, where the index n denotes the result of operating on the n-th elements of the two input vectors, the index i or j denotes the i-th or j-th binary bit of a number, and each partial product is identified by the weight of its compression tree, its row and its column; the expectation of each sign correction bit is constantly 0.5.

further, the compensation for using the approximate 4_2 compressors includes:

letting delta denote the error between the actual output and the exact output of a compressor, the expected error is the sum, over all input patterns, of the probability that the pattern occurs multiplied by the error produced for that pattern; the specific compensation value is the sum of the expected errors of the individual compressors.

Further, the constant compensation for the second-stage partial product truncation is derived from the expected output values of the approximate 4_2 compressors.

further, the input bits of the carry look-ahead adder are divided into four groups, with a ripple-carry adder inside each group and carry look-ahead logic between the groups.

Further, the first-stage partial product compression circuit and the second-stage partial product compression circuit adopt a sign extension elimination method. Exploiting the fact that a binary digit is either 0 or 1, the method handles the sign bits at the inputs of the partial product compression circuits uniformly: each negative contribution is converted into a 1 in the highest position, so that all subsequent compression operates on positive values only.
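The general idea behind sign extension elimination can be illustrated with a small software model. The sketch below is only an illustration under stated assumptions: the function name `sum_signed_rows`, the row width `n` and the accumulator width `width` are hypothetical, and the single correction constant shown here is the textbook form of the technique, not the specific "111" and "110" patterns of the device.

```python
def sum_signed_rows(rows, n, width):
    """Add n-bit two's-complement rows without sign-extending any of them.

    Inverting the sign bit of a row turns it into a non-negative n-bit
    number equal to (row + 2**(n-1)). One constant, known at design time,
    then removes the accumulated offset modulo 2**width.
    """
    mask = (1 << width) - 1
    sign = 1 << (n - 1)
    acc = 0
    for r in rows:
        pattern = r & ((1 << n) - 1)  # raw n-bit two's-complement pattern
        acc += pattern ^ sign         # sign bit inverted: value is now >= 0
    correction = (-len(rows) * sign) & mask  # design-time constant
    return (acc + correction) & mask


rows = [-5, 3, -8, 7]
result = sum_signed_rows(rows, n=4, width=8)
```

Because every row becomes non-negative, the compression tree never has to replicate sign bits across its high-order columns; only the constant has to be added once.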

The invention has the beneficial effects that:

the low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device uses distributed arithmetic to increase the parallelism of the multiply-accumulate unit and effectively improve circuit performance, uses truncation and approximation to reduce circuit complexity and power consumption, and uses a constant compensation strategy to recover precision at minimal hardware cost. Truncating the partial products not only saves compressors and shortens the critical path of the carry look-ahead adder, but also removes the conventional Booth decoders that would have generated the truncated partial products, greatly reducing hardware overhead.

Drawings

FIG. 1 is a schematic diagram of the low-power consumption high-precision approximate parallel fixed-width multiply-accumulate device of the present invention.

FIG. 2a is a schematic structural diagram of first-level Wallace compression tree-0 for the example configuration of the embodiment.

FIG. 2b is a schematic structural diagram of first-level Wallace compression tree-1 for the same example configuration.

FIG. 2c is a schematic structural diagram of first-level Wallace compression tree-2 for the same example configuration.

FIG. 2d is a schematic structural diagram of first-level Wallace compression tree-3 for the same example configuration.

FIG. 3 is a schematic diagram of the second-stage partial product compression tree for the same example configuration.

Detailed Description

The present invention will now be described in further detail with reference to the accompanying drawings.

It should be noted that terms such as "upper", "lower", "left", "right", "front" and "back" are used in the present invention for clarity of description only and are not intended to limit the scope in which the invention can be practiced; changes or adjustments to their relative relationships, without substantive change to the technical content, are also regarded as within the scope of the invention.

FIG. 1 is a schematic diagram of the low-power consumption high-precision approximate parallel fixed-width multiply-accumulate device of the present invention. Referring to fig. 1, the multiply-accumulate apparatus includes an input truncation compensation circuit, a radix-8 booth encoder and decoder circuit, a first-stage partial product compression circuit, a second-stage partial product compression circuit, and a carry-look-ahead adder circuit.

The input truncation compensation circuit takes the two groups of inputs of the parallel multiply-accumulate unit, truncates the low-order bits of every element, fills in a 1 at the bit position immediately below the retained bits, and outputs the result to the radix-8 Booth encoder and decoder circuit.
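As a rough software model of this step, the sketch below uses the numbers of the embodiment (16-bit inputs, lower 5 bits truncated, 12-bit result). It assumes that the "4th bit" is counted from zero, so that the constant 1 becomes the least significant bit of the 12-bit output; the function names are hypothetical.

```python
def truncate_and_compensate(x, width=16, k=5):
    """Drop the k low-order bits of a width-bit word and append a constant 1.

    The result has width - k + 1 bits: the retained high-order bits followed
    by the compensation bit, which stands in for the discarded low bits.
    """
    mask = (1 << width) - 1
    kept = (x & mask) >> k   # discard the k low-order bits
    return (kept << 1) | 1   # append the constant compensation bit


def reconstructed(word, k=5):
    """Value represented by the compensated word at the original scale."""
    return word << (k - 1)


worst = max(abs(x - reconstructed(truncate_and_compensate(x)))
            for x in range(1 << 16))
```

Under this assumption the constant sits at the midpoint of the discarded range, so the error of a single operand stays within half of the truncated range in either direction.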

The radix-8 Booth encoder and decoder circuit splits one group of inputs into groups of three bits for encoding and feeds the other group of inputs to the approximate decoding adder; the encoding results and the triple multiple generated by the adder are passed to the conventional decoder, which generates the partial products and outputs them to the first-stage partial product compression circuit.

The first-stage partial product compression circuit comprises a plurality of Wallace trees of identical size, one for each partial product weight, each a regular rectangle. Each first-level Wallace tree is divided into three sections for approximate processing: a weight-dependent number of low-order bits is truncated, the next two lower bits are compressed by approximate 4_2 compressors, and the remaining high-order bits are compressed into two rows by exact adders. Only the compression results of the exact adders of all Wallace trees are output to the second-stage partial product compression circuit.

The second-stage partial product compression circuit comprises one Wallace tree, which uses exact adders to compress its inputs together with the constant compensation for truncation and approximation into two rows, and outputs a fixed number of bits to the carry look-ahead adder circuit.

The carry look-ahead adder circuit adds the two output rows of the second-stage partial product compression circuit and retains a fixed number of result bits as the final result of the multiply-accumulate unit.

(I) Radix-8 Booth encoder and decoder circuit

The radix-8 Booth encoder and decoder circuit includes a radix-8 Booth encoder, an approximate decoding adder and a conventional decoder.

The radix-8 Booth encoder has five output signals, each a logic function of the four bits of one encoding group.

Using the triple multiple produced by the approximate decoding adder, the conventional decoder generates only those partial product bits that are exactly or approximately compressed in the first-stage partial product compression circuit.
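For orientation, the standard radix-8 Booth recoding can be modeled in a few lines. This is a sketch of the textbook scheme, not of the encoder and decoder logic equations of the device: the function names are hypothetical, the operand width of 12 bits is taken from the embodiment, and the triple multiple is computed exactly here, whereas the device uses an approximate decoding adder for it.

```python
def radix8_booth_digits(x, nbits=12):
    """Recode a signed nbits-bit integer into radix-8 Booth digits in -4..4.

    Each digit is formed from three bits of x plus the top bit of the group
    below (0 for the lowest group), least significant digit first.
    """
    if nbits % 3 != 0:
        raise ValueError("nbits must be a multiple of 3")

    def bit(j):
        return 0 if j < 0 else (x >> j) & 1

    digits = []
    for i in range(nbits // 3):
        lo = 3 * i
        digits.append(-4 * bit(lo + 2) + 2 * bit(lo + 1) + bit(lo) + bit(lo - 1))
    return digits


def booth_multiply(x, y, nbits=12):
    """Multiply via radix-8 Booth digits; 3*y is the only 'hard' multiple."""
    multiples = {0: 0, 1: y, 2: y << 1, 3: y + (y << 1), 4: y << 2}
    total = 0
    for i, d in enumerate(radix8_booth_digits(x, nbits)):
        pp = multiples[abs(d)]                     # select |d| * y
        total += (-pp if d < 0 else pp) << (3 * i)  # apply sign and weight
    return total
```

A 12-bit operand yields four digits, which matches the four first-level Wallace trees of the embodiment, one per digit weight.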

The approximate decoding adder approximately accumulates the low-order bits in groups of two bits, applies an additional error recovery circuit to two designated low-order bit positions, and accumulates the high-order bits using a ripple-carry adder.

(II) First-stage partial product compression circuit

The first-stage partial product compression circuit uses approximate 4_2 compressors together with exact adders; the exact adders comprise exact full adders and exact half adders. The sign compensation bit of each partial product row is not included in the Wallace tree, and the resulting error is reduced by constant compensation.

(III) Second-stage partial product compression circuit

The probability constant compensation in the second-stage partial product compression circuit includes three parts: compensation for the truncation of the first-stage partial products, compensation for the errors generated by the approximate 4_2 compressors, and compensation for the truncation of the second-stage partial products.

(1) First part: compensation for the first-stage partial product truncation.

Assume that the inputs are uniformly distributed, so that every input bit is 0 or 1 with equal probability (the subscripts are omitted here, since the probabilities of all inputs are the same).

The probability of each bit after truncation compensation then follows directly; for convenience, all inputs are divided by a common scaling factor.

One group of operands is radix-8 Booth encoded, with a 0 appended below the least significant bit as required by the encoding rule. Since the lower two bits of the lowest encoding group are always (1,0), the probability of its encoding result differs from that of the other groups and must be considered separately. Positive and negative encoding results of the same magnitude are equally probable.

To obtain the probability of the partial products, the probability of each bit of the approximate decoding adder output is also needed; it is obtained from the properties of the approximate decoding adder.

From these probabilities the expectation of each partial product is calculated, where each partial product is identified by the weight of its compression tree, its row and its column; the expectation of each sign correction bit is always 0.5.
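The principle of deriving a compensation constant from bit probabilities can be shown on a much simpler structure than the Booth-encoded trees of the device. The sketch below assumes a plain unsigned AND-array multiplier with independent, uniformly distributed input bits; the function names and the 4 x 4 example size are hypothetical and serve only to check the expectation against exhaustive enumeration.

```python
from itertools import product


def expected_truncated_value(n, cut):
    """Expected sum of the partial-product bits in the `cut` lowest columns
    of an unsigned n x n AND-array multiplier with uniform input bits."""
    p_bit = 0.5 * 0.5  # P(a_i AND b_j = 1) for independent uniform bits
    return sum(p_bit * 2 ** (i + j)
               for i in range(n) for j in range(n) if i + j < cut)


def brute_force_mean(n, cut):
    """Same quantity, averaged over every pair of n-bit operands."""
    total = 0
    for a, b in product(range(1 << n), repeat=2):
        total += sum((((a >> i) & 1) * ((b >> j) & 1)) << (i + j)
                     for i in range(n) for j in range(n) if i + j < cut)
    return total / (1 << (2 * n))


expected = expected_truncated_value(4, 4)
# Constant expressed in units of the lowest retained column (weight 2**cut).
constant = round(expected / 2 ** 4)
```

In the device the same reasoning is applied to the probabilities of the Booth-encoded partial products rather than to simple AND terms, so the actual constants differ from this toy case.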

(2) Second part: compensation for using the approximate 4_2 compressors.

Each distinct input pattern of a compressor is considered separately: every pattern has an associated error between the approximate output and the exact output, and a probability of occurrence, so the expected error is the sum over all patterns of probability multiplied by error. The specific compensation value is the sum of the expected errors of the individual compressors.
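The expected-error calculation can be sketched as follows. The compressor logic used here is a hypothetical stand-in chosen for illustration (OR of two XORs for the sum, OR of two ANDs for the carry, no carry-in); it is not the approximate 4_2 compressor of the device, whose equations are not reproduced in this text. Only the way the expectation is formed is the point of the example.

```python
from itertools import product


def exact_value(x1, x2, x3, x4):
    """Exact arithmetic value of four equally weighted input bits."""
    return x1 + x2 + x3 + x4


def approx_value(x1, x2, x3, x4):
    """Hypothetical approximate compressor: sum bit plus twice the carry."""
    s = (x1 ^ x2) | (x3 ^ x4)
    c = (x1 & x2) | (x3 & x4)
    return s + 2 * c


def expected_error(p_one):
    """Sum over all 16 input patterns of P(pattern) * (approx - exact).

    p_one holds the probability of each of the four inputs being 1;
    the inputs are assumed to be independent.
    """
    total = 0.0
    for pattern in product((0, 1), repeat=4):
        prob = 1.0
        for bit, p in zip(pattern, p_one):
            prob *= p if bit else 1.0 - p
        total += prob * (approx_value(*pattern) - exact_value(*pattern))
    return total


error_uniform = expected_error([0.5, 0.5, 0.5, 0.5])
```

The compensation for a column is then the negated expected error scaled by the column weight, summed over all approximate compressors in the tree.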

(3) Third part: compensation for the second-stage partial product truncation.

This error arises because the second-stage partial product compression circuit truncates the output of the approximate 4_2 compressors; the constant compensation is derived from the expected value of those outputs.

(IV) Carry look-ahead adder

The input bits of the carry look-ahead adder are divided into four groups, with a ripple-carry adder inside each group and carry look-ahead logic between the groups.

The multiply-accumulate device of the embodiment of the present invention is further described below with reference to the accompanying drawings, taking a specific configuration as an example.

In this example, the low-power approximate parallel fixed-width multiply-accumulate unit comprises an input truncation compensation circuit, a radix-8 Booth encoder and decoder circuit, a first-stage partial product compression circuit, a second-stage partial product compression circuit and a carry look-ahead adder circuit.

The input truncation compensation circuit truncates the lower 5 bits of each 16-bit input in the two groups of inputs of the parallel multiply-accumulate unit, fills in a 1 at the 4th bit, and outputs the 12-bit result to the radix-8 Booth encoder and decoder circuit. Only one group of inputs needs radix-8 encoding: it is split into groups of three bits, while the other group of inputs is simultaneously sent to the approximate decoding adder to calculate 3x. All three are then sent to the conventional decoder, which generates only the partial products that will be compressed by the first-stage partial product compression circuit, and each result is sent to the first-level Wallace compression tree whose weight matches its encoding weight, as shown in FIG. 1. In FIG. 1, first-level Wallace tree-0 denotes the lowest-weight Wallace tree in the first-stage partial product compression circuit, first-level Wallace tree-1 denotes the second-lowest-weight Wallace tree, and so on. The first-stage partial product compression circuit has 4 Wallace trees, each a 14 × 8 rectangle, as shown in FIGS. 2a to 2d. The results of the first-stage partial product compression circuit are shifted according to their weights and input to the second-stage partial product compression circuit, where they are compressed into two rows using exact adders; finally, the lower 16 bits are sent to the carry look-ahead adder, and the lower 16 bits are taken as the final fixed-width output.

The radix-8 Booth encoder has five output signals, each a logic function of the four bits of one encoding group, and the conventional decoder generates its partial products using the 3x output of the approximate decoding adder.

The approximate decoding adder approximately accumulates the lower 4 bits in groups of two bits; in addition, an error recovery circuit is applied to the third and fourth low-order bits; the high-order bits are accumulated using a ripple-carry adder.

As shown in FIGS. 2a to 2d, sign extension processing is applied to all four Wallace trees of the first-stage partial product compression circuit, so that the signs of the two outputs are finally fixed as positive and negative, respectively. The lowest-weight compression tree truncates the lower 8 bits, the second-lowest-weight compression tree truncates the lower 5 bits, the second-highest-weight compression tree truncates the lower 2 bits, and the highest-weight compression tree is not truncated. In all compression trees except the highest-weight one, the 2 bits immediately above the truncated bits use approximate 4_2 compressors; in the highest-weight compression tree, the lowest 1 bit uses an approximate 4_2 compressor. The remaining high-order bits are compressed into two rows by exact adders, and only the compression results of the exact adders of all compression trees are input to the second-stage partial product compression circuit.

The exact adders comprise exact full adders and exact half adders. The sign compensation bit of each partial product row is not added into the compression tree, and the resulting error is reduced by constant compensation.

As a further optimization of this embodiment, the second-stage partial product compression circuit handles the sign bits uniformly once the signs of its inputs have been fixed: with only the magnitude bits retained, "111" is appended to the high-order end of one of the outputs of the lowest-weight compression tree of the first-stage partial product compression circuit, and "110" is appended to the high-order end of one of the outputs of the second-lowest-weight and second-highest-weight compression trees, as shown in FIG. 3.

As a further optimization of this embodiment, the second-stage partial product compression circuit includes a constant compensation part for truncation and approximation, with the constants derived from theoretical probability: a 1 is added at the 2nd and 4th bit positions of the second-stage partial product compression tree, as shown in FIG. 3. The final 16-bit result is fed into the carry look-ahead adder.

The input of the carry look-ahead adder is 16 bits, divided into four groups of four bits, with a ripple-carry adder inside each group and carry look-ahead logic between the groups; the lower 16 bits are taken as the fixed-width result of the final approximate multiply-accumulate unit.
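A bit-level software model of this adder organization might look as follows. It is a sketch under the assumption that each group exposes the usual group generate and group propagate signals to the look-ahead logic; the function names are hypothetical and the model ignores timing entirely.

```python
def group_generate_propagate(a, b, width=4):
    """Group generate (carry out with carry-in 0) and group propagate."""
    g, p = 0, 1
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        g = (ai & bi) | ((ai ^ bi) & g)
        p &= ai ^ bi
    return g, p


def ripple_add(a, b, cin, width=4):
    """Ripple-carry addition inside one group; returns the sum bits only."""
    s, c = 0, cin
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s |= (ai ^ bi ^ c) << i
        c = (ai & bi) | ((ai ^ bi) & c)
    return s


def cla16(a, b):
    """16-bit addition: look-ahead between four 4-bit ripple-carry groups."""
    groups = [((a >> (4 * k)) & 0xF, (b >> (4 * k)) & 0xF) for k in range(4)]
    carries = [0]
    for ga, gb in groups:
        g, p = group_generate_propagate(ga, gb)
        # In hardware these expressions are flattened so that every group
        # carry is available without waiting for the previous group's sum.
        carries.append(g | (p & carries[-1]))
    result = 0
    for k, (ga, gb) in enumerate(groups):
        result |= ripple_add(ga, gb, carries[k]) << (4 * k)
    return result  # carry out of the top group is dropped (fixed width)
```

Keeping ripple-carry adders inside the groups saves area, while the look-ahead logic between groups keeps the carry chain short across the full 16 bits.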

Compared with the original design, the improved design reduces the power-delay product by 25%; compared with the full-precision version, it reduces the power-delay product by 80%; and compared with the version without compensation, it reduces the mean error distance by 58%.

While preferred embodiments of the embodiments of this specification have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the true scope of the embodiments of the specification.

It will be apparent to those skilled in the art that various changes and modifications may be made in the embodiments described herein without departing from the spirit and scope of the embodiments described herein. Thus, if such modifications and variations of the embodiments of the present specification fall within the scope of the claims of the embodiments of the present specification and their equivalents, the embodiments of the present specification are intended to include such modifications and variations.

Claims

1. A low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device is characterized by comprising an input truncation compensation circuit, a radix-8 Booth encoder and decoder circuit, a primary partial product compression circuit, a secondary partial product compression circuit and a carry look-ahead adder circuit;

the input truncation compensation circuit receives two groups of input data and processes every element of each group as follows: the low-order bits are truncated, a 1 is filled in at the bit position immediately below the retained bits, and the resulting bits are output to the radix-8 Booth encoder and decoder circuit;

the radix-8 Booth encoder and decoder circuit includes N sets, each comprising a radix-8 Booth encoder, an approximate decoding adder and a conventional decoder; the output of one group of input truncation compensation circuits is split into groups of three bits and encoded by the radix-8 Booth encoder, and the encoding result is output to the conventional decoder; the approximate decoding adder operates on the output of the other group of input truncation compensation circuits; the conventional decoder combines the results of the radix-8 Booth encoder and the approximate decoding adder to generate partial products, and outputs the partial products to the first-stage partial product compression circuit;

the first-stage partial product compression circuit comprises a plurality of first-level Wallace trees, one for each partial product weight, all of the same size and each a regular rectangle; each first-level Wallace tree is divided into three sections for approximate processing: a weight-dependent number of low-order bits is truncated, the next two lower bits are compressed using approximate 4_2 compressors, and the remaining high-order bits are compressed into two rows using exact adders; the compression results of the exact adders of all first-level Wallace trees are output to the second-stage partial product compression circuit;

the second-stage partial product compression circuit comprises a second-level Wallace tree and a probability constant compensation module, wherein the probability constant compensation module compensates for the truncation of the first-stage partial products, for the errors generated by the approximate 4_2 compressors, and for the truncation of the second-stage partial products, yielding constant compensation data for truncation and approximation; the second-level Wallace tree compresses the received input data and the constant compensation data into two rows using exact adders, and outputs a fixed number of bits of the result to the carry look-ahead adder circuit;

the carry look-ahead adder circuit adds the two output rows of the second-stage partial product compression circuit and retains a fixed number of result bits as the final output of the multiply-accumulate device.

2. The low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device of claim 1, wherein the radix-8 Booth encoder has five output signals, each a logic function of the four bits of one encoding group, namely the most significant bit, the second most significant bit, the second least significant bit and the least significant bit of the encoder input; and the conventional decoder generates its partial product bits as a function of the encoder outputs, the corresponding bit of the approximate decoding adder input, and the corresponding bit of the approximate decoding adder output.

3. The low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device of claim 1, wherein the approximate decoding adder approximately accumulates the low-order p bits of its input data in groups of two bits, p being a non-negative integer determined according to the precision requirement; each two-bit group produces its sum bits and an output carry from the corresponding bits of the input y and the input carry of the group; an error recovery circuit is additionally applied to two designated low-order bit positions of the input data, generating an error recovery signal for the corresponding sum bits together with the recovered sum bits and the recovered output carry; and the high-order bits of the input data are accumulated using a ripple-carry adder.

4. The low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device of claim 3, wherein the approximate 4_2 compressor used in the first-stage partial product compression circuit computes its output signals from the four inputs of the i-th column of the Wallace tree; the exact adders comprise exact full adders and exact half adders; the sign compensation bit of each partial product row is not included in the Wallace tree, and the resulting error is reduced by constant compensation.

5. The low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device of claim 1, wherein the second-stage partial product compression circuit handles the sign bits uniformly once the signs of its inputs have been fixed: with only the magnitude bits retained, "111" is appended to the high-order end of one of the outputs of the lowest-weight compression tree of the first-stage partial product compression circuit, and "110" is appended to the high-order end of one of the outputs of the second-lowest-weight and second-highest-weight compression trees.

6. The low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device of claim 1, wherein the process of compensating for the first-stage partial product truncation comprises:

assuming that the inputs are uniformly distributed, every bit of the input signal x, including its m-th bit, is 0 or 1 with equal probability; the probability of each bit after truncation compensation and the probability of each Booth encoding result are computed from this assumption;

when the Booth-encoded value requires it, the output of the approximate decoding adder is used for decoding, and the probability of each bit of that output is calculated;

the expectation of each partial product is then calculated, where the index n denotes the result of operating on the n-th elements of the two input vectors, the index i or j denotes the i-th or j-th binary bit of a number, and each partial product is identified by the weight of its compression tree, its row and its column; the expectation of each sign correction bit is always 0.5.

7. The low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device of claim 4, wherein the compensation for using the approximate 4_2 compressors comprises: evaluating, for each input pattern of a compressor, the error between the actual output and the exact output and the probability that the pattern occurs, the expected error being the sum over all input patterns of probability multiplied by error.

8. The low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device of claim 7, wherein the constant compensation for the second-stage partial product truncation is derived from the expected output values of the approximate 4_2 compressors.

9. the low energy consumption high precision approximate parallel fixed width multiply accumulate device of claim 1, wherein the input of the carry look ahead adder is