CN113190787A

CN113190787A - FFT processor based on approximate complex multiplier

Info

Publication number: CN113190787A
Application number: CN202110452797.9A
Authority: CN
Inventors: 刘伟强; 杜锦鹤
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2021-04-26
Filing date: 2021-04-26
Publication date: 2021-07-30
Anticipated expiration: 2041-04-26
Also published as: CN113190787B

Abstract

The invention discloses an FFT processor based on an approximate complex multiplier. By approximating the Booth coding unit and the partial product compression unit of the multipliers in the FFT processor, the invention reduces resource consumption during operation and improves the operating speed and processor performance while keeping the precision at a certain level.

Description

FFT processor based on approximate complex multiplier

Technical Field

The invention belongs to the field of design of FFT (fast Fourier transform) processors, and particularly relates to an FFT processor based on an approximate complex multiplier.

Background

Conventional computing pursues exact arithmetic. This approach faces growing challenges in power consumption, circuit reliability and performance. Approximate computing has been proposed to build energy-efficient systems for emerging error-tolerant applications (e.g., speech recognition, image processing, data mining and video processing) that do not require full accuracy. Digital signal processing (DSP) is likewise error tolerant, so applying approximate computing to DSP is an effective way to achieve low power consumption and high performance.

The fast Fourier transform (FFT) is a fast algorithm for the discrete Fourier transform (DFT), obtained by exploiting the odd/even and real/imaginary properties of the DFT. Because the FFT is used in a wide range of applications, its hardware structure has been widely studied and optimized for different use cases. Hardware implementations of the FFT fall into two main classes: reconfigurable structures and fixed structures. Variable-length FFTs generally use a reconfigurable structure, and a common method for supporting several transform lengths is the mixed-radix algorithm. Fixed-structure FFTs can be further divided into parallel and pipeline structures. The most typical pipeline architectures are the multi-path delay commutator (MDC) and the single-path delay feedback (SDF). Each structure has its own trade-offs: the parallel structure processes N inputs simultaneously, with small delay and easy control, but consumes more hardware resources; the pipeline structure is simpler, but processes data sequentially and therefore has greater latency.

Disclosure of Invention

In order to solve the technical problems mentioned in the background art, the present invention provides an FFT processor based on an approximate complex multiplier.

In order to achieve the technical purpose, the technical scheme of the invention is as follows:

An FFT processor based on an approximate complex multiplier comprises a plurality of basic units cascaded in sequence, wherein each basic unit comprises a butterfly operation unit and m feedback units; each butterfly operation unit comprises a signal input end, a signal output end, m feedback input ends and m feedback output ends, m being a positive integer; each feedback output end is connected to the corresponding feedback input end through the corresponding feedback unit; the signal output end of the butterfly operation unit in the previous basic unit is connected to the signal input end of the butterfly operation unit in the next basic unit through a complex multiplier, and the signal output by the butterfly operation unit in the previous basic unit is multiplied by a twiddle factor in the complex multiplier and then used as the input signal of the butterfly operation unit in the next basic unit; the complex multiplier comprises first and second subtractors, first to third adders and first to third multipliers; the two input ends of the first subtractor respectively receive the real part and the imaginary part of the output signal of the butterfly operation unit in the previous basic unit; the two input ends of the second subtractor respectively receive the real part and the imaginary part of the twiddle factor; the two input ends of the first adder respectively receive the real part and the imaginary part of the output signal of the butterfly operation unit in the previous basic unit; the two input ends of the first multiplier respectively receive the output signal of the first subtractor and the imaginary part of the twiddle factor; the two input ends of the second multiplier respectively receive the output signal of the second subtractor and the real part of the output signal of the butterfly operation unit in the previous basic unit; the two input ends of the third multiplier respectively receive the output signal of the first adder and the real part of the twiddle factor; the two input ends of the second adder respectively receive the output signal of the first multiplier and the output signal of the second multiplier; the two input ends of the third adder respectively receive the inverted output signal of the second multiplier and the output signal of the third multiplier; each multiplier comprises a Booth coding unit, a partial product compression unit and a fast summation unit, wherein the Booth coding unit encodes the two operands to quickly generate partial products, the partial product compression unit compresses the generated partial products to quickly obtain two rows of partial products, and the fast summation unit adds the two rows of partial products with a fast adder to generate the final product;

the Booth coding unit and the partial product compression unit are approximated, and the partial product expression of the approximated Booth coding unit is as follows:

wherein the generated partial products are arranged as a partial product array, pp_ij is the partial product in row i and column j of the partial product array, a_j is the j-th bit of one multiplier operand, b_{2i+1} is the (2i+1)-th bit of the other multiplier operand, and ⊕ represents an exclusive-OR operation;

an approximate 4-2 compressor is designed for the partial product compression unit, wherein the approximate 4-2 compressor comprises an OR gate, first to third NOR gates, and first and second AND gates; for the 4 partial products in the same column of the partial product array, the two input ends of the first NOR gate respectively receive the partial products of the first row and the second row in the column, the two input ends of the second NOR gate respectively receive the partial products of the third row and the fourth row in the column, the two input ends of the third NOR gate respectively receive the output signal of the first NOR gate and the output signal of the second NOR gate, the two input ends of the first AND gate respectively receive the partial products of the first row and the second row in the column, the two input ends of the second AND gate respectively receive the partial products of the third row and the fourth row in the column, and the two input ends of the OR gate respectively receive the output signal of the first AND gate and the output signal of the second AND gate.
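The gate wiring described above can be modeled in a few lines of Python to inspect its truth table. This is only a sketch: it assumes the two gates fed directly by the row pairs are AND gates (consistent with the "first AND gate" named in the text), and the function name and the output names `out_nor` and `out_or` are hypothetical. The text reproduced here does not say which output carries which weight, so the sketch does not assign sum or carry roles.

```python
from itertools import product


def nor(a, b):
    """Two-input NOR on single bits (0 or 1)."""
    return 1 - (a | b)


def approx_42_gates(x1, x2, x3, x4):
    """Model the described network: three NOR gates, two AND gates, one OR gate.

    x1..x4 are the four partial-product bits of one column (rows 1 to 4).
    Returns (out_nor, out_or): the third NOR gate output and the OR gate output.
    """
    nor1 = nor(x1, x2)          # first NOR: rows 1 and 2
    nor2 = nor(x3, x4)          # second NOR: rows 3 and 4
    out_nor = nor(nor1, nor2)   # third NOR; equals (x1 | x2) & (x3 | x4)
    and1 = x1 & x2              # first AND: rows 1 and 2
    and2 = x3 & x4              # second AND: rows 3 and 4
    out_or = and1 | and2        # OR of the two AND outputs
    return out_nor, out_or


# Print the full 16-row truth table of the gate network.
for bits in product((0, 1), repeat=4):
    print(bits, approx_42_gates(*bits))
```

Compared with an exact 4-2 compressor, this network has no carry-in or carry-out chain between columns, which is where the area and delay savings of such a design would come from.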

Furthermore, after the Booth coding unit generates the partial product array, the sign compensation bit of the last row is directly deleted.

Further, an inexact factor n is set, and only the n least significant bits of the multiplier are approximated, where n is a positive integer.

The above technical scheme brings the following beneficial effects:

The invention approximates the Booth coding unit and the partial product compression unit of the multipliers in the FFT processor, which reduces resource consumption during operation and improves the operating speed and processor performance while keeping the precision at a certain level.
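The complex multiplier wiring in the technical scheme above uses three real multiplications instead of four. The following Python sketch traces the two subtractors, three adders and three multipliers with exact arithmetic (no approximation) to check that the wiring yields the complex product. The function and variable names are hypothetical; xr and xi stand for the real and imaginary parts of the butterfly output, and wr and wi for those of the twiddle factor.

```python
def complex_mul_3mult(xr, xi, wr, wi):
    """Complex product (xr + j*xi) * (wr + j*wi) using three real multipliers."""
    s1 = xr - xi      # first subtractor: real minus imaginary part of the data
    s2 = wr - wi      # second subtractor: real minus imaginary part of the twiddle
    a1 = xr + xi      # first adder: real plus imaginary part of the data
    m1 = s1 * wi      # first multiplier
    m2 = s2 * xr      # second multiplier
    m3 = a1 * wr      # third multiplier
    re = m1 + m2      # second adder:  xr*wr - xi*wi
    im = m3 - m2      # third adder (inverted m2):  xr*wi + xi*wr
    return re, im


# Example: (3 + 4j) * (5 - 2j) = 23 + 14j
print(complex_mul_3mult(3, 4, 5, -2))
```

Expanding the terms shows why this works: m1 + m2 = xr·wr − xi·wi and m3 − m2 = xr·wi + xi·wr. The second multiplier output is shared by both results, which is what saves the fourth multiplier.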

Drawings

FIG. 1 is a view of the structure of an N-point R4 SDF;

FIG. 2 is a diagram of one stage operation of Radix4 pipeline type FFT;

FIG. 3 is a schematic diagram of a butterfly unit;

FIG. 4 is a diagram of a logic gate structure for Booth encoding with approximate optimization in the present invention;

FIG. 5 is a partial product array dot diagram according to the present invention;

FIG. 6 is a diagram of the logic gate structure of the approximately optimized 4-2 compressor of the present invention;

FIG. 7 is a partial product array dot diagram of a 16-bit multiplier with inexact factors of 8 and 16 according to the present invention.

Detailed Description

The technical scheme of the invention is explained in detail in the following with the accompanying drawings.

FIG. 1 shows an N-point R4SDF structure, which uses shift registers to delay data and improves storage utilization by arranging the transfer paths appropriately. It differs from the MDC structure as follows: in the MDC structure the number of data paths is tied directly to the chosen Radix-α algorithm, the overall resource utilization drops rapidly as α increases, and most of the time is wasted on storage reads; in the SDF structure only one data path is needed between stages, regardless of whether Radix2, Radix4 or Radix8 is chosen. Although the SDF structure needs the same number of stages and butterfly units as the MDC structure, the resource utilization of each module is greatly improved.

The butterfly operation of the Radix4 algorithm is shown in the left half of FIG. 2. One Radix4 butterfly unit performs the work of two Radix2 stages, and the required operation steps are also simplified to some extent. One stage of the Radix4 pipeline FFT is shown in FIG. 2, where a small part of the butterfly module's resources is used to control data storage and the butterfly operation. In a pipeline FFT processor the input is a time-continuous sequence, and the distance between butterfly input samples differs from stage to stage, so the input sequence must be delayed and the stored data must be read out at the appropriate time. Taking a Radix4 butterfly operation unit as an example, the module interface is shown in FIG. 3: Bank1_in, Bank2_in and Bank3_in are the earlier input samples read from three memory units; Data_in is the valid input of the current stage, and its validity is indicated by Data_in_valid; Clk and Rst_n are the global clock and reset signals; Dout_re and Dout_im are the real and imaginary parts of the output; and Dout_valid indicates the validity of the output data. The butterfly operation unit only performs the butterfly computation, which consists of complex additions and subtractions and real/imaginary part exchanges; the read/write control of data storage is handled by the control unit of the stage in which the butterfly operation unit is located.
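To illustrate why a Radix4 butterfly needs only complex additions, subtractions and real/imaginary exchanges, the following Python sketch computes a generic 4-point DFT core without any multiplier. It is not taken from FIG. 2 or FIG. 3; the function names are hypothetical, and the four arguments merely play the role of the current input and the three samples read back from storage.

```python
import cmath


def mul_neg_j(z):
    """Multiply by -j using only a real/imaginary swap and one negation."""
    return complex(z.imag, -z.real)


def radix4_butterfly(a, b, c, d):
    """4-point DFT of (a, b, c, d) built from additions, subtractions and swaps."""
    s0 = a + c               # first Radix2-like level
    s1 = a - c
    s2 = b + d
    s3 = mul_neg_j(b - d)    # -j * (b - d), no multiplier needed
    # second Radix2-like level
    return s0 + s2, s1 + s3, s0 - s2, s1 - s3


def dft4(x):
    """Reference 4-point DFT by direct summation, used only for checking."""
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / 4) for n in range(4))
            for k in range(4)]


x = [1 + 2j, -3 + 0.5j, 2 - 1j, 0.25 + 4j]
for got, ref in zip(radix4_butterfly(*x), dft4(x)):
    print(got, abs(got - ref) < 1e-9)
```

The two levels inside `radix4_butterfly` correspond to the statement that one Radix4 unit performs the work of two Radix2 stages; the twiddle factor multiplication between pipeline stages is left to the separate complex multiplier.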

The complex multiplier is shown in the right half of FIG. 2. The invention adopts an inexact radix-4 Booth multiplier, which consists of three parts: a radix-4 Booth coding unit, a partial product compression unit and a fast summation unit. The Booth coding unit encodes the multiplier and the multiplicand to generate partial products quickly while reducing the number of partial product rows and the total number of partial products. The partial product compression unit compresses the partial products to quickly obtain the final two rows of partial products, which shortens the critical path of the multiplier. The fast summation unit adds the final two rows of partial products with a fast adder to produce the final product. Of these three modules, the Booth coding units and the compression units are the most numerous in the multiplier. Taking a 16-bit multiplier as an example, 144 Booth encoders are needed to generate 144 regular partial products (excluding sign extension bits and sign compensation bits); two rows of partial products are then produced by 80 4-2 compressors and 32 carry-save adders, and a fast adder finally generates the product. An inexact optimized design of the Booth coding module can therefore improve multiplier performance to the greatest extent.
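As background for the Booth coding unit, the sketch below shows standard exact radix-4 Booth recoding in Python. It is not the approximate encoding of the invention, whose table and expression are not reproduced in this text. The function names are hypothetical, and the width of 16 bits is chosen to match the 16-bit example above.

```python
def booth_radix4_digits(b, width=16):
    """Recode a signed `width`-bit integer into width/2 digits in {-2, -1, 0, 1, 2}.

    Digit i is -2*b[2i+1] + b[2i] + b[2i-1], with b[-1] = 0.
    `width` is assumed to be even.
    """
    bits = b & ((1 << width) - 1)      # two's-complement bit pattern of b
    digits = []
    prev = 0                           # b[-1] = 0
    for i in range(width // 2):
        b0 = (bits >> (2 * i)) & 1
        b1 = (bits >> (2 * i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1                      # overlaps into the next triplet
    return digits


def booth_multiply(a, b, width=16):
    """Sum the partial products digit_i * a * 4**i (one row per Booth digit)."""
    return sum(d * a * 4 ** i
               for i, d in enumerate(booth_radix4_digits(b, width)))


print(booth_radix4_digits(6, width=4))   # digits for 0110
print(booth_multiply(1234, -567))
```

Each Booth digit selects one of 0, ±a or ±2a as a partial product row, so a 16-bit operand yields 8 rows instead of 16; this is the reduction in rows that the Booth coding unit provides before compression.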

The inexact radix-4 Booth encoding designed by the present invention is shown in Table 1 below:

TABLE 1

The partial product expression is as follows:

The logic gate structure is shown in FIG. 4.

Exact radix-4 Booth encoding produces a partial product array with a sign offset bit in the last row, which makes the array irregular. To obtain an exact final product, an exact multiplier needs one extra compression stage for this row alone, so the design requires more compressors and has a longer critical path. To obtain a more regular partial product array, the inexact radix-4 Booth multiplier in this design directly discards the sign offset bit of the last row. The partial product array dot diagram of an 8-bit multiplier is shown in FIG. 5 as an example. FIG. 5 (a) shows the irregular partial product array of an exact 8-bit multiplier after radix-4 Booth encoding, where solid black boxes represent compressors; ● denotes a regular partial product; o represents a sign extension bit; circa represents a sign compensation bit; Δ represents the sign offset bit of row 5. FIG. 5 (b) shows the regular partial product array of the inexact 8-bit multiplier after radix-4 Booth encoding, which yields the two rows of partial products for the final addition after a single compression stage.

The regular partial product array saves one compression stage in the inexact multiplier at the cost of discarding one sign offset bit, which introduces an error probability of 37.5%. Moreover, in the partial product array of the radix-4 Booth multiplier the sign compensation bit is generated at a low-weight position, so the error distance stays within the tolerable range.

The logic gate circuit of the approximate 4-2 compressor designed by the invention is shown in FIG. 6. It comprises three NOR gates, two AND gates and one OR gate, and its logic expression is as follows:

The present invention may employ multipliers with different degrees of approximation. When the inexact factor is set to 8, the lowest 8 bits of the multiplier operation are approximate and the remaining bits are exact; the dot diagram is shown in FIG. 7 (1). Only the lowest 9 bits of the final result are approximate. When the intermediate bit width of the FFT is 16 bits, only the upper 16 bits of the product are kept and the rest is truncated, so the 9 inexact bits have little influence on the final result. When the inexact factor is set to 16, the lowest 16 bits of the multiplier operation are approximate and the remaining bits are exact; the dot diagram is shown in FIG. 7 (2). Only the lowest 17 bits of the final result are approximate, and with a 16-bit intermediate width the product is again truncated to its upper 16 bits, so the influence of the 17 inexact bits on the final result is very small.
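The truncation argument can be checked numerically. The Python sketch below assumes an unsigned 32-bit product of two 16-bit operands, assumes that truncation keeps the upper 16 bits, and models the approximation simply as arbitrary values in the lowest k bits of the product (9 or 17, as stated above). The helper names are hypothetical, and this is not a model of the actual approximate multiplier.

```python
def keep_upper16(product32):
    """Keep the upper 16 bits of a 32-bit product (truncate the lower 16)."""
    return (product32 >> 16) & 0xFFFF


def corrupt_low_bits(product32, k, noise):
    """Replace the lowest k bits of the product with the lowest k bits of `noise`."""
    mask = (1 << k) - 1
    return (product32 & ~mask) | (noise & mask)


exact = 0xBEEF * 0x1234          # example 16-bit x 16-bit unsigned product

# 9 inexact bits: all of them fall inside the truncated lower half.
same = all(keep_upper16(corrupt_low_bits(exact, 9, n)) == keep_upper16(exact)
           for n in range(1 << 9))
print("k = 9, kept word unchanged:", same)

# 17 inexact bits: only the lowest kept bit can be affected.
worst = max(abs(keep_upper16(corrupt_low_bits(exact, 17, n)) - keep_upper16(exact))
            for n in (0, (1 << 17) - 1, 1 << 16, 0x15555))
print("k = 17, worst deviation in kept word (LSBs):", worst)
```

Under these assumptions, errors confined to the lowest 9 bits vanish entirely after truncation, and errors confined to the lowest 17 bits can change the kept 16-bit word by at most one least significant bit.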

The parameter results of this example were compared with those of the prior art, and the comparison results are shown in table 2 below:

TABLE 2

In the table, FFT-ex is the prior art; in FFT-1 the multiplier is replaced by an approximate Booth multiplier with an inexact factor of 8; in FFT-2 the multiplier is replaced by an approximate Booth multiplier with an inexact factor of 16.

As can be seen from the table, with an SNR reduction of no more than 1 dB, the invention reduces LUT usage relative to the prior art by 5.9% for the 64-point FFT-1, 12.7% for the 64-point FFT-2, 11.6% for the 256-point FFT-1, 20.3% for the 256-point FFT-2, 7.8% for the 1024-point FFT-1 and 16.5% for the 1024-point FFT-2. The design also significantly reduces FF usage in the FFT and raises the operating frequency to 141.8%-146.1% of the original.

The embodiments only illustrate the technical idea of the present invention and do not limit its scope of protection; any modification made to the technical scheme in accordance with the technical idea of the present invention falls within the scope of protection of the present invention.

Claims

1. An FFT processor based on an approximate complex multiplier, comprising a plurality of basic units cascaded in sequence, wherein each basic unit comprises a butterfly operation unit and m feedback units; each butterfly operation unit comprises a signal input end, a signal output end, m feedback input ends and m feedback output ends, m being a positive integer; each feedback output end is connected to the corresponding feedback input end through the corresponding feedback unit; the signal output end of the butterfly operation unit in the previous basic unit is connected to the signal input end of the butterfly operation unit in the next basic unit through a complex multiplier, and the signal output by the butterfly operation unit in the previous basic unit is multiplied by a twiddle factor in the complex multiplier and then used as the input signal of the butterfly operation unit in the next basic unit; the complex multiplier comprises first and second subtractors, first to third adders and first to third multipliers; the two input ends of the first subtractor respectively receive the real part and the imaginary part of the output signal of the butterfly operation unit in the previous basic unit; the two input ends of the second subtractor respectively receive the real part and the imaginary part of the twiddle factor; the two input ends of the first adder respectively receive the real part and the imaginary part of the output signal of the butterfly operation unit in the previous basic unit; the two input ends of the first multiplier respectively receive the output signal of the first subtractor and the imaginary part of the twiddle factor; the two input ends of the second multiplier respectively receive the output signal of the second subtractor and the real part of the output signal of the butterfly operation unit in the previous basic unit; the two input ends of the third multiplier respectively receive the output signal of the first adder and the real part of the twiddle factor; the two input ends of the second adder respectively receive the output signal of the first multiplier and the output signal of the second multiplier; the two input ends of the third adder respectively receive the inverted output signal of the second multiplier and the output signal of the third multiplier; each multiplier comprises a Booth coding unit, a partial product compression unit and a fast summation unit, wherein the Booth coding unit encodes the two operands to quickly generate partial products, the partial product compression unit compresses the generated partial products to quickly obtain two rows of partial products, and the fast summation unit adds the two rows of partial products with a fast adder to generate the final product;

characterized in that: the Booth coding unit and the partial product compression unit are approximated, and the partial product expression of the approximated Booth coding unit is as follows:

wherein the generated partial products are arranged as a partial product array, pp_ij is the partial product in row i and column j of the partial product array, a_j is the j-th bit of one multiplier operand, b_{2i+1} is the (2i+1)-th bit of the other multiplier operand, and ⊕ represents an exclusive-OR operation;

2. The FFT processor based on an approximate complex multiplier of claim 1, wherein: after the Booth coding unit generates the partial product array, the sign compensation bit of the last row is directly deleted.

3. The FFT processor based on an approximate complex multiplier of claim 1, wherein: an inexact factor n is set, and only the n least significant bits of the multiplier are approximated, n being a positive integer.