CN112328958A - Optimized data rearrangement method based on base-64 two-dimensional FFT architecture - Google Patents

Optimized data rearrangement method based on base-64 two-dimensional FFT architecture

Info

Publication number
CN112328958A
Authority
CN
China
Prior art keywords
base
architecture
block
output
fft
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011245309.9A
Other languages
Chinese (zh)
Other versions
CN112328958B (en)
Inventor
曹宁
吴子诚
冯晔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU
Priority to CN202011245309.9A
Publication of CN112328958A
Application granted
Publication of CN112328958B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/14 Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/141 Discrete Fourier transforms
    • G06F17/142 Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm

Landscapes

  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Discrete Mathematics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an optimized data rearrangement method based on a base-64 two-dimensional FFT architecture, which belongs to the technical field of signal processing. It provides a new two-dimensional FFT architecture that uses an efficient data rearrangement technique and a base-64 algorithm. The architecture realizes a 64 x 64 two-dimensional FFT by cascading two parallel base-64 blocks. In the base-64 structure, data rearrangement is performed using a six-bit mode selection signal as the control signal. The base-64 structure provided by the invention significantly reduces the intermediate memory in the one-dimensional FFT and reduces the delay; the proposed two-dimensional FFT architecture reduces the number of intermediate memory units between the two one-dimensional FFTs from N^2 to N; and the method is flexible, can be applied in many settings, and is particularly suitable for data reconstruction of original images.

Description

Optimized data rearrangement method based on base-64 two-dimensional FFT architecture
Technical Field
The invention belongs to the technical field of signal processing, and particularly relates to an optimized data rearrangement method based on a base-64 two-dimensional FFT architecture.
Background
Over the past few decades, research and applications in the field of signal processing have seen explosive growth. Digital Signal Processing (DSP) has wide applications in the fields of biomedical imaging, multimedia, digital television, broadcasting, and the like. Due to the development of Very Large Scale Integration (VLSI) technology, the implementation of these applications is possible. Developing hardware solutions for these applications has been an active area of research over the last two decades.
The Discrete Fourier Transform (DFT) is an important component of DSP and communication systems. The Fast Fourier Transform (FFT) is the most common fast method of computing the DFT. Two-dimensional FFTs are widely applied to data reconstruction of original images and must be implemented efficiently enough for real-time use. Image processing applications require large memory to support real-time processing of image data. Therefore, a suitable architecture is needed that optimizes memory and supports images of various sizes while providing the required throughput.
Cooley-Tukey is a common algorithm for computing the FFT because it reduces the computational complexity from O(N^2) for the direct DFT to O(N log_2 N). The N-point one-dimensional DFT of a sequence x(n) can be calculated by equation (1):
$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, 2, \ldots, N-1 \tag{1}$$
where W_N is the twiddle factor given by equation (2):
$$W_N = e^{-2\pi i/N} \tag{2}$$
An N-point inverse DFT (IDFT) can be calculated by equation (3):
$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, W_N^{-nk}, \qquad n = 0, 1, 2, \ldots, N-1 \tag{3}$$
In addition to the logic used for the DFT, the IDFT requires some additional logic, such as division and conjugation operations. The two-dimensional FFT can be calculated from one-dimensional FFTs: an N × N two-dimensional FFT can be calculated by 2N one-dimensional FFTs, where N is the sequence length. The N × N two-dimensional DFT of an input x(i_1, i_2) is calculated by equation (4):
$$X(k_1, k_2) = \sum_{i_1=0}^{N-1} \sum_{i_2=0}^{N-1} x(i_1, i_2)\, W_N^{i_1 k_1}\, W_N^{i_2 k_2} \tag{4}$$
where k_1, k_2 = 0, 1, 2, ..., N-1.
With two one-dimensional DFTs, a two-dimensional DFT can be performed based on a row-column decomposition algorithm, as shown in the following equations:
$$Y(k_1, i_2) = \sum_{i_1=0}^{N-1} x(i_1, i_2)\, W_N^{i_1 k_1}$$
where k_1 = 0, 1, 2, ..., N-1, and
$$X(k_1, k_2) = \sum_{i_2=0}^{N-1} Y(k_1, i_2)\, W_N^{i_2 k_2}$$
where k_2 = 0, 1, 2, ..., N-1.
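The row-column decomposition can be illustrated with a minimal Python sketch. It is not the patented hardware: it uses a direct O(N^2) one-dimensional DFT for clarity, and the function names `dft` and `dft2_row_column` are hypothetical names chosen for this illustration.

```python
import cmath

def dft(seq):
    """Direct N-point DFT: X(k) = sum_n x(n) * W_N^(n*k), W_N = exp(-2*pi*i/N)."""
    n_pts = len(seq)
    return [sum(seq[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts)
                for n in range(n_pts))
            for k in range(n_pts)]

def dft2_row_column(x):
    """N x N two-dimensional DFT computed with 2N one-dimensional DFTs."""
    n_pts = len(x)
    # First pass: for each fixed i2, transform along i1 (N one-dimensional DFTs).
    along_i1 = [dft([x[i1][i2] for i1 in range(n_pts)]) for i2 in range(n_pts)]
    # Intermediate array y[k1][i2]; it holds N*N values in this software model.
    y = [[along_i1[i2][k1] for i2 in range(n_pts)] for k1 in range(n_pts)]
    # Second pass: for each fixed k1, transform along i2 (N more DFTs).
    return [dft(y[k1]) for k1 in range(n_pts)]
```

In this software model the intermediate array `y` holds all N^2 values at once, which is the storage cost that the architecture described later aims to reduce.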
The decomposition size of the FFT is represented by a radix.
Existing FFT implementations have a variety of hardware and software solutions. Hardware implementation provides better performance and is more suitable for real-time embedded applications. Software solutions, such as general purpose processors and graphics processing units, are power hungry and are not suitable for real-time applications.
Pipelined architectures for FFT algorithms typically incur more area overhead and higher power consumption. Pipelined architectures based on the radix-2 linear decomposition are single-path delay feedback (SDF) or multi-path delay commutator (MDC) architectures. For large FFTs, a memory-based architecture occupies less memory area and consumes less power than a pipelined FFT. Output reordering is a major functional block in the design of FFT architectures. The purpose of the reordering is to convert the non-natural-order FFT output to natural order.
At present, there is little research on FFT output rearrangement techniques and their complexity, and no mature technique has emerged. In most existing architectures, reordering requires dedicated hardware or a large amount of memory, and it takes a large number of clock cycles to execute.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to provide an optimized data rearrangement method based on a base-64 two-dimensional FFT architecture, which reduces the number of operation execution cycles.
The technical scheme is as follows: in order to achieve the purpose, the invention provides the following technical scheme:
a data rearrangement optimizing method based on a base-64 two-dimensional FFT architecture comprises the following steps:
(1) designing a parallel pipeline architecture realized on an ASIC (application specific integrated circuit) and an FPGA (field programmable gate array) by utilizing the high regularity of an FFT (fast Fourier transform algorithm);
(2) with the parallel pipeline architecture, data reordering is performed using a six-bit mode select signal as a control signal.
Further, in step (1), the parallel pipeline architecture realizes a 64 × 64 two-dimensional FFT architecture by cascading two parallel-unrolled radix-64 blocks, and the 64 × 64 FFT architecture is expressed using a novel radix-64 algorithm based on a radix-4 butterfly unit.
Further, in step (2), in the 64 × 64 two-dimensional FFT architecture, data rearrangement is performed using a six-bit mode selection signal as a control signal.
Further, the two-dimensional FFT is obtained by two one-dimensional N-point FFT calculations; an N × N two-dimensional FFT consists of N one-dimensional FFTs in the row direction and N one-dimensional FFTs in the column direction, and the N^2 intermediate values generated between the two one-dimensional FFTs must be stored.
Further, the method for expressing the 64 × 64 FFT architecture using the novel radix-64 algorithm based on the radix-4 butterfly unit specifically includes the following: a fully unrolled radix-64 architecture uses parallel radix-4 butterfly units as its basic sub-blocks; the radix-4 butterfly unit has four parallel inputs and a two-bit control input, called the mode selection; the mode select signal determines which one of the four outputs is generated; outputs can therefore be generated in an arbitrary order according to the mode selection signal; the equation for the radix-4 butterfly unit used is as follows:
$$\begin{bmatrix} X(0) \\ X(1) \\ X(2) \\ X(3) \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -i & -1 & i \\ 1 & -1 & 1 & -1 \\ 1 & i & -1 & -i \end{bmatrix} \begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \end{bmatrix}$$
wherein x is the time-domain signal sequence and X is the frequency-domain signal sequence; the first stage has 16 twiddle factor read-only memories (ROMs) to store W16; each ROM contains four twiddle factor values; the second stage includes four base 4 engines and four ROMs for storing W64; each ROM of the second stage holds 16 twiddle factor values; the mode selection is a 6-bit control signal, of which two bits are assigned to each stage; one of the 64 outputs is generated according to the mode of each stage; thus, the remaining outputs of the base-64 block are obtained in reordered form in this architecture, saving the memory and logic resources otherwise needed for reordering; initially, the mode selections of all base 4 engines are configured as mode 0; the output produced by the first stage is multiplied by the corresponding twiddle factor; the second stage performs similar operations; here, four base 4 engines are required to process the 16 outputs obtained from the first stage; likewise, the first four outputs are generated by configuring the mode selection of all four base 4 engines as mode 0, keeping the base 4 engines of the first stage themselves in mode 0; using these four outputs, the output required for the third stage is generated, and another mode selection is used in the final stage.
Further, the data scheduling in the 64 × 64FFT architecture includes the following steps:
consider input data of size 64 × 64, stored in RAM and given by a matrix A, where A_{i,j} is the element in row i and column j; the data scheduling and reordering process in the first base-64 block is as follows: starting with the mode select signal set to 0, the inputs in cycle 1 are A_{1,1}, A_{2,1}, A_{3,1}, ..., A_{64,1}; performing an FFT on these 64 inputs produces, in the first cycle, the first output of the first row of the B matrix, denoted B_{1,1};
in cycle 2, the inputs A_{1,2}, A_{2,2}, A_{3,2}, ..., A_{64,2} are applied with the mode select signal held at 0, and the output generated is B_{1,2}; in a similar manner, in cycle 64, the inputs A_{1,64}, A_{2,64}, A_{3,64}, ..., A_{64,64} are applied with mode select 0, generating output B_{1,64}; all elements in the first row have thus been calculated; in the 65th cycle, all elements of the first row of the B matrix, namely B_{1,1}, B_{1,2}, ..., B_{1,64}, are input in parallel to the second radix-64 block; as soon as the first base-64 block has calculated one row, the second base-64 block performs the next stage of calculation; an intermediate register between the two base-64 blocks stores the output of the first block for use by the second block; in the second base-64 block, the mode select signal steps from 0 to 63, changing every cycle, producing the 64 outputs C_{1,1}, C_{1,2}, ..., C_{1,64};
a sequentially increasing mode select signal cannot be used for the first block, because the output of the first base-64 block would then have to be reordered before being applied to the second block; instead, the mode selection of the first base-64 block must be given in the order 0, 16, 32, 48, 1, 17, 33, 49, ..., 63;
in the 65th cycle, the mode selection of the first base-64 block becomes 16, and the inputs A_{1,1}, A_{2,1}, A_{3,1}, ..., A_{64,1} are applied; performing the FFT on these 64 inputs, the first output of the second row of the B matrix, B_{2,1}, is calculated in the 65th cycle; in the 66th cycle, the second output of the second row of the B matrix, B_{2,2}, is calculated, and so on; in the 128th cycle, B_{2,64} is calculated; thus, all elements in the second row are calculated;
in the 129th cycle, the second block starts computing with the second row of the B matrix, with the mode selection set to 0, 1, 2, 3, ..., 63 over 64 cycles, producing the second-row elements of the C matrix; by stepping the first block's mode selection through the remaining values up to 63 and repeating this process, all elements of the 64 × 64 B matrix, which is the output of the first base-64 block, are obtained; one B matrix output is generated in each cycle; this completes the column-wise calculation of the two-dimensional FFT.
Further, in the output of the two-dimensional FFT, elements in a C matrix give a final two-dimensional FFT output; generating an output from the second block each clock cycle; therefore, 4096 cycles are required to produce all 4096 outputs of the two-dimensional FFT.
Further, the two-dimensional FFT architecture is realized by parallel unrolling of the base-64 block, and its output order is controlled by several control bits; a one-dimensional FFT is performed with the first radix-64 block, the output of which is fed to the second radix-64 block to perform a row-by-row FFT and obtain the two-dimensional FFT; the first processor performs 64-point FFT operations, giving 4K (4096) intermediate values; the second FFT processor performs 64-point FFT operations on these outputs and gives the final 64 × 64-point FFT; the two base-64 blocks are identical.
Beneficial effects: compared with the prior art, the optimized data rearrangement method based on the radix-64 two-dimensional FFT architecture provided by the invention has an efficient output rearrangement technique and uses parallel radix-4 butterfly units; it uses a 6-bit control signal; the working memory of the one-dimensional FFT is reduced, and the number of intermediate memory units of the two-dimensional FFT is reduced from N^2 to N; an ASIC and FPGA implementation architecture is proposed; and the number of operation execution cycles is reduced.
Drawings
FIG. 1 is a parallel unfolding structure of the base-64 block;
FIG. 2 is a data schedule for a first base-64 block;
FIG. 3 is a data schedule for a second base-64 block;
FIG. 4 is a proposed two-dimensional FFT architecture using radix-64 blocks;
FIG. 5 is a comparison of the time consumption of the proposed architecture with that of the existing architecture;
FIG. 6 is a comparison of the hardware complexity of the proposed radix-64 line-parallel architecture with the existing radix-2 line architecture.
Detailed Description
The invention will be further described with reference to the following drawings and specific embodiments.
A data rearrangement optimizing method based on a base-64 two-dimensional FFT architecture is suitable for implementing a two-dimensional FFT on an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA). By utilizing the high regularity of the FFT algorithm, a parallel pipeline architecture that can be realized on an ASIC and an FPGA is designed. The two-dimensional FFT is calculated by two one-dimensional N-point FFTs. The performance of the one-dimensional FFT directly affects the performance of the two-dimensional FFT. An N × N two-dimensional FFT can be viewed as N one-dimensional FFTs in the row direction and N one-dimensional FFTs in the column direction, and the N^2 intermediate values generated between the two one-dimensional FFTs must be stored.
In the present invention, the proposed 64 x 64 FFT architecture is expressed using a novel radix-64 algorithm based on a radix-4 butterfly unit. Two radix-64 blocks are cascaded to calculate a 64 x 64 complex-point FFT, and the outputs are reordered within the radix-64 blocks as they are generated, which saves intermediate memory and reduces delay. The approach is to implement the two-dimensional FFT using a radix-64 parallel-unrolled architecture.
Proposed base-64 architecture: a fully unrolled radix-64 architecture uses parallel radix-4 butterfly units, which are the basic sub-blocks of the proposed architecture. The base 4 unit has four parallel inputs and a two-bit control input, called the mode select. The mode select signal determines which one of the four outputs is generated. The outputs may therefore be generated in any order according to the mode select signal, whereas in a conventional radix-4 butterfly unit the outputs are generated in a fixed order. The equation for the radix-4 butterfly unit used is as follows:
$$\begin{bmatrix} X(0) \\ X(1) \\ X(2) \\ X(3) \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -i & -1 & i \\ 1 & -1 & 1 & -1 \\ 1 & i & -1 & -i \end{bmatrix} \begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \end{bmatrix}$$
where x is the time-domain signal sequence and X is the frequency-domain signal sequence. The first stage has 16 twiddle factor read-only memories (ROMs) to store W16. Each ROM contains four twiddle factor values. The second stage includes four base 4 engines and four ROMs for storing W64. Each ROM of the second stage holds 16 twiddle factor values. The mode selection is a 6-bit control signal, of which two bits are assigned to each stage. One of the 64 outputs is generated according to the mode of each stage. Thus, the remaining outputs of the base-64 block are obtained in reordered form in this architecture, which saves the memory and logic cell resources otherwise needed for reordering. Initially, the mode selections of all base 4 engines are configured as mode 0. The output produced by the first stage is multiplied by the corresponding twiddle factor. The second stage performs similar operations. Here, four base 4 engines are required to process the 16 outputs obtained from the first stage. Likewise, the first four outputs are generated by configuring the mode selection of all four base 4 engines as mode 0, keeping the base 4 engines of the first stage themselves in mode 0. Using these four outputs, the output required for the third stage can be generated, and another mode selection is used in the final stage.
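A mode-selected radix-4 butterfly of the kind described above might be sketched in Python as follows. The sketch assumes the butterfly is a standard 4-point DFT whose output index equals the two-bit mode value; the function name `radix4_select` and that mode-to-output mapping are assumptions made for illustration, not details confirmed by the text.

```python
def radix4_select(x0, x1, x2, x3, mode):
    """Return a single output of a 4-point DFT, chosen by `mode` (0..3).

    Assumption: mode m selects X(m) = sum_n x(n) * W_4^(n*m).
    """
    # Powers of W_4 = exp(-2*pi*i/4) = -i are exactly 1, -i, -1, i,
    # so no general complex multiplier is needed inside the butterfly.
    w4 = (1, -1j, -1, 1j)
    inputs = (x0, x1, x2, x3)
    return sum(value * w4[(n * mode) % 4] for n, value in enumerate(inputs))
```

Calling the function with the same four inputs and modes 0, 1, 2, 3 yields the four butterfly outputs one at a time, in whatever order the caller chooses.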
Data scheduling in the proposed structure: consider input data of size 64 x 64, stored in RAM and given by a matrix A, where A_{i,j} is the element in the i-th row and j-th column.
The data scheduling and reordering process in the first base-64 block is as follows. Starting with the mode select signal set to 0, the inputs in cycle 1 are A_{1,1}, A_{2,1}, A_{3,1}, ..., A_{64,1}. Performing an FFT on these 64 inputs produces, in the first cycle, the first output of the first row of the B matrix, denoted B_{1,1}. In cycle 2, the inputs A_{1,2}, A_{2,2}, A_{3,2}, ..., A_{64,2} are applied with the mode select signal held at 0, and the output generated is B_{1,2}. In a similar manner, in cycle 64, the inputs A_{1,64}, A_{2,64}, A_{3,64}, ..., A_{64,64} are applied with mode select 0, generating output B_{1,64}. Thus, all elements in the first row are calculated. In the 65th cycle, all elements of the first row of the B matrix, namely B_{1,1}, B_{1,2}, ..., B_{1,64}, are input in parallel to the second base-64 block. Here, the second base-64 block starts the next stage of calculation as soon as the first base-64 block has calculated one row. An intermediate register between the two base-64 blocks stores the output of the first block for use by the second block. In the second base-64 block, the mode select signal steps from 0 to 63, changing every cycle, producing the 64 outputs C_{1,1}, C_{1,2}, ..., C_{1,64}.
A sequentially increasing mode select signal cannot be used for the first block, because the output of the first base-64 block would then have to be reordered before being applied to the second block. Instead, the mode selection of the first base-64 block must be given in the order 0, 16, 32, 48, 1, 17, 33, 49, ..., 63.
In the 65th cycle, the mode selection of the first base-64 block becomes 16, and the inputs A_{1,1}, A_{2,1}, A_{3,1}, ..., A_{64,1} are applied. Performing the FFT on these 64 inputs, the first output of the second row of the B matrix, B_{2,1}, is calculated in the 65th cycle. In the 66th cycle, the second output of the second row of the B matrix, B_{2,2}, is calculated, and so on. In the 128th cycle, B_{2,64} is calculated. Thus, all elements in the second row are calculated.
In the 129th cycle, the second block starts computing with the second row of the B matrix, with the mode selection set to 0, 1, 2, 3, ..., 63 over 64 cycles, producing the second-row elements of the C matrix. By stepping the first block's mode selection through the remaining values up to 63 and repeating this process, all elements of the 64 x 64 B matrix, which is the output of the first base-64 block, are obtained. One B matrix output is generated in each cycle. This completes the column-wise calculation of the two-dimensional FFT.
The elements in the C matrix give the final two-dimensional FFT output. One output is generated from the second block every clock cycle. Therefore, 4096 cycles are required to produce all 4096 outputs of the two-dimensional FFT.
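The timing walked through above can be summarized in a short Python sketch. It only restates the cycle numbers given in the text and ignores any internal pipeline latency of the blocks, which is an assumption; the function names are hypothetical.

```python
BLOCK = 64  # the transform is 64 x 64

def first_block_cycle(row, col):
    """Clock cycle (1-based) in which the first block produces B[row][col]."""
    # Row 1 occupies cycles 1..64, row 2 occupies cycles 65..128, and so on.
    return BLOCK * (row - 1) + col

def second_block_cycle(row, col):
    """Clock cycle (1-based) in which the second block produces C[row][col]."""
    # Row `row` of B is complete after cycle BLOCK * row; the second block
    # then emits one element of C per cycle as its mode steps from 0 to 63.
    return BLOCK * row + col

def output_cycle_count():
    """Number of distinct cycles in which a C element is produced."""
    cycles = {second_block_cycle(r, c)
              for r in range(1, BLOCK + 1) for c in range(1, BLOCK + 1)}
    return len(cycles)
```

Under these assumptions the second block works on row r while the first block is already computing row r + 1, so only one row of B needs to be held between the blocks at any time.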
Two-dimensional FFT architecture: a parallel unrolling implementation is used for the base-64 block. The order of the outputs is controlled by several control bits. In this architecture, for a given set of inputs, at each stage, only the outputs required for the next stage are calculated without any loss of performance, so most intermediate buffers can be avoided. Therefore, there is a great optimization in terms of memory and latency.
A one-dimensional FFT is performed with the first radix-64 block, the output of which is fed to the second radix-64 block to perform a row-by-row FFT and obtain the two-dimensional FFT. The first processor performs 64-point FFT operations, giving 4K (4096) intermediate values. The second FFT processor performs 64-point FFT operations on these outputs and gives the final 64 x 64-point FFT. The two base-64 blocks are identical.
Examples
Fig. 1 shows the parallel-unrolled architecture of the base-64 block. The first stage has 16 base 4 units, the second stage has 4 base 4 units, and the third stage has 1 base 4 unit. In this architecture, all base 4 blocks are the same. The symbols R4_0, R4_4, R4_8, R4_12 denote the 0th, 4th, 8th, and 12th radix-4 butterflies, and so on. W16 and W64 denote the twiddle factors of the first and second stages. The first stage has 16 twiddle factor read-only memories (ROMs) to store W16. Each ROM contains four twiddle factor values. The second stage includes four base 4 units and four ROMs for storing W64. Each ROM of the second stage holds 16 twiddle factor values.
Of the N multipliers in each stage, the first N/4 multipliers have the same twiddle factor, so these multipliers are removed in the implementation. Thus, the first stage has 12 multipliers instead of 16, and the second stage has 3 multipliers instead of 4.
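The multiplier counts can be checked with a small Python sketch under one assumed index mapping: the input index is split as n = 16*n1 + 4*n2 + n3 and the output index as k = k1 + 4*k2 + 16*k3, which is a textbook three-stage Cooley-Tukey decomposition and not necessarily the exact wiring of Fig. 1. Under that assumption, a multiplier is removable when every entry of its ROM equals 1.

```python
import cmath

def twiddle(n_pts, exponent):
    """W_n^e = exp(-2*pi*i*e/n)."""
    return cmath.exp(-2j * cmath.pi * exponent / n_pts)

def is_trivial(rom, tol=1e-12):
    """A ROM whose entries are all 1 needs no multiplier."""
    return all(abs(value - 1) < tol for value in rom)

# Stage 1: one ROM per butterfly (n2, n3), four entries indexed by mode k1.
stage1_roms = [[twiddle(16, n2 * k1) for k1 in range(4)]
               for n2 in range(4) for n3 in range(4)]

# Stage 2: one ROM per butterfly n3, sixteen entries indexed by modes (k1, k2).
stage2_roms = [[twiddle(64, n3 * (k1 + 4 * k2))
                for k1 in range(4) for k2 in range(4)]
               for n3 in range(4)]

stage1_multipliers = sum(not is_trivial(rom) for rom in stage1_roms)
stage2_multipliers = sum(not is_trivial(rom) for rom in stage2_roms)
```

With this mapping the ROM sizes come out as 4 and 16 entries, and the counts of non-trivial multipliers are 12 and 3, in agreement with the figures stated in the text.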
The mode selection is a 6 bit control signal. Two of which are assigned to each phase. One of 64 outputs is generated according to the pattern of each stage. Thus, obtaining the remaining outputs of the base-64 block in a reordered form in this architecture may save memory and logic device resources for reordering. Initially, all mode selections for the base 4 unit are configured as mode 0. The output produced by the first stage is multiplied by the corresponding twiddle factor. The second phase performs similar operations. Here, four base 4 units are required to process the 16 outputs obtained from the first stage. Likewise, the first four outputs are generated by configuring the mode selection of all four base 4 cells as mode 0, keeping the base 4 cells of the first stage themselves in mode 0. Now, using these four outputs, the output required for the third stage can be generated and another mode selection used in the final stage.
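The three-stage, one-output-per-evaluation behaviour described above can be modelled in Python as a sketch. It assumes the index split n = 16*n1 + 4*n2 + n3 and k = k1 + 4*k2 + 16*k3, where k1, k2, k3 are the two-bit modes of stages 1, 2, 3; this mapping and the names `bfly4` and `radix64_one_output` are assumptions for illustration, not the confirmed wiring of the architecture.

```python
import cmath

def twiddle(n_pts, exponent):
    """W_n^e = exp(-2*pi*i*e/n)."""
    return cmath.exp(-2j * cmath.pi * exponent / n_pts)

def bfly4(inputs, mode):
    """One selected output of a 4-point DFT (mode-selected radix-4 unit)."""
    return sum(value * twiddle(4, n * mode) for n, value in enumerate(inputs))

def radix64_one_output(x, k1, k2, k3):
    """Compute X(k1 + 4*k2 + 16*k3) of a 64-point DFT of x."""
    # Stage 1: 16 butterflies, one per (n3, n2), each yielding one value
    # that is then scaled by a W_16 twiddle factor.
    stage1 = [[bfly4([x[16 * n1 + 4 * n2 + n3] for n1 in range(4)], k1)
               * twiddle(16, n2 * k1)
               for n2 in range(4)]
              for n3 in range(4)]
    # Stage 2: 4 butterflies, one per n3, each scaled by a W_64 twiddle.
    stage2 = [bfly4(stage1[n3], k2) * twiddle(64, n3 * (k1 + 4 * k2))
              for n3 in range(4)]
    # Stage 3: a single butterfly produces the requested output.
    return bfly4(stage2, k3)
```

Each call evaluates 16 + 4 + 1 butterfly selections and returns exactly one of the 64 outputs, so the caller decides the output order simply by the sequence of (k1, k2, k3) values it supplies.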
Figs. 2 and 3 show the data scheduling for the base-64 blocks, where a is the mode 0 output, b is the mode 16 output, and c is the mode 63 output. The data scheduling and reordering process in the first base-64 block is as follows. Starting with the mode select signal set to 0, the inputs in cycle 1 are A_{1,1}, A_{2,1}, A_{3,1}, ..., A_{64,1}. Performing an FFT on these 64 inputs produces, in the first cycle, the first output of the first row of the B matrix, denoted B_{1,1}. In cycle 2, the inputs A_{1,2}, A_{2,2}, A_{3,2}, ..., A_{64,2} are applied with the mode select signal held at 0, and the resulting output is B_{1,2}. In a similar manner, in cycle 64, the inputs A_{1,64}, A_{2,64}, A_{3,64}, ..., A_{64,64} are applied with mode select 0, generating output B_{1,64}. Thus, all elements in the first row are calculated. In the 65th cycle, all elements of the first row of the B matrix, namely B_{1,1}, B_{1,2}, ..., B_{1,64}, are input in parallel to the second base-64 block. Here, the second base-64 block starts the next stage of calculation as soon as the first base-64 block has calculated one row. An intermediate register between the two base-64 blocks stores the output of the first block for use by the second block. In the second base-64 block, the mode select signal steps from 0 to 63, changing every cycle, producing the 64 outputs C_{1,1}, C_{1,2}, ..., C_{1,64}.
A sequentially increasing mode select signal cannot be used for the first block, because the output of the first base-64 block would then have to be reordered before being applied to the second block. Instead, the mode selection of the first base-64 block must be given in the order 0, 16, 32, 48, 1, 17, 33, 49, ..., 63.
In the 65th cycle, the mode selection of the first base-64 block becomes 16, and the inputs A_{1,1}, A_{2,1}, A_{3,1}, ..., A_{64,1} are applied. Performing the FFT on these 64 inputs, the first output of the second row of the B matrix, B_{2,1}, is calculated in the 65th cycle. In the 66th cycle, the second output of the second row of the B matrix, B_{2,2}, is calculated, and so on. In the 128th cycle, B_{2,64} is calculated. Thus, all elements in the second row are calculated.
In the 129th cycle, the second block starts computing with the second row of the B matrix, with the mode selection set to 0, 1, 2, 3, ..., 63 over 64 cycles, producing the second-row elements of the C matrix. By stepping the first block's mode selection through the remaining values up to 63 and repeating this process, all elements of the 64 x 64 B matrix, which is the output of the first base-64 block, are obtained. One B matrix output is generated in each cycle. This completes the column-wise calculation of the two-dimensional FFT.
The elements in the C matrix give the final two-dimensional FFT output. One output is generated from the second block every clock cycle. Therefore, 4096 cycles are required to produce all 4096 outputs of the two-dimensional FFT.
Fig. 4 shows the proposed two-dimensional FFT architecture using radix-64 blocks. The order of the outputs is controlled by several control bits. In this architecture, for a given set of inputs, at each stage, only the outputs required for the next stage are calculated without any loss of performance, so most intermediate buffers can be avoided, with great optimization in terms of memory and latency, as shown by comparison of fig. 5 and 6.
A one-dimensional FFT is performed with the radix-64 block shown in fig. 1, the output of which is fed to a second radix-64 block to perform a row-by-row FFT, resulting in the two-dimensional FFT. The first processor performs 64-point FFT operations, giving 4K (4096) intermediate values. The second FFT processor performs 64-point FFT operations on these outputs and gives the final 64 x 64-point FFT. The two base-64 blocks are identical.
Input buffering: the input memory block consists of two sets of 64 RAMs; while one set is being written with input data, the other set is being read. The inputs are written to consecutive locations in RAM, i.e., RAM0 receives the first 64 inputs, RAM1 receives the next 64 inputs starting from the 65th, and so on. During a read operation, each RAM provides one input, so that 64 inputs are available to the base-64 block. The read operation is done in parallel, and the read address is the same for all RAMs.
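The write and read addressing of one set of 64 RAMs can be modelled with a minimal Python sketch. It assumes samples arrive in row-major order and are numbered from 0; the helper names are hypothetical, and the second set of RAMs used for double buffering is omitted for brevity.

```python
RAM_COUNT = 64   # RAMs per set
RAM_DEPTH = 64   # words per RAM

def write_location(sample_index):
    """Map the i-th incoming sample (0-based) to (ram_index, address)."""
    # Consecutive samples fill RAM0 first, then RAM1, and so on.
    return divmod(sample_index, RAM_DEPTH)

def fill_rams(samples):
    """Write a full 64 x 64 frame into one set of RAMs."""
    rams = [[None] * RAM_DEPTH for _ in range(RAM_COUNT)]
    for index, sample in enumerate(samples):
        ram_index, address = write_location(index)
        rams[ram_index][address] = sample
    return rams

def parallel_read(rams, address):
    """Read the same address from every RAM, giving 64 values at once."""
    return [ram[address] for ram in rams]
```

If the frame is written row by row, each RAM holds one matrix row, so a parallel read at address a returns one full column of the matrix, which is the set of 64 inputs the first block consumes in a single cycle.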
Inter-stage processing: the inter-stage processing consists of two sets of 64 registers each. The first set of registers is arranged as a chain of 64 shift registers. The output of the first base-64 block is fed serially into the first set, which is shifted forward in each clock cycle. Once every 64 cycles, the outputs of all 64 registers in the first set are loaded in parallel into the second set. The registers in the second set are used as the inputs to the second base-64 block.
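The two register sets can be modelled behaviourally in Python as a sketch. The shift direction, the reset values, and the class name `InterStage` are assumptions made for illustration only.

```python
class InterStage:
    """Behavioural model: a 64-deep shift chain plus a parallel-load bank."""

    def __init__(self, depth=64):
        self.depth = depth
        self.shift = [0] * depth   # first set: shift-register chain
        self.hold = [0] * depth    # second set: feeds the second block
        self.count = 0             # samples shifted in since the last load

    def clock(self, sample):
        """Advance one clock cycle with one new output of the first block.

        Returns True on the cycle in which the second set is reloaded.
        """
        # Shift by one position; the newest sample enters at the end.
        self.shift = self.shift[1:] + [sample]
        self.count += 1
        if self.count == self.depth:
            # Once every `depth` cycles, copy the whole chain in parallel.
            self.hold = list(self.shift)
            self.count = 0
            return True
        return False
```

Because the second set keeps its contents for 64 cycles, the second block can step through all of its modes on one row while the first set is already collecting the next row.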
Control circuit: the control circuit consists of a 12-bit up counter. The RAMs on the input side require 6 bits for addressing, and there are two such memory blocks in the proposed architecture. Read addresses, write addresses and chip select signals are generated from the counter. In addition, the mode select signals of the two base-64 blocks are generated from the counter. When writing, a chip select signal is generated separately for each RAM. When reading, the same location of all RAMs is accessed in parallel and the control signals are generated accordingly. All these signals are synchronized with respect to the delay of the previous stage.
Continuous flow FFT: the proposed FFT employs a new type of data scheduling mechanism that supports continuous streaming data. Here, the butterfly unit continuously performs calculations on the stream data. The FFT processor receives one input sample per clock cycle.

Claims (8)

1. A data rearrangement optimizing method based on a base-64 two-dimensional FFT architecture is characterized in that: the method comprises the following steps:
(1) designing a parallel pipeline architecture realized on an ASIC (application specific integrated circuit) and an FPGA (field programmable gate array) by utilizing the high regularity of an FFT (fast Fourier transform algorithm);
(2) with the parallel pipeline architecture, data reordering is performed using a six-bit mode select signal as a control signal.
2. The optimized data rearrangement method based on radix-64 two-dimensional FFT architecture of claim 1, wherein: in the step (1), the parallel pipeline architecture is a 64 × 64 two-dimensional FFT architecture realized by cascading two parallel-unrolled radix-64 blocks, and the 64 × 64 FFT architecture is expressed by a radix-64 algorithm based on a radix-4 butterfly unit.
3. The optimized data rearrangement method based on radix-64 two-dimensional FFT architecture of claim 2, wherein: in step (2), in the 64 × 64 two-dimensional FFT architecture, data rearrangement is performed using a six-bit mode selection signal as a control signal.
4. The optimized data rearrangement method based on radix-64 two-dimensional FFT architecture of claim 3, wherein: the two-dimensional FFT is obtained by two one-dimensional N-point FFT calculations; an N × N two-dimensional FFT consists of N one-dimensional FFTs in the row direction and N one-dimensional FFTs in the column direction, and the N^2 intermediate values generated between the two one-dimensional FFTs must be stored.
5. The optimized data rearrangement method based on radix-64 two-dimensional FFT architecture of claim 3, wherein: the method for expressing the 64 × 64 FFT architecture using the novel radix-64 algorithm based on the radix-4 butterfly unit specifically comprises the following: a fully unrolled radix-64 architecture uses parallel radix-4 butterfly units as its basic sub-blocks; the radix-4 butterfly unit has four parallel inputs and a two-bit control input, called the mode selection; the mode select signal determines which one of the four outputs is generated; outputs are generated in an arbitrary order according to the mode selection signal;
the first stage has 16 twiddle factor Read Only Memory (ROM) for storing W16; each rom contains four twiddle factor values; the second stage includes four base 4 engines and four ROMs for storing W64; each of the ROMs of the second stage consists of 16 twiddle factor values; mode selection is a 6-bit control signal; two bits of the data are distributed to each stage; generating one of 64 outputs according to the pattern of each stage; initially, all mode selections for the base 4 engine are configured as mode 0; the output produced by the first stage is multiplied by the corresponding twiddle factor; the second stage performs similar operations; four base 4 engines are required to process the 16 outputs obtained from the first stage; likewise, the first four outputs are generated by configuring the mode selection of all four base 4 engines as mode 0, keeping the base 4 engines of the first stage themselves in mode 0; using these four outputs, the output required for the third stage is generated, and another mode selection is used in the final stage.
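The three-stage structure of claim 5 can be sketched in software as follows. This is only an illustrative model under stated assumptions: the index split n = 16·n1 + 4·n2 + n3 and the mapping from the three 2-bit modes (k1, k2, k3) to output index k1 + 4·k2 + 16·k3 are assumptions made here and are not taken from the claims, and the names `twiddle`, `radix4_engine` and `radix64_output` are hypothetical. Under that split, each first-stage twiddle table has four W_{16} entries and each second-stage table has 16 W_{64} entries, matching the ROM sizes stated above.

```python
import cmath

def twiddle(n, e):
    """Return W_n^e = exp(-2*pi*j*e/n)."""
    return cmath.exp(-2j * cmath.pi * e / n)

def radix4_engine(x, mode):
    """Radix-4 butterfly that emits only the output selected by a 2-bit mode."""
    return sum(x[n] * twiddle(4, n * mode) for n in range(4))

def radix64_output(x, k1, k2, k3):
    """One output of a 64-point DFT selected by three 2-bit modes.

    Assumed mapping (not from the source): returns X[k1 + 4*k2 + 16*k3].
    """
    # Stage 1: 16 radix-4 engines, all in mode k1, each followed by a W_16
    # twiddle (four possible values per engine, one for each k1).
    stage1 = [[radix4_engine([x[16 * n1 + 4 * n2 + n3] for n1 in range(4)], k1)
               * twiddle(16, n2 * k1)
               for n3 in range(4)]
              for n2 in range(4)]
    # Stage 2: four radix-4 engines, all in mode k2, each followed by a W_64
    # twiddle (16 possible values per engine, one for each pair k1, k2).
    stage2 = [radix4_engine([stage1[n2][n3] for n2 in range(4)], k2)
              * twiddle(64, n3 * (k1 + 4 * k2))
              for n3 in range(4)]
    # Stage 3: a single radix-4 engine in mode k3 produces the selected output.
    return radix4_engine(stage2, k3)
```

Calling `radix64_output(x, 0, 0, 0)` corresponds to the all-engines-in-mode-0 configuration described above and returns the sum of the 64 inputs (the DC term).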
6. The optimized data rearrangement method based on radix-64 two-dimensional FFT architecture of claim 5, wherein: the data scheduling in the 64 × 64 FFT architecture includes the following steps:
consider 64 × 64 input data stored in RAM, given by a matrix A, where A_{i,j} is the element in row i and column j; the data scheduling and reordering process in the first radix-64 block is as follows: starting with the mode selection signal set to 0, the inputs in cycle 1 are A_{1,1}, A_{2,1}, A_{3,1}, ..., A_{64,1}; performing an FFT on these 64 inputs produces, in the first cycle, the first output of the first row of the B matrix, denoted B_{1,1};
in cycle 2, the inputs A_{1,2}, A_{2,2}, A_{3,2}, ..., A_{64,2} are given with the mode selection signal held at 0, and the output generated is B_{1,2}; in a similar manner, in cycle 64, the inputs A_{1,64}, A_{2,64}, A_{3,64}, ..., A_{64,64} with mode selection 0 generate the output B_{1,64}, so that all elements in the first row have been computed; in the 65th cycle, all elements in the first row of the B matrix, namely B_{1,1}, B_{1,2}, ..., B_{1,64}, are input in parallel to the second radix-64 block; each time the first radix-64 block has computed one row, the second radix-64 block performs the next stage of computation on it; an intermediate register between the two radix-64 blocks stores the output of the first block for use by the second block; in the second radix-64 block, the mode selection signal steps from 0 to 63, changing every cycle, producing the 64 outputs C_{1,1}, C_{1,2}, ..., C_{1,64};
a sequentially increasing mode selection signal cannot be used for the first block, because the output of the first radix-64 block would then have to be reordered before being applied to the second block; instead, the mode selection of the first radix-64 block must be given in the order 0, 16, 32, 48, 1, 17, 33, 49, ..., 63;
in the 65th cycle, the mode selection of the first radix-64 block becomes 16 and the inputs A_{1,1}, A_{2,1}, A_{3,1}, ..., A_{64,1} are provided; performing the 64-input FFT, the first output of the second row of the B matrix, B_{2,1}, is computed in the 65th cycle; in the 66th cycle, the second output of the second row of the B matrix, B_{2,2}, is computed, and so on; in the 128th cycle, B_{2,64} is computed; thus all elements in the second row are computed;
in the 129th cycle, the second block starts computing with the second row of the B matrix, stepping the mode selection through 0, 1, 2, 3, ..., 63 over 64 cycles, which yields the second-row elements of the C matrix; this process is repeated, advancing the mode selection of the first block until it reaches 63, to obtain all elements of the 64 × 64 B matrix, which is the output of the first radix-64 block; one B matrix output is generated in each cycle; this completes the column-wise calculation of the two-dimensional FFT.
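The cycle numbers quoted in claim 6 follow a simple pattern, sketched below. The sketch assumes 1-based cycle, row and column numbers as in the text, and the function names are hypothetical. It models only the timing of the two blocks, not their arithmetic, and leaves the first block's mode order as given in the claim.

```python
N = 64  # points per radix-64 block, and rows/columns of the matrices

def first_block_cycle(row, col):
    """Cycle in which the first block emits B[row][col] (1-based indices)."""
    return N * (row - 1) + col

def second_block_cycle(row, col):
    """Cycle in which the second block emits C[row][col].

    The second block runs one full row (N cycles) behind the first block,
    because it needs a complete row of B from the intermediate register.
    """
    return N * row + col

def second_block_mode(cycle):
    """Mode selection of the second block in a given cycle.

    It counts 0..63 within each row; the block is idle before cycle 65.
    """
    if cycle <= N:
        return None
    return (cycle - 1) % N
```

Under this reading, B_{2,1} appears in cycle 65 and C_{1,1} in the same cycle, so the two blocks overlap in time. The last output C_{64,64} would then appear in cycle 4160, that is, 4096 output cycles preceded by a 64-cycle fill of the first row.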
7. The optimized data rearrangement method based on radix-64 two-dimensional FFT architecture of claim 6, wherein: the elements of the C matrix give the final two-dimensional FFT output; the second block generates one output every clock cycle; therefore, 4096 cycles are required to produce all 4096 outputs of the two-dimensional FFT.
8. The optimized data rearrangement method based on radix-64 two-dimensional FFT architecture of claim 6, wherein: the two-dimensional FFT architecture is realized by parallel unrolling of radix-64 blocks, and its output order is controlled by control bits; a one-dimensional FFT is performed with the first radix-64 block, whose output is fed to the second radix-64 block, which performs a row-by-row FFT to obtain the two-dimensional FFT; the first processor performs 64-point FFT operations, giving 4K (4096) intermediate values; the second FFT processor performs 64-point FFT operations on these outputs and gives the final 64 × 64-point FFT; the two radix-64 blocks are identical.
CN202011245309.9A 2020-11-10 2020-11-10 Optimized data rearrangement method of two-dimensional FFT architecture based on base-64 Active CN112328958B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011245309.9A CN112328958B (en) 2020-11-10 2020-11-10 Optimized data rearrangement method of two-dimensional FFT architecture based on base-64

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011245309.9A CN112328958B (en) 2020-11-10 2020-11-10 Optimized data rearrangement method of two-dimensional FFT architecture based on base-64

Publications (2)

Publication Number Publication Date
CN112328958A true CN112328958A (en) 2021-02-05
CN112328958B CN112328958B (en) 2024-06-21

Family

ID=74317874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011245309.9A Active CN112328958B (en) 2020-11-10 2020-11-10 Optimized data rearrangement method of two-dimensional FFT architecture based on base-64

Country Status (1)

Country Link
CN (1) CN112328958B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6073154A (en) * 1998-06-26 2000-06-06 Xilinx, Inc. Computing multidimensional DFTs in FPGA
CN101553808A (en) * 2006-04-04 2009-10-07 高通股份有限公司 Pipeline FFT architecture and method
CN1988402A (en) * 2006-10-10 2007-06-27 东南大学 Method for realizing power line carrier communication system
CN103106180A (en) * 2011-09-09 2013-05-15 德州仪器公司 Constant geometry split radix FFT
CN103699515A (en) * 2013-12-27 2014-04-02 中国科学院计算技术研究所 FFT (fast Fourier transform) parallel processing device and FFT parallel processing method
CN105373367A (en) * 2015-10-29 2016-03-02 中国人民解放军国防科学技术大学 Vector single instruction multiple data-stream (SIMD) operation structure supporting synergistic working of scalar and vector
CN105893326A (en) * 2016-03-29 2016-08-24 西安科技大学 Device and method for realizing 65536 point FFT on basis of FPGA
CN110245322A (en) * 2019-05-09 2019-09-17 华中科技大学 A kind of method and system based on the real-time Hilbert transformation of hardware realization high-speed data-flow
CN110647719A (en) * 2019-09-20 2020-01-03 西安电子科技大学 Three-dimensional FFT (fast Fourier transform) calculation device based on FPGA (field programmable Gate array)

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
周国昌; 张立新: "Research on an 8192-point FFT parallel algorithm based on RCSIMD", 微电子学与计算机 (Microelectronics & Computer), no. 04 *

Also Published As

Publication number Publication date
CN112328958B (en) 2024-06-21

Similar Documents

Publication Publication Date Title
JP3749022B2 (en) Parallel system with fast latency and array processing with short waiting time
US5313413A (en) Apparatus and method for preventing I/O bandwidth limitations in fast fourier transform processors
CN101847986B (en) Circuit and method for realizing FFT/IFFT conversion
US20100128818A1 (en) Fft processor
Wang et al. Scheduling of data access for the radix-2k FFT processor using single-port memory
Richardson et al. Building conflict-free FFT schedules
Chen et al. Energy-efficient architecture for stride permutation on streaming data
Kala et al. High throughput, low latency, memory optimized 64K point FFT architecture using novel radix-4 butterfly unit
US6728742B1 (en) Data storage patterns for fast fourier transforms
CN112328958B (en) Optimized data rearrangement method of two-dimensional FFT architecture based on base-64
Mathew et al. Radix-4³ based two-dimensional FFT architecture with efficient data reordering scheme
Hazarika et al. Low-complexity continuous-flow memory-based FFT architectures for real-valued signals
Hsiao et al. Design of low-cost and high-throughput linear arrays for DFT computations: Algorithms, architectures, and implementations
Jones Design and parallel computation of regularised fast Hartley transform
Liu et al. Efficient large-scale 1D FFT vectorization on multi-core vector accelerator
Dawwd et al. Reduced Area and Low Power Implementation of FFT/IFFT Processor.
Raman et al. Novel bit-reordering circuit for continuous-flow parallel FFT architectures
Guan et al. Design of an application-specific instruction set processor for high-throughput and scalable FFT
Jinhe et al. An efficient implementation of fft based on cgra
Melander et al. An FFT processor based on the SIC architecture with asynchronous PE
Kala et al. Image reconstruction using novel two-dimensional fourier transform
Song et al. An efficient FPGA-based accelerator design for convolution
Kumar et al. FPGA implementation of radix-4-based two-dimensional FFT with and without pipelining using efficient data reordering scheme
US20240020129A1 (en) Self-Ordering Fast Fourier Transform For Single Instruction Multiple Data Engines
Kumar et al. Design and Implementation of AGU based FFT Pipeline Architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant