CN112328958A

CN112328958A - Optimized data rearrangement method based on radix-64 two-dimensional FFT architecture

Info

Publication number: CN112328958A
Application number: CN202011245309.9A
Authority: CN
Inventors: 曹宁; 吴子诚; 冯晔
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2020-11-10
Filing date: 2020-11-10
Publication date: 2021-02-05
Anticipated expiration: 2040-11-10
Also published as: CN112328958B

Abstract

The invention discloses an optimized data rearrangement method based on a radix-64 two-dimensional FFT architecture. It belongs to the technical field of signal processing and provides a new two-dimensional FFT architecture that combines an efficient data rearrangement technique with a radix-64 algorithm. The 64 × 64 two-dimensional FFT architecture is built by cascading two parallel radix-64 blocks. Within the radix-64 structure, data rearrangement is controlled by a six-bit mode selection signal. The proposed radix-64 structure significantly reduces the intermediate memory of the one-dimensional FFT and reduces latency, and the proposed two-dimensional FFT architecture reduces the number of intermediate memory locations between the two one-dimensional FFTs from N² to N. The method is flexible, applies to many scenarios, and is particularly suitable for data reconstruction of original images.

Description

Optimized data rearrangement method based on radix-64 two-dimensional FFT architecture

Technical Field

The invention belongs to the technical field of signal processing, and particularly relates to an optimized data rearrangement method based on a radix-64 two-dimensional FFT architecture.

Background

Over the past few decades, research and applications in the field of signal processing have seen explosive growth. Digital Signal Processing (DSP) has wide applications in the fields of biomedical imaging, multimedia, digital television, broadcasting, and the like. Due to the development of Very Large Scale Integration (VLSI) technology, the implementation of these applications is possible. Developing hardware solutions for these applications has been an active area of research over the last two decades.

The discrete Fourier transform (DFT) is an important component of DSP and communication systems, and the fast Fourier transform (FFT) is the most common fast method of computing it. Two-dimensional FFTs are widely applied to data reconstruction of original images and must be implemented efficiently for real-time scenarios. Image processing applications require large memories to support real-time processing of image data. Therefore, a suitable architecture is needed that optimizes memory and supports images of various sizes while providing the required throughput.

Cooley-Tukey is a common algorithm for computing the FFT because it reduces the complexity from the O(N²) of the direct DFT to O(N log₂ N). For a one-dimensional sequence x(n), the N-point DFT can be calculated by equation (1):

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, 2, \ldots, N-1 \tag{1}$$

where W_N is the twiddle factor expressed by equation (2):

$$W_N = e^{-2\pi i/N} \tag{2}$$

An N-point inverse DFT (IDFT) can be calculated by equation (3):

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, W_N^{-nk}, \qquad n = 0, 1, 2, \ldots, N-1 \tag{3}$$

In addition to the logic used in the DFT, the IDFT requires some other logic, such as division and conjugation operations. The two-dimensional FFT can be calculated from the one-dimensional FFT: an N × N two-dimensional FFT can be calculated by 2N one-dimensional FFTs, where N is the sequence length. For an input x(i₁, i₂) of size N × N, the two-dimensional DFT is calculated by equation (4):

$$X(k_1, k_2) = \sum_{i_1=0}^{N-1} \sum_{i_2=0}^{N-1} x(i_1, i_2)\, W_N^{i_1 k_1}\, W_N^{i_2 k_2} \tag{4}$$

where k₁, k₂ = 0, 1, 2, ..., N-1.

With two one-dimensional DFTs, a two-dimensional DFT can be performed based on a row-column decomposition algorithm, as shown in the following equations:

$$Y(k_1, i_2) = \sum_{i_1=0}^{N-1} x(i_1, i_2)\, W_N^{i_1 k_1}$$

where k₁ = 0, 1, 2, ..., N-1, and

$$X(k_1, k_2) = \sum_{i_2=0}^{N-1} Y(k_1, i_2)\, W_N^{i_2 k_2}$$

where k₂ = 0, 1, 2, ..., N-1.
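The row-column decomposition described above can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the hardware method of the invention: it uses a direct O(N²) DFT in place of an FFT, and the function names `dft`, `dft2_row_column` and `dft2_direct` are hypothetical.

```python
import cmath

def dft(seq):
    """Direct N-point DFT: X(k) = sum_n x(n) * W_N^(n*k), W_N = exp(-2*pi*i/N)."""
    n_pts = len(seq)
    w = [cmath.exp(-2j * cmath.pi * t / n_pts) for t in range(n_pts)]
    return [sum(seq[n] * w[(n * k) % n_pts] for n in range(n_pts))
            for k in range(n_pts)]

def dft2_row_column(x):
    """2-D DFT of an N x N matrix computed with 2N one-dimensional DFTs."""
    n_pts = len(x)
    # Pass 1: transform along the first index i1 (one DFT per value of i2).
    # cols[i2][k1] holds the N*N intermediate values between the two passes.
    cols = [dft([x[i1][i2] for i1 in range(n_pts)]) for i2 in range(n_pts)]
    # Pass 2: transform along the second index i2 (one DFT per value of k1).
    return [dft([cols[i2][k1] for i2 in range(n_pts)]) for k1 in range(n_pts)]

def dft2_direct(x, k1, k2):
    """Reference: one output of the 2-D DFT from the double sum."""
    n_pts = len(x)
    return sum(x[i1][i2] * cmath.exp(-2j * cmath.pi * (i1 * k1 + i2 * k2) / n_pts)
               for i1 in range(n_pts) for i2 in range(n_pts))
```

The list `cols` makes the N² intermediate values explicit; this is the storage that the proposed architecture aims to reduce.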

The decomposition size of the FFT is represented by a radix.

Existing FFT implementations have a variety of hardware and software solutions. Hardware implementation provides better performance and is more suitable for real-time embedded applications. Software solutions, such as general purpose processors and graphics processing units, are power hungry and are not suitable for real-time applications.

For FFT algorithms and architectures, pipelined designs typically incur more area overhead and higher power consumption. Pipelined architectures based on the radix-2 linear decomposition are single-path delay feedback (SDF) or multi-path delay commutator (MDC) architectures. For large FFTs, a memory-based architecture occupies less memory area and consumes less power than a pipelined FFT. Output reordering is a major functional block in the design of FFT architectures. The purpose of the reordering is to convert the non-natural-order FFT output to natural order.

At present, there has been little research on FFT output rearrangement techniques and their complexity, and no mature technique has emerged. In most existing architectures, reordering requires dedicated hardware or a large amount of memory, and it requires a greater number of clock cycles to execute.

Disclosure of Invention

The purpose of the invention is as follows: the invention aims to provide an optimized data rearrangement method based on a radix-64 two-dimensional FFT architecture, which reduces the number of operation execution cycles.

The technical scheme is as follows: in order to achieve the purpose, the invention provides the following technical scheme:

An optimized data rearrangement method based on a radix-64 two-dimensional FFT architecture comprises the following steps:

(1) designing a parallel pipeline architecture realized on an ASIC (application specific integrated circuit) and an FPGA (field programmable gate array) by utilizing the high regularity of an FFT (fast Fourier transform algorithm);

(2) with the parallel pipeline architecture, data reordering is performed using a six-bit mode select signal as a control signal.

Further, in step (1), the parallel pipeline architecture forms a 64 × 64 two-dimensional FFT architecture by cascading two parallel-unrolled radix-64 blocks, and the 64 × 64 FFT architecture is expressed using a novel radix-64 algorithm based on radix-4 butterfly units.

Further, in step (2), in the 64 × 64 two-dimensional FFT architecture, data rearrangement is performed using a six-bit mode selection signal as a control signal.

Further, the two-dimensional FFT is obtained by two one-dimensional N-point FFT calculations; an N × N two-dimensional FFT consists of N one-dimensional FFTs in the row direction and N one-dimensional FFTs in the column direction, so that N² intermediate values are generated between the two one-dimensional FFTs and must be stored.

Further, the method for expressing the 64 × 64 FFT architecture by using the novel radix-64 algorithm based on radix-4 butterfly units specifically includes the following: the fully unrolled radix-64 architecture uses parallel radix-4 butterfly units as basic sub-blocks; each radix-4 butterfly unit has four parallel inputs and a two-bit control input called the mode select; the mode select signal determines which one of the four outputs is generated, so that outputs can be generated in an arbitrary order according to the mode select signal; the equation for the radix-4 butterfly unit used is as follows:

where x is the time-domain signal sequence and X is the frequency-domain signal sequence; the first stage has 16 twiddle-factor read-only memories (ROMs) that store W16; each of these ROMs contains four twiddle factor values; the second stage includes four radix-4 engines and four ROMs that store W64; each ROM of the second stage contains 16 twiddle factor values; the mode selection is a 6-bit control signal, two bits of which are assigned to each stage; one of the 64 outputs is generated according to the mode of each stage; the remaining outputs of the radix-64 block are thus obtained from the architecture directly in reordered form, saving the memory and logic resources otherwise needed for reordering; initially, the mode selections of all radix-4 engines are configured as mode 0; the output produced by the first stage is multiplied by the corresponding twiddle factor; the second stage performs similar operations; here, four radix-4 engines are required to process the 16 outputs obtained from the first stage; likewise, the first four outputs are generated by configuring the mode selection of all four of these radix-4 engines as mode 0 while keeping the radix-4 engines of the first stage in mode 0; using these four outputs, the output required from the third stage is generated, and a further mode selection is applied in this final stage.

Further, the data scheduling in the 64 × 64 FFT architecture includes the following steps:

consider input data of size 64 × 64 stored in RAM and given by matrix A, where A_{i,j} is the element in row i and column j; the data scheduling and reordering process in the first radix-64 block is as follows: starting with the mode select signal set to 0, the inputs in cycle 1 are A_{1,1}, A_{2,1}, A_{3,1}, ..., A_{64,1}; performing the FFT of these 64 inputs produces, in the first cycle, the first output of the first row of the B matrix, denoted B_{1,1};

in cycle 2, the inputs A_{1,2}, A_{2,2}, A_{3,2}, ..., A_{64,2} are given with the mode select signal held at 0, and the output generated is B_{1,2}; in a similar manner, in cycle 64, the inputs A_{1,64}, A_{2,64}, A_{3,64}, ..., A_{64,64} with mode select 0 generate the output B_{1,64}; all elements of the first row are thus calculated; in the 65th cycle, all elements of the first row of the B matrix, namely B_{1,1}, B_{1,2}, ..., B_{1,64}, are input to the second radix-64 block in parallel; as soon as one row has been calculated by the first radix-64 block, the second radix-64 block starts the next stage of the calculation; an intermediate register between the two radix-64 blocks stores the output of the first block for use by the second block; in the second radix-64 block, the mode select signal steps from 0 to 63, one value per cycle, producing the 64 outputs C_{1,1}, C_{1,2}, ..., C_{1,64};

a sequentially increasing mode select signal cannot be used for the first block, because the output of the first radix-64 block would then have to be reordered before being applied to the second block; the mode selection of the first radix-64 block must therefore be given in the order 0, 16, 32, 48, 1, 17, 33, 49, ..., 63;

in the 65th cycle, the mode selection of the first radix-64 block becomes 16 and the inputs A_{1,1}, A_{2,1}, A_{3,1}, ..., A_{64,1} are provided; performing the FFT of these 64 inputs, the first output of the second row of the B matrix, B_{2,1}, is calculated in the 65th cycle; in the 66th cycle, the second output of the second row of the B matrix, B_{2,2}, is calculated, and so on; in the 128th cycle, B_{2,64} is calculated; thus, all elements of the second row are computed;

in the 129th cycle, the second block starts computing with the second row of the B matrix, stepping its mode selection through 0, 1, 2, 3, ..., 63 over 64 cycles to produce the second-row elements of the C matrix; by advancing the mode selection of the first block through the remaining values up to 63 and repeating this process, all elements of the 64 × 64 B matrix, which is the output of the first radix-64 block, are obtained; one B matrix output is generated in each cycle; this completes the column-wise calculation of the two-dimensional FFT.

Further, the elements of the C matrix give the final two-dimensional FFT output; the second block generates one output per clock cycle; therefore, 4096 cycles are required to produce all 4096 outputs of the two-dimensional FFT.

Further, the two-dimensional FFT architecture is realized with parallel-unrolled radix-64 blocks, and its output order is controlled by several control bits; a one-dimensional FFT is performed with the first radix-64 block, whose output is fed to the second radix-64 block to perform a row-by-row FFT and obtain the two-dimensional FFT; the first processor performs 64-point FFT operations, giving 4K (4096) intermediate values; the second FFT processor performs 64-point FFT operations on these outputs and gives the final 64 × 64-point FFT; the two radix-64 blocks are identical.

Advantageous effects: compared with the prior art, the optimized data rearrangement method based on the radix-64 two-dimensional FFT architecture provided by the invention has an efficient output rearrangement technique and uses parallel radix-4 butterfly units controlled by a 6-bit control signal; the working memory of the one-dimensional FFT is reduced, and the number of intermediate memory units of the two-dimensional FFT is reduced from N² to N; an architecture suitable for ASIC and FPGA implementation is proposed; and the number of operation execution cycles is reduced.

Drawings

FIG. 1 shows the parallel-unrolled structure of the radix-64 block;

FIG. 2 shows the data schedule of the first radix-64 block;

FIG. 3 shows the data schedule of the second radix-64 block;

FIG. 4 is a proposed two-dimensional FFT architecture using radix-64 blocks;

FIG. 5 is a comparison of the time consumption of the proposed architecture with that of the existing architecture;

FIG. 6 is a comparison of the hardware complexity of the proposed radix-64 line-parallel architecture with the existing radix-2 line architecture.

Detailed Description

The invention will be further described with reference to the following drawings and specific embodiments.

An optimized data rearrangement method based on a radix-64 two-dimensional FFT architecture is suitable for implementing a two-dimensional FFT on an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA). By exploiting the high regularity of the FFT algorithm, a parallel pipeline architecture that can be realized on an ASIC and an FPGA is designed. The two-dimensional FFT is calculated by two one-dimensional N-point FFTs, so the performance of the one-dimensional FFT directly affects the performance of the two-dimensional FFT. An N × N two-dimensional FFT can be viewed as N one-dimensional FFTs in the row direction and N one-dimensional FFTs in the column direction, so that N² intermediate values are generated between the two one-dimensional FFTs and must be stored.

In the present invention, the proposed 64 × 64 FFT architecture is expressed using a novel radix-64 algorithm based on radix-4 butterfly units. Two radix-64 blocks are cascaded to calculate a 64 × 64 complex-point FFT, and the output is generated in reordered form by the radix-64 blocks themselves, thereby saving intermediate memory and reducing delay. In step (1), the two-dimensional FFT is implemented using a radix-64 parallel-unrolled architecture.

Proposed radix-64 architecture: the fully unrolled radix-64 architecture uses parallel radix-4 butterfly units as the basic sub-blocks of the proposed architecture. Each radix-4 unit has four parallel inputs and a two-bit control input called the mode select. The mode select signal determines which one of the four outputs is generated, so the outputs can be generated in any order according to the mode select signal. In a conventional radix-4 butterfly unit, by contrast, the outputs are generated in a fixed order. The equation for the radix-4 butterfly unit used is as follows:

where x is the time-domain signal sequence and X is the frequency-domain signal sequence. The first stage has 16 twiddle-factor read-only memories (ROMs) that store W16, and each of these ROMs contains four twiddle factor values. The second stage includes four radix-4 engines and four ROMs that store W64, and each second-stage ROM contains 16 twiddle factor values. The mode selection is a 6-bit control signal, two bits of which are assigned to each stage, so that one of the 64 outputs is generated according to the mode of each stage. The remaining outputs of the radix-64 block are thus obtained from this architecture directly in reordered form, which saves the memory and logic resources otherwise needed for reordering. Initially, the mode selections of all radix-4 engines are configured as mode 0. The output produced by the first stage is multiplied by the corresponding twiddle factor. The second stage performs similar operations: four radix-4 engines are required to process the 16 outputs obtained from the first stage, and the first four outputs are generated by configuring all four of these engines in mode 0 while the first-stage engines remain in mode 0. Using these four outputs, the output required from the third stage can be generated, and a further mode selection is applied in this final stage.
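The mode-selected radix-4 unit can be sketched as follows. The butterfly equation itself is not reproduced in this text, so the sketch assumes the unit computes a standard 4-point DFT and that the two-bit mode is simply the index k of the output X(k); the names `W4` and `radix4_output` are hypothetical.

```python
# Assumption: the radix-4 unit is a standard 4-point DFT, X(k) = sum_n x(n) * (-j)^(n*k).
# Multiplying by (-j)^t needs no real multiplier, only swaps and sign changes.
W4 = (1, -1j, -1, 1j)

def radix4_output(inputs, mode):
    """Return the single 4-point DFT output selected by a 2-bit mode (0..3)."""
    if not 0 <= mode <= 3:
        raise ValueError("mode must be a 2-bit value")
    if len(inputs) != 4:
        raise ValueError("a radix-4 unit takes exactly four inputs")
    return sum(v * W4[(n * mode) % 4] for n, v in enumerate(inputs))
```

Because each call produces only the requested output, the caller is free to request the four outputs in any order, which is the property the text relies on for reordering.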

Data scheduling in the proposed structure: consider input data of size 64 × 64 stored in RAM and given by matrix A, where A_{i,j} is the element in the ith row and jth column.

The data scheduling and reordering process in the first radix-64 block is as follows. Starting with the mode select signal set to 0, the inputs in cycle 1 are A_{1,1}, A_{2,1}, A_{3,1}, ..., A_{64,1}. Performing the FFT of these 64 inputs produces, in the first cycle, the first output of the first row of the B matrix, denoted B_{1,1}. In cycle 2, the inputs A_{1,2}, A_{2,2}, A_{3,2}, ..., A_{64,2} are given with the mode select signal held at 0, and the output generated is B_{1,2}. In a similar manner, in cycle 64, the inputs A_{1,64}, A_{2,64}, A_{3,64}, ..., A_{64,64} with mode select 0 generate the output B_{1,64}. Thus, all elements of the first row are calculated. In the 65th cycle, all elements of the first row of the B matrix, namely B_{1,1}, B_{1,2}, ..., B_{1,64}, are input to the second radix-64 block in parallel. The second radix-64 block thus starts the next stage of the calculation as soon as one row has been calculated by the first radix-64 block. An intermediate register between the two radix-64 blocks stores the output of the first block for use by the second block. In the second radix-64 block, the mode select signal steps from 0 to 63, one value per cycle, producing the 64 outputs C_{1,1}, C_{1,2}, ..., C_{1,64}.

A sequentially increasing mode select signal cannot be used for the first block, because the output of the first radix-64 block would then have to be reordered before being applied to the second block. The mode selection of the first radix-64 block must therefore be given in the order 0, 16, 32, 48, 1, 17, 33, 49, ..., 63.
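The mode order for the first block can be generated with a one-line rule. The text lists only the terms 0, 16, 32, 48, 1, 17, 33, 49 and the final value 63; the sketch below assumes the pattern continues in the same way (2, 18, 34, 50, and so on), and the function name is hypothetical.

```python
def first_block_mode_order():
    """Mode values for the first radix-64 block, one per 64-cycle pass.

    Assumed pattern: the two most significant mode bits step fastest
    (0, 16, 32, 48), then the lower four bits advance (1, 17, 33, 49, ...).
    """
    return [16 * (p % 4) + p // 4 for p in range(64)]
```

Under this assumption the sequence visits each of the 64 mode values exactly once, so every row of the B matrix is produced once.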

In the 65th cycle, the mode selection of the first radix-64 block becomes 16 and the inputs A_{1,1}, A_{2,1}, A_{3,1}, ..., A_{64,1} are provided. Performing the FFT of these 64 inputs, the first output of the second row of the B matrix, B_{2,1}, is calculated in the 65th cycle. In the 66th cycle, the second output of the second row of the B matrix, B_{2,2}, is calculated, and so on. In the 128th cycle, B_{2,64} is calculated. Thus, all elements of the second row are calculated.

In the 129th cycle, the second block starts computing with the second row of the B matrix, stepping its mode selection through 0, 1, 2, 3, ..., 63 over 64 cycles to produce the second-row elements of the C matrix. By advancing the mode selection of the first block through the remaining values up to 63 and repeating this process, all elements of the 64 × 64 B matrix, which is the output of the first radix-64 block, are obtained. One B matrix output is generated in each cycle. This completes the column-wise calculation of the two-dimensional FFT.

The elements in the C matrix give the final two-dimensional FFT output. One output is generated from the second block every clock cycle. Therefore, 4096 cycles are required to produce all 4096 outputs of the two-dimensional FFT.
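The schedule described above can be checked with a small behavioural model. This is a sketch under stated assumptions rather than the hardware itself: each radix-64 block is replaced by a stand-in that returns the 64-point DFT output selected by the mode, the mode order of the first block is assumed to continue the listed pattern, and the names `block_output`, `run_schedule` and `MODE_ORDER` are hypothetical.

```python
import cmath

N = 64
W = [cmath.exp(-2j * cmath.pi * t / N) for t in range(N)]
# Assumed continuation of the order 0, 16, 32, 48, 1, 17, 33, 49, ..., 63.
MODE_ORDER = [16 * (p % 4) + p // 4 for p in range(N)]

def block_output(inputs, mode):
    """Stand-in for a radix-64 block: the 64-point DFT output chosen by mode."""
    return sum(v * W[(n * mode) % N] for n, v in enumerate(inputs))

def run_schedule(a):
    """Model the two cascaded blocks and return the C matrix as a list of rows."""
    c_rows = []
    for mode in MODE_ORDER:                  # one 64-cycle pass per mode value
        shift_regs = []                      # only N intermediate values are held
        for j in range(N):                   # one cycle: column j of A, one B output
            shift_regs.append(block_output([a[i][j] for i in range(N)], mode))
        # After 64 cycles the row of B is loaded in parallel into the second
        # block, which steps its own mode through 0..63.
        c_rows.append([block_output(shift_regs, m) for m in range(N)])
    return c_rows
```

In this model, row r of C holds the two-dimensional DFT outputs whose first frequency index equals the r-th mode value of the first block, and only 64 intermediate values are stored at any time.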

Two-dimensional FFT architecture: a parallel-unrolled implementation is used for the radix-64 block. The order of the outputs is controlled by several control bits. In this architecture, for a given set of inputs, each stage calculates only the outputs required by the next stage, without any loss of performance, so most intermediate buffers can be avoided. Memory and latency are therefore greatly reduced.

A one-dimensional FFT is performed with the first radix-64 block, the output of which is fed to the second radix-64 block to perform a row-by-row FFT and obtain the two-dimensional FFT. The first processor performs 64-point FFT operations, giving 4K (4096) intermediate values. The second FFT processor performs 64-point FFT operations on these outputs and gives the final 64 × 64-point FFT. The two radix-64 blocks are identical.

Examples

FIG. 1 shows the parallel-unrolled architecture of the radix-64 block. The first stage has 16 radix-4 units, the second stage has 4 radix-4 units, and the third stage has 1 radix-4 unit. All radix-4 units in this architecture are identical. The symbols R4_0, R4_4, R4_8, R4_12 denote the 0th, 4th, 8th and 12th radix-4 butterflies, and so on. W16 and W64 denote the twiddle factors of the first and second stages. The first stage has 16 twiddle-factor read-only memories (ROMs) that store W16, and each of these ROMs contains four twiddle factor values. The second stage includes four radix-4 units and four ROMs that store W64, and each second-stage ROM contains 16 twiddle factor values.

Of the N multipliers in each stage, the first N/4 have the same twiddle factor, so these multipliers are removed in the implementation. Thus, the first stage has 12 multipliers instead of 16, and the second stage has 3 multipliers instead of 4.
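The multiplier counts can be reproduced from the ROM contents under one plausible index mapping. The text does not give the mapping, so the sketch assumes the input index is split as n = 16·n1 + 4·n2 + n3 and the mode as k = k1 + 4·k2 + 16·k3, which yields ROMs of exactly the stated sizes; the names below are hypothetical.

```python
import cmath

def twiddle(n_pts, t):
    """W_N^t = exp(-2*pi*i*t/N)."""
    return cmath.exp(-2j * cmath.pi * t / n_pts)

# Assumed stage-1 ROMs: one per radix-4 unit (n2, n3), holding W_16^(n2*k1), k1 = 0..3.
stage1_roms = [[twiddle(16, n2 * k1) for k1 in range(4)]
               for n2 in range(4) for _n3 in range(4)]
# Assumed stage-2 ROMs: one per radix-4 unit n3, holding W_64^(n3*m), m = k1 + 4*k2 = 0..15.
stage2_roms = [[twiddle(64, n3 * m) for m in range(16)] for n3 in range(4)]

def nontrivial(roms):
    """Count ROMs holding at least one twiddle factor different from 1."""
    return sum(any(abs(w - 1) > 1e-12 for w in rom) for rom in roms)
```

With this mapping the first four stage-1 ROMs and the first stage-2 ROM contain only the value 1, so their multipliers can be dropped, leaving 12 and 3 multipliers respectively.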

The mode selection is a 6-bit control signal, two bits of which are assigned to each stage, so that one of the 64 outputs is generated according to the mode of each stage. Obtaining the remaining outputs of the radix-64 block directly in reordered form in this way saves the memory and logic resources otherwise needed for reordering. Initially, the mode selections of all radix-4 units are configured as mode 0. The output produced by the first stage is multiplied by the corresponding twiddle factor. The second stage performs similar operations: four radix-4 units are required to process the 16 outputs obtained from the first stage, and the first four outputs are generated by configuring all four of these units in mode 0 while the first-stage units remain in mode 0. Using these four outputs, the output required from the third stage can be generated, and a further mode selection is applied in this final stage.
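The three-stage operation can be sketched in software. The assignment of mode bits to stages is not stated in the text, so this sketch assumes the two least significant bits drive the first stage and the two most significant bits the third, with the input index split as n = 16·n1 + 4·n2 + n3; the radix-4 unit is assumed to be a standard 4-point DFT, and all names are hypothetical.

```python
import cmath

W4 = (1, -1j, -1, 1j)  # powers of W_4 = -j

def radix4_output(inputs, mode):
    """One output of a 4-point DFT, selected by a 2-bit mode."""
    return sum(v * W4[(n * mode) % 4] for n, v in enumerate(inputs))

def twiddle(n_pts, t):
    """W_N^t = exp(-2*pi*i*t/N)."""
    return cmath.exp(-2j * cmath.pi * t / n_pts)

def radix64_output(x, mode):
    """Return the single 64-point DFT output X(mode) for a 6-bit mode."""
    k1, k2, k3 = mode & 3, (mode >> 2) & 3, (mode >> 4) & 3  # assumed bit split
    # Stage 1: 16 radix-4 units, all in mode k1, each followed by a W_16 twiddle.
    s1 = [[radix4_output([x[16 * n1 + 4 * n2 + n3] for n1 in range(4)], k1)
           * twiddle(16, n2 * k1) for n3 in range(4)] for n2 in range(4)]
    # Stage 2: 4 radix-4 units in mode k2, each followed by a W_64 twiddle.
    s2 = [radix4_output([s1[n2][n3] for n2 in range(4)], k2)
          * twiddle(64, n3 * (k1 + 4 * k2)) for n3 in range(4)]
    # Stage 3: a single radix-4 unit in mode k3 produces the requested output.
    return radix4_output(s2, k3)
```

Only 16 + 4 + 1 values are computed per requested output, which matches the description that each stage produces just what the next stage needs.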

FIGS. 2 and 3 show the data scheduling of the radix-64 blocks, where a is the mode 0 output, b is the mode 16 output, and c is the mode 63 output. The data scheduling and reordering process in the first radix-64 block is as follows. Starting with the mode select signal set to 0, the inputs in cycle 1 are A_{1,1}, A_{2,1}, A_{3,1}, ..., A_{64,1}. Performing the FFT of these 64 inputs produces, in the first cycle, the first output of the first row of the B matrix, denoted B_{1,1}. In cycle 2, the inputs A_{1,2}, A_{2,2}, A_{3,2}, ..., A_{64,2} are given with the mode select signal held at 0, and the resulting output is B_{1,2}. In a similar manner, in cycle 64, the inputs A_{1,64}, A_{2,64}, A_{3,64}, ..., A_{64,64} with mode select 0 generate the output B_{1,64}. Thus, all elements of the first row are calculated. In the 65th cycle, all elements of the first row of the B matrix, namely B_{1,1}, B_{1,2}, ..., B_{1,64}, are input to the second radix-64 block in parallel. The second radix-64 block thus starts the next stage of the calculation as soon as one row has been calculated by the first radix-64 block. An intermediate register between the two radix-64 blocks stores the output of the first block for use by the second block. In the second radix-64 block, the mode select signal steps from 0 to 63, one value per cycle, producing the 64 outputs C_{1,1}, C_{1,2}, ..., C_{1,64}.

Fig. 4 shows the proposed two-dimensional FFT architecture using radix-64 blocks. The order of the outputs is controlled by several control bits. In this architecture, for a given set of inputs, at each stage, only the outputs required for the next stage are calculated without any loss of performance, so most intermediate buffers can be avoided, with great optimization in terms of memory and latency, as shown by comparison of fig. 5 and 6.

A one-dimensional FFT is performed with the radix-64 block shown in FIG. 1, the output of which is fed to a second radix-64 block to perform a row-by-row FFT, resulting in the two-dimensional FFT. The first processor performs 64-point FFT operations, giving 4K (4096) intermediate values. The second FFT processor performs 64-point FFT operations on these outputs and gives the final 64 × 64-point FFT. The two radix-64 blocks are identical.

Input buffering: the input memory block consists of two sets of 64 RAMs; while one set is being written with incoming data, the other set is being read. The inputs are written to consecutive locations in RAM, i.e., RAM0 receives the first 64 inputs, RAM1 receives the next 64 inputs starting from the 65th, and so on. During a read operation, each RAM provides one input, so that 64 inputs are available to the radix-64 block. The read operation is done in parallel, and the read address is the same for all RAMs.
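The addressing of one set of input RAMs can be modelled in a few lines. This is a minimal sketch, assuming the samples arrive in row-major order (row 1 of A first); the names `write_location` and `parallel_read` are hypothetical.

```python
NUM_RAMS = 64   # one RAM per row of the 64 x 64 input
DEPTH = 64      # 64 words per RAM

def write_location(sample_index):
    """Map the s-th streamed input (0-based) to (RAM number, address)."""
    if not 0 <= sample_index < NUM_RAMS * DEPTH:
        raise ValueError("sample index out of range for one 64 x 64 frame")
    return sample_index // DEPTH, sample_index % DEPTH

def parallel_read(rams, address):
    """Read one word from every RAM at the same address (64 values per cycle)."""
    return [ram[address] for ram in rams]
```

With row-major input, reading address j from all RAMs returns column j of the matrix, which is the set of 64 inputs the first radix-64 block consumes in one cycle.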

Inter-stage processing: the inter-stage processing consists of two sets of 64 registers each. The first set of registers is arranged as a chain of 64 shift registers, and the output of the first radix-64 block is shifted into this first set. The first set of registers advances in each clock cycle. Once every 64 cycles, the outputs of all 64 registers in the first set are loaded in parallel into the second set. The registers in the second set are used as inputs to the second radix-64 block.
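The two register sets behave like a serial-in, parallel-out buffer, which can be sketched as follows. This is a behavioural model only; the class name `InterStage` and its attributes are hypothetical.

```python
class InterStage:
    """Behavioural model of the two 64-register sets between the radix-64 blocks."""

    def __init__(self, width=64):
        self.width = width
        self.shift = []      # first set: shift chain fed by the first block
        self.hold = None     # second set: parallel inputs of the second block

    def clock(self, value):
        """Shift in one first-block output; return the current second-set contents."""
        self.shift.append(value)
        if len(self.shift) == self.width:
            # Once every `width` cycles, load the whole chain in parallel.
            self.hold, self.shift = self.shift, []
        return self.hold
```

The second set keeps its contents stable for 64 cycles while the first set refills, so only 2 × 64 registers are needed instead of a full 64 × 64 intermediate memory.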

Control circuit: the control circuit consists of a 12-bit up counter. Each input RAM requires 6 bits for addressing, and the proposed architecture contains two such memory blocks. The read addresses, write addresses and chip select signals are generated from the counter, as are the mode select signals of the two radix-64 blocks. During writing, a separate chip select signal is generated for each RAM. During reading, the same location of all RAMs is accessed in parallel, and the control signals are generated accordingly. All these signals are synchronized to account for the delay of the preceding stage.
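One way the control signals could be derived from the 12-bit counter is sketched below. The text does not specify how the counter bits are sliced, so the split into an upper and a lower 6-bit field, the assumed continuation of the first-block mode order, and the signal names are all assumptions made for illustration; pipeline delays between stages are ignored.

```python
def decode_counter(count):
    """Derive illustrative control signals from a 12-bit up counter value."""
    count &= 0xFFF                           # 12-bit counter wraps at 4096
    upper, lower = count >> 6, count & 0x3F  # assumed split into two 6-bit fields
    return {
        "write_chip_select": upper,          # which of the 64 RAMs is written
        "write_address": lower,              # 6-bit address inside that RAM
        "read_address": lower,               # same address for all RAMs on read
        # Assumed first-block mode order 0, 16, 32, 48, 1, 17, ..., 63.
        "block1_mode": 16 * (upper % 4) + upper // 4,
        "block2_mode": lower,                # steps 0..63 once per row of B
    }
```

For example, counter value 64 corresponds to the 65th cycle, where this decoding gives a first-block mode of 16, in line with the schedule described earlier.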

Continuous flow FFT: the proposed FFT employs a new type of data scheduling mechanism that supports continuous streaming data. Here, the butterfly unit continuously performs calculations on the stream data. The FFT processor receives one input sample per clock cycle.

Claims

1. A data rearrangement optimizing method based on a base-64 two-dimensional FFT architecture is characterized in that: the method comprises the following steps:

2. The optimized data rearrangement method based on radix-64 two-dimensional FFT architecture of claim 1, wherein: in step (1), the parallel pipeline architecture is a 64 × 64 two-dimensional FFT architecture formed by cascading two parallel-unrolled radix-64 blocks, and the 64 × 64 FFT architecture is expressed by a radix-64 algorithm based on radix-4 butterfly units.

3. The optimized data rearrangement method based on radix-64 two-dimensional FFT architecture of claim 2, wherein: in step (2), in the 64 × 64 two-dimensional FFT architecture, data rearrangement is performed using a six-bit mode selection signal as a control signal.

4. The optimized data rearrangement method based on radix-64 two-dimensional FFT architecture of claim 3, wherein: the two-dimensional FFT is obtained by two one-dimensional N-point FFT calculations; an N × N two-dimensional FFT consists of N one-dimensional FFTs in the row direction and N one-dimensional FFTs in the column direction, so that N² intermediate values are generated between the two one-dimensional FFTs and must be stored.

5. The optimized data rearrangement method based on radix-64 two-dimensional FFT architecture of claim 3, wherein: the method for expressing the 64 × 64 FFT architecture by using the novel radix-64 algorithm based on radix-4 butterfly units specifically comprises the following: the fully unrolled radix-64 architecture uses parallel radix-4 butterfly units as basic sub-blocks; each radix-4 butterfly unit has four parallel inputs and a two-bit control input called the mode select; the mode select signal determines which one of the four outputs is generated, so that outputs can be generated in an arbitrary order according to the mode select signal;

the first stage has 16 twiddle-factor read-only memories (ROMs) that store W16; each of these ROMs contains four twiddle factor values; the second stage includes four radix-4 engines and four ROMs that store W64; each ROM of the second stage contains 16 twiddle factor values; the mode selection is a 6-bit control signal, two bits of which are assigned to each stage; one of the 64 outputs is generated according to the mode of each stage; initially, the mode selections of all radix-4 engines are configured as mode 0; the output produced by the first stage is multiplied by the corresponding twiddle factor; the second stage performs similar operations; four radix-4 engines are required to process the 16 outputs obtained from the first stage; likewise, the first four outputs are generated by configuring the mode selection of all four of these radix-4 engines as mode 0 while keeping the radix-4 engines of the first stage in mode 0; using these four outputs, the output required from the third stage is generated, and a further mode selection is applied in this final stage.

6. The optimized data rearrangement method based on radix-64 two-dimensional FFT architecture of claim 5, wherein: the data scheduling in the 64 × 64 FFT architecture includes the following steps:

in the 65th cycle, the mode selection of the first radix-64 block becomes 16 and the inputs A_{1,1}, A_{2,1}, A_{3,1}, ..., A_{64,1} are provided; performing the FFT of these 64 inputs, the first output of the second row of the B matrix, B_{2,1}, is calculated in the 65th cycle; in the 66th cycle, the second output of the second row of the B matrix, B_{2,2}, is calculated, and so on; in the 128th cycle, B_{2,64} is calculated; thus, all elements of the second row are computed;

7. The optimized data rearrangement method based on radix-64 two-dimensional FFT architecture of claim 6, wherein: in the two-dimensional FFT output, elements in a C matrix give a final two-dimensional FFT output; generating an output from the second block each clock cycle; therefore, 4096 cycles are required to produce all 4096 outputs of the two-dimensional FFT.

8. The optimized data rearrangement method based on radix-64 two-dimensional FFT architecture of claim 6, wherein: the two-dimensional FFT architecture is realized with parallel-unrolled radix-64 blocks, and its output order is controlled by control bits; a one-dimensional FFT is performed with the first radix-64 block, whose output is fed to the second radix-64 block to perform a row-by-row FFT and obtain the two-dimensional FFT; the first processor performs 64-point FFT operations, giving 4K (4096) intermediate values; the second FFT processor performs 64-point FFT operations on these outputs and gives the final 64 × 64-point FFT; the two radix-64 blocks are identical.