CN112328958A - Optimized data rearrangement method based on base-64 two-dimensional FFT architecture - Google Patents
Optimized data rearrangement method based on base-64 two-dimensional FFT architecture
- Publication number
- CN112328958A (application number CN202011245309.9A)
- Authority
- CN
- China
- Prior art keywords
- base
- architecture
- block
- output
- fft
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/14—Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
- G06F17/141—Discrete Fourier transforms
- G06F17/142—Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
Landscapes
- Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Discrete Mathematics (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Complex Calculations (AREA)
Abstract
The invention discloses an optimized data rearrangement method based on a radix-64 two-dimensional FFT architecture. It belongs to the technical field of signal processing and provides a new two-dimensional FFT architecture that combines an efficient data rearrangement technique with a radix-64 algorithm. The 64 x 64 two-dimensional FFT architecture is built by cascading two parallel radix-64 blocks. In the radix-64 structure, data rearrangement is performed using a six-bit mode selection signal as the control signal. The proposed radix-64 structure significantly reduces the intermediate memory in the one-dimensional FFT and reduces the delay, and the proposed two-dimensional FFT architecture reduces the number of intermediate memories between the two one-dimensional FFTs from N^2 to N. The method is flexible, can be applied in many scenarios, and is particularly suitable for data reconstruction of original images.
Description
Technical Field
The invention belongs to the technical field of signal processing, and particularly relates to an optimized data rearrangement method based on a base-64 two-dimensional FFT architecture.
Background
Over the past few decades, research and applications in the field of signal processing have seen explosive growth. Digital Signal Processing (DSP) has wide applications in the fields of biomedical imaging, multimedia, digital television, broadcasting, and the like. Due to the development of Very Large Scale Integration (VLSI) technology, the implementation of these applications is possible. Developing hardware solutions for these applications has been an active area of research over the last two decades.
The Discrete Fourier Transform (DFT) is an important component of DSP and communication systems. The Fast Fourier Transform (FFT) is the most common fast method of computing the discrete Fourier transform. Two-dimensional FFTs are widely applied to data reconstruction of original images and must be implemented efficiently enough for real-time use. Image processing applications require large memories to support real-time processing of image data. Therefore, a suitable architecture is needed to optimize memory and support images of various sizes while providing the required throughput.
Cooley-Tukey is a common algorithm for computing the FFT because it reduces the complexity from O(N^2) for the direct DFT to O(N log2 N). For a one-dimensional N-point sequence x(n), the DFT can be calculated by equation (1):
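Equation (1) is the standard N-point DFT. Written with the twiddle factor of equation (2), and assuming the usual convention of a time index n and a frequency index k (the index names are an assumption, since the original formula is not reproduced in this text), it would read:

```latex
X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1 \tag{1}
```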
where W_N is the twiddle factor given by equation (2):
W_N = e^{-2πi/N}    (2)
an N-point Inverse DFT (IDFT) can be calculated as equation (3):
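Assuming the same index names as for the forward transform (an assumption, as the original formula is not reproduced in this text), the standard N-point inverse DFT that equation (3) refers to can be written as:

```latex
x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, W_N^{-nk}, \qquad n = 0, 1, \ldots, N-1 \tag{3}
```

The factor 1/N and the negated twiddle exponent are the division and conjugation operations that the inverse transform needs beyond the forward logic.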
In addition to the logic used for the DFT, the IDFT requires some further logic, such as division and conjugation operations. The two-dimensional FFT can be calculated from the one-dimensional FFT. An N × N two-dimensional FFT can be calculated with 2N one-dimensional FFTs, where N is the sequence length. For an input x(i_1, i_2) of size N × N, the two-dimensional DFT is calculated as equation (4):
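A standard form of the N × N two-dimensional DFT, using the input x(i_1, i_2) and the output indices k_1, k_2 named in the text (the output symbol X is an assumption), is:

```latex
X(k_1, k_2) = \sum_{i_1=0}^{N-1} \sum_{i_2=0}^{N-1} x(i_1, i_2)\, W_N^{i_1 k_1}\, W_N^{i_2 k_2} \tag{4}
```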
where k_1, k_2 = 0, 1, 2, ..., N-1.
With two one-dimensional DFTs, a two-dimensional DFT can be performed based on a row-column decomposition algorithm as shown in the following equations:
where k_1 = 0, 1, 2, ..., N-1
where k_2 = 0, 1, 2, ..., N-1
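The row-column decomposition can be checked numerically. The following is a minimal pure-Python sketch, not the patent's implementation: the function names are hypothetical, and it assumes the first pass transforms along i_1 and the second along i_2. It shows that two passes of one-dimensional DFTs reproduce the direct double sum, and that N × N intermediate values exist between the passes.

```python
import cmath

def dft(seq):
    """Direct N-point DFT of a list of complex numbers, O(N^2)."""
    n = len(seq)
    w = cmath.exp(-2j * cmath.pi / n)  # twiddle factor W_N
    return [sum(seq[i] * w ** (i * k) for i in range(n)) for k in range(n)]

def dft2_row_column(x):
    """2-D DFT of a square matrix via two passes of 1-D DFTs."""
    n = len(x)
    # Pass 1: transform along the first index i1, once per value of i2.
    # inter[i2][k1] is the intermediate value for (k1, i2): N*N values.
    inter = [dft([x[i1][i2] for i1 in range(n)]) for i2 in range(n)]
    # Pass 2: transform along the second index i2, once per value of k1.
    return [dft([inter[i2][k1] for i2 in range(n)]) for k1 in range(n)]

def dft2_direct(x):
    """2-D DFT evaluated directly from the double sum, for comparison."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [[sum(x[i1][i2] * w ** (i1 * k1 + i2 * k2)
                 for i1 in range(n) for i2 in range(n))
             for k2 in range(n)] for k1 in range(n)]
```

Both functions return a matrix indexed as result[k1][k2]; on small inputs the two agree to within floating-point rounding.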
The decomposition size of the FFT is represented by a radix.
Existing FFT implementations have a variety of hardware and software solutions. Hardware implementation provides better performance and is more suitable for real-time embedded applications. Software solutions, such as general purpose processors and graphics processing units, are power hungry and are not suitable for real-time applications.
Pipelined architectures typically incur more area overhead and higher power consumption. Pipelined architectures based on radix-2 decomposition are of the single-path delay feedback (SDF) or multi-path delay commutator (MDC) type. For large FFTs, a memory-based architecture occupies less memory area and consumes less power than a pipelined FFT. Output reordering is a major functional block in the design of FFT architectures. The purpose of the reordering is to convert the FFT output from non-natural order to natural order.
At present, there is little research on FFT output reordering techniques and their complexity, and no mature technique has emerged. In most existing architectures, reordering requires dedicated hardware or a large amount of memory, and takes a larger number of clock cycles to execute.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to provide an optimized data rearrangement method based on a base-64 two-dimensional FFT architecture, which reduces the number of operation execution cycles.
The technical scheme is as follows: in order to achieve the purpose, the invention provides the following technical scheme:
a data rearrangement optimizing method based on a base-64 two-dimensional FFT architecture comprises the following steps:
(1) designing a parallel pipeline architecture realized on an ASIC (application specific integrated circuit) and an FPGA (field programmable gate array) by utilizing the high regularity of an FFT (fast Fourier transform algorithm);
(2) with the parallel pipeline architecture, data reordering is performed using a six-bit mode select signal as a control signal.
Further, in step (1), the parallel pipeline architecture is a 64 × 64 two-dimensional FFT architecture built by cascading two parallel-unrolled radix-64 blocks, and the 64 × 64 FFT architecture is expressed using a novel radix-64 algorithm based on radix-4 butterfly units.
Further, in step (2), in the 64 × 64 two-dimensional FFT architecture, data rearrangement is performed using a six-bit mode selection signal as a control signal.
Further, the two-dimensional FFT is obtained from two passes of one-dimensional N-point FFTs; an N × N two-dimensional FFT consists of N one-dimensional FFTs in the row direction and N one-dimensional FFTs in the column direction, and N^2 intermediate values are generated between the two passes and must be stored.
Further, the method for representing the 64 × 64 FFT architecture using the novel radix-64 algorithm based on the radix-4 butterfly unit is specifically as follows: a fully unrolled radix-64 architecture uses parallel radix-4 butterfly units as basic sub-blocks; the radix-4 butterfly unit has four parallel inputs and a two-bit control input called the mode select; the mode select signal determines which one of the four outputs is generated, so outputs can be generated in any order according to the mode select signal; the equation for the radix-4 butterfly unit used is as follows:
where x is the time-domain signal sequence and X is the frequency-domain signal sequence; the first stage has 16 twiddle factor read-only memories (ROMs) that store W16; each ROM contains four twiddle factor values; the second stage includes four radix-4 engines and four ROMs that store W64; each ROM of the second stage holds 16 twiddle factor values; the mode select is a 6-bit control signal, two bits of which are assigned to each stage; one of the 64 outputs is generated according to the mode of each stage; the remaining outputs of the radix-64 block are thus obtained in reordered form within the architecture, saving the memory and logic resources otherwise needed for reordering; initially, the mode selects of all radix-4 engines are configured as mode 0; the output produced by the first stage is multiplied by the corresponding twiddle factor; the second stage performs similar operations; here, four radix-4 engines are required to process the 16 outputs obtained from the first stage; likewise, the first four outputs are generated by configuring the mode select of all four radix-4 engines as mode 0 while keeping the radix-4 engines of the first stage in mode 0; using these four outputs, the output required for the third stage is generated, and another mode select is used in the final stage.
Further, the data scheduling in the 64 × 64FFT architecture includes the following steps:
consider input data of size 64 × 64 stored in RAM, given by matrix A, where A_{i,j} is the element in row i and column j; the data scheduling and reordering process in the first radix-64 block is: starting with the mode select signal set to 0, the inputs in cycle 1 are A_{1,1}, A_{2,1}, A_{3,1}, ..., A_{64,1}; performing the FFT of these 64 inputs produces, in the first cycle, the first output of the first row of the B matrix, denoted B_{1,1};
in cycle 2, the inputs A_{1,2}, A_{2,2}, A_{3,2}, ..., A_{64,2} are applied with the mode select signal held at 0, and the output generated is B_{1,2}; similarly, in cycle 64, the inputs A_{1,64}, A_{2,64}, A_{3,64}, ..., A_{64,64} with mode select 0 produce the output B_{1,64}; all elements of the first row are thus calculated; in the 65th cycle, all elements of the first row of the B matrix, namely B_{1,1}, B_{1,2}, ..., B_{1,64}, are input in parallel to the second radix-64 block; as soon as one row has been calculated by the first radix-64 block, the second radix-64 block performs the next stage of calculation; an intermediate register between the two radix-64 blocks stores the output of the first block for use by the second block; in the second radix-64 block, the mode select signal steps from 0 to 63, one value per cycle, producing the 64 outputs C_{1,1}, C_{1,2}, ..., C_{1,64};
a sequentially increasing mode select signal cannot be used for the first block, because the output of the first radix-64 block must be reordered before it is applied to the second block; the mode selects of the first radix-64 block must therefore be given in the order 0, 16, 32, 48, 1, 17, 33, 49, ..., 63;
in the 65th cycle, the mode select of the first radix-64 block becomes 16, and the inputs A_{1,1}, A_{2,1}, A_{3,1}, ..., A_{64,1} are provided again; performing the FFT of these 64 inputs, the first output of the second row of the B matrix, B_{2,1}, is calculated in the 65th cycle; in the 66th cycle, the second output of the second row of the B matrix, B_{2,2}, is calculated, and so on; in the 128th cycle, B_{2,64} is calculated; all elements of the second row are thus calculated;
in the 129th cycle, the second block starts its computation on the second row of the B matrix, stepping the mode select through 0, 1, 2, 3, ..., 63 over 64 cycles and producing the second-row elements of the C matrix; repeating this process until the mode select of the first block reaches 63 yields all elements of the 64 × 64 B matrix, which is the output of the first radix-64 block; one B matrix output is generated in each cycle; this completes the column-wise calculation of the two-dimensional FFT.
Further, in the output of the two-dimensional FFT, elements in a C matrix give a final two-dimensional FFT output; generating an output from the second block each clock cycle; therefore, 4096 cycles are required to produce all 4096 outputs of the two-dimensional FFT.
Further, the two-dimensional FFT architecture is realized by parallel unrolling of the radix-64 block, and the order of its outputs is controlled by several control bits; a one-dimensional FFT is performed with the first radix-64 block, the output of which is fed to the second radix-64 block to perform a row-by-row FFT and obtain the two-dimensional FFT; the first processor performs 64-point FFT operations, giving 4K (4096) intermediate values; the second FFT processor performs 64-point FFT operations on these outputs and gives the final 64 × 64-point FFT; the two radix-64 blocks are identical.
Advantageous effects: compared with the prior art, the optimized data rearrangement method based on the radix-64 two-dimensional FFT architecture provided by the invention has an efficient output reordering technique and uses parallel radix-4 butterfly units with a 6-bit control signal; the working memory of the one-dimensional FFT is reduced, and the number of intermediate memory units of the two-dimensional FFT is reduced from N^2 to N; an ASIC and FPGA implementation architecture is proposed; and the number of operation execution cycles is reduced.
Drawings
FIG. 1 is a parallel unfolding structure of the base-64 block;
FIG. 2 is a data schedule for a first base-64 block;
FIG. 3 is a data schedule for a second base-64 block;
FIG. 4 is a proposed two-dimensional FFT architecture using radix-64 blocks;
FIG. 5 is a comparison of the time consumption of the proposed architecture with that of the existing architecture;
FIG. 6 is a comparison of the hardware complexity of the proposed radix-64 line-parallel architecture with the existing radix-2 line architecture.
Detailed Description
The invention will be further described with reference to the following drawings and specific embodiments.
A data rearrangement optimizing method based on a radix-64 two-dimensional FFT architecture is suitable for implementing a two-dimensional FFT on an Application Specific Integrated Circuit (ASIC) or a Field Programmable Gate Array (FPGA). By exploiting the high regularity of the FFT algorithm, a parallel pipeline architecture that can be realized on both ASIC and FPGA is designed. The two-dimensional FFT is calculated from two passes of one-dimensional N-point FFTs, so the performance of the one-dimensional FFT directly affects the performance of the two-dimensional FFT. An N × N two-dimensional FFT can be viewed as N one-dimensional FFTs in the row direction and N one-dimensional FFTs in the column direction; N^2 intermediate values are generated between the two passes and must be stored.
In the present invention, the proposed 64 x 64 FFT architecture is expressed using a novel radix-64 algorithm based on a radix-4 butterfly unit. Two radix-64 blocks are cascaded to calculate a 64 x 64 complex-point FFT, and the outputs are generated in reordered form inside the radix-64 blocks, which saves intermediate memory and reduces delay. The method implements the two-dimensional FFT using a radix-64 parallel-unrolled architecture.
Proposed radix-64 architecture: a fully unrolled radix-64 architecture uses parallel radix-4 butterfly units, which are the basic sub-blocks of the proposed architecture. The radix-4 unit has four parallel inputs and a two-bit control input called the mode select. The mode select signal determines which one of the four outputs is generated, so the outputs may be generated in any order according to the mode select signal. In a conventional radix-4 butterfly unit, by contrast, the outputs are generated in a fixed order. The equation for the radix-4 butterfly unit used is as follows:
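Assuming the unit implements a standard 4-point DFT whose output index k is chosen by the two-bit mode select (an assumption, since the original formula is not reproduced in this text), the selected output can be written as:

```latex
X(k) = \sum_{n=0}^{3} x(n)\, W_4^{nk}
     = x(0) + (-i)^{k}\, x(1) + (-1)^{k}\, x(2) + i^{k}\, x(3),
\qquad k \in \{0, 1, 2, 3\}
```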
where x is the time-domain signal sequence and X is the frequency-domain signal sequence. The first stage has 16 twiddle factor read-only memories (ROMs) that store W16. Each ROM contains four twiddle factor values. The second stage includes four radix-4 engines and four ROMs that store W64. Each ROM of the second stage holds 16 twiddle factor values. The mode select is a 6-bit control signal, two bits of which are assigned to each stage. One of the 64 outputs is generated according to the mode of each stage. The remaining outputs of the radix-64 block are thus obtained in reordered form within the architecture, which saves the memory and logic resources otherwise needed for reordering. Initially, the mode selects of all radix-4 engines are configured as mode 0. The output produced by the first stage is multiplied by the corresponding twiddle factor. The second stage performs similar operations. Here, four radix-4 engines are required to process the 16 outputs obtained from the first stage. Likewise, the first four outputs are generated by configuring the mode select of all four radix-4 engines as mode 0 while keeping the radix-4 engines of the first stage in mode 0. Using these four outputs, the output required for the third stage can be generated, with another mode select used in the final stage.
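The three-stage structure can be illustrated with a short behavioural model. This is a sketch under stated assumptions rather than the patented circuit: the function names, the input index split n = 16*n1 + 4*n2 + n3, and the resulting output index k = k1 + 4*k2 + 16*k3 are choices made here for illustration. How the three two-bit modes are packed into the 6-bit mode select word is not specified in the text, so the modes are passed separately.

```python
import cmath

def w(n, e):
    """Twiddle factor W_n^e = exp(-2*pi*i*e/n)."""
    return cmath.exp(-2j * cmath.pi * e / n)

def radix4(x, mode):
    """One selected output (mode 0..3) of a 4-point DFT of x[0..3]."""
    return sum(x[n] * w(4, n * mode) for n in range(4))

def radix64_output(x, k1, k2, k3):
    """Output X[k1 + 4*k2 + 16*k3] of a 64-point DFT of x (hypothetical model)."""
    # Stage 1: 16 butterflies, all in mode k1, each followed by a W16 twiddle.
    s1 = [[radix4([x[16 * n1 + 4 * n2 + n3] for n1 in range(4)], k1)
           * w(16, n2 * k1)
           for n3 in range(4)] for n2 in range(4)]
    # Stage 2: 4 butterflies, all in mode k2, each followed by a W64 twiddle.
    s2 = [radix4([s1[n2][n3] for n2 in range(4)], k2)
          * w(64, n3 * (k1 + 4 * k2))
          for n3 in range(4)]
    # Stage 3: a single butterfly in mode k3 gives the selected output.
    return radix4(s2, k3)
```

In this model each stage-1 twiddle depends only on k1 (four possible values) and each stage-2 twiddle on the pair (k1, k2) (16 possible values), which is consistent with the ROM depths of four and 16 entries stated above.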
Data scheduling in the proposed structure: consider input data of size 64 x 64 stored in RAM, given by matrix A, where A_{i,j} is the element in the ith row and jth column.
The data scheduling and reordering process in the first radix-64 block is as follows. Starting with the mode select signal set to 0, the inputs in cycle 1 are A_{1,1}, A_{2,1}, A_{3,1}, ..., A_{64,1}. Performing the FFT of these 64 inputs produces, in the first cycle, the first output of the first row of the B matrix, denoted B_{1,1}. In cycle 2, the inputs A_{1,2}, A_{2,2}, A_{3,2}, ..., A_{64,2} are applied with the mode select signal held at 0, and the output generated is B_{1,2}. Similarly, in cycle 64, the inputs A_{1,64}, A_{2,64}, A_{3,64}, ..., A_{64,64} with mode select 0 produce the output B_{1,64}. All elements of the first row are thus calculated. In the 65th cycle, all elements of the first row of the B matrix, namely B_{1,1}, B_{1,2}, ..., B_{1,64}, are input in parallel to the second radix-64 block. The second radix-64 block therefore starts the next stage of calculation as soon as one row has been calculated by the first radix-64 block. An intermediate register between the two radix-64 blocks stores the output of the first block for use by the second block. In the second radix-64 block, the mode select signal steps from 0 to 63, one value per cycle, producing the 64 outputs C_{1,1}, C_{1,2}, ..., C_{1,64}.
A sequentially increasing mode select signal cannot be used for the first block, because the output of the first radix-64 block must be reordered before it is applied to the second block. The mode selects of the first radix-64 block must therefore be given in the order 0, 16, 32, 48, 1, 17, 33, 49, ..., 63.
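The listed mode order can be generated programmatically. The sketch below assumes that the elided middle of the sequence continues the visible pattern (groups of four values spaced 16 apart, with the group offset increasing by one); only the first eight values and the final value 63 are confirmed by the text, and the function name is hypothetical.

```python
def first_block_mode_order():
    """Mode select sequence for the first radix-64 block (pattern assumed).

    Produces 0, 16, 32, 48, 1, 17, 33, 49, ... ending at 63.
    """
    return [16 * a + b for b in range(16) for a in range(4)]
```

Under this assumption the sequence visits each of the 64 mode values exactly once, so every row of the B matrix is produced exactly once.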
In the 65th cycle, the mode select of the first radix-64 block becomes 16, and the inputs A_{1,1}, A_{2,1}, A_{3,1}, ..., A_{64,1} are provided again. Performing the FFT of these 64 inputs, the first output of the second row of the B matrix, B_{2,1}, is calculated in the 65th cycle. In the 66th cycle, the second output of the second row of the B matrix, B_{2,2}, is calculated, and so on. In the 128th cycle, B_{2,64} is calculated. All elements of the second row are thus calculated.
In the 129th cycle, the second block starts its computation on the second row of the B matrix, stepping the mode select through 0, 1, 2, 3, ..., 63 over 64 cycles and producing the second-row elements of the C matrix. Repeating this process until the mode select of the first block reaches 63 yields all elements of the 64 x 64 B matrix, which is the output of the first radix-64 block. One B matrix output is generated in each cycle. This completes the column-wise calculation of the two-dimensional FFT.
The elements in the C matrix give the final two-dimensional FFT output. One output is generated from the second block every clock cycle. Therefore, 4096 cycles are required to produce all 4096 outputs of the two-dimensional FFT.
Two-dimensional FFT architecture: a parallel unrolling implementation is used for the base-64 block. The order of the outputs is controlled by several control bits. In this architecture, for a given set of inputs, at each stage, only the outputs required for the next stage are calculated without any loss of performance, so most intermediate buffers can be avoided. Therefore, there is a great optimization in terms of memory and latency.
A one-dimensional FFT is performed with the first radix-64 block, the output of which is fed to the second radix-64 block to perform a row-by-row FFT and obtain the two-dimensional FFT. The first processor performs 64-point FFT operations, giving 4K (4096) intermediate values. The second FFT processor performs 64-point FFT operations on these outputs and gives the final 64 x 64-point FFT. The two radix-64 blocks are identical.
Examples
Fig. 1 shows the parallel-unrolled architecture of the radix-64 block. The first stage has 16 radix-4 units, the second stage has 4 radix-4 units, and the third stage has 1 radix-4 unit. In this architecture, all radix-4 blocks are identical. The symbols R4_0, R4_4, R4_8, R4_12 denote the 0th, 4th, 8th, and 12th radix-4 butterflies, and so on. W16 and W64 denote the twiddle factors of the first and second stages. The first stage has 16 twiddle factor read-only memories (ROMs) that store W16. Each ROM contains four twiddle factor values. The second stage includes four radix-4 units and four ROMs that store W64. Each ROM of the second stage holds 16 twiddle factor values.
Of the N multipliers in each stage, the first N/4 have the same twiddle factor, so these multipliers are removed in the implementation. Thus, the first stage has 12 multipliers instead of 16, and the second stage has 3 multipliers instead of 4.
The mode selection is a 6 bit control signal. Two of which are assigned to each phase. One of 64 outputs is generated according to the pattern of each stage. Thus, obtaining the remaining outputs of the base-64 block in a reordered form in this architecture may save memory and logic device resources for reordering. Initially, all mode selections for the base 4 unit are configured as mode 0. The output produced by the first stage is multiplied by the corresponding twiddle factor. The second phase performs similar operations. Here, four base 4 units are required to process the 16 outputs obtained from the first stage. Likewise, the first four outputs are generated by configuring the mode selection of all four base 4 cells as mode 0, keeping the base 4 cells of the first stage themselves in mode 0. Now, using these four outputs, the output required for the third stage can be generated and another mode selection used in the final stage.
Figs. 2 and 3 show the data scheduling for the radix-64 blocks, where a is the mode 0 output, b is the mode 16 output, and c is the mode 63 output. The data scheduling and reordering process in the first radix-64 block is as follows. Starting with the mode select signal set to 0, the inputs in cycle 1 are A_{1,1}, A_{2,1}, A_{3,1}, ..., A_{64,1}. Performing the FFT of these 64 inputs produces, in the first cycle, the first output of the first row of the B matrix, denoted B_{1,1}. In cycle 2, the inputs A_{1,2}, A_{2,2}, A_{3,2}, ..., A_{64,2} are applied with the mode select signal held at 0, and the resulting output is B_{1,2}. Similarly, in cycle 64, the inputs A_{1,64}, A_{2,64}, A_{3,64}, ..., A_{64,64} with mode select 0 produce the output B_{1,64}. All elements of the first row are thus calculated. In the 65th cycle, all elements of the first row of the B matrix, namely B_{1,1}, B_{1,2}, ..., B_{1,64}, are input in parallel to the second radix-64 block. The second radix-64 block therefore starts the next stage of calculation as soon as one row has been calculated by the first radix-64 block. An intermediate register between the two radix-64 blocks stores the output of the first block for use by the second block. In the second radix-64 block, the mode select signal steps from 0 to 63, one value per cycle, producing the 64 outputs C_{1,1}, C_{1,2}, ..., C_{1,64}.
A sequentially increasing mode select signal cannot be used for the first block, because the output of the first radix-64 block must be reordered before it is applied to the second block. The mode selects of the first radix-64 block must therefore be given in the order 0, 16, 32, 48, 1, 17, 33, 49, ..., 63.
In the 65th cycle, the mode select of the first radix-64 block becomes 16, and the inputs A_{1,1}, A_{2,1}, A_{3,1}, ..., A_{64,1} are provided again. Performing the FFT of these 64 inputs, the first output of the second row of the B matrix, B_{2,1}, is calculated in the 65th cycle. In the 66th cycle, the second output of the second row of the B matrix, B_{2,2}, is calculated, and so on. In the 128th cycle, B_{2,64} is calculated. All elements of the second row are thus calculated.
In the 129th cycle, the second block starts its computation on the second row of the B matrix, stepping the mode select through 0, 1, 2, 3, ..., 63 over 64 cycles and producing the second-row elements of the C matrix. Repeating this process until the mode select of the first block reaches 63 yields all elements of the 64 x 64 B matrix, which is the output of the first radix-64 block. One B matrix output is generated in each cycle. This completes the column-wise calculation of the two-dimensional FFT.
The elements in the C matrix give the final two-dimensional FFT output. One output is generated from the second block every clock cycle. Therefore, 4096 cycles are required to produce all 4096 outputs of the two-dimensional FFT.
Fig. 4 shows the proposed two-dimensional FFT architecture using radix-64 blocks. The order of the outputs is controlled by several control bits. In this architecture, for a given set of inputs, at each stage, only the outputs required for the next stage are calculated without any loss of performance, so most intermediate buffers can be avoided, with great optimization in terms of memory and latency, as shown by comparison of fig. 5 and 6.
A one-dimensional FFT is performed with the radix-64 block shown in fig. 1, the output of which is fed to a second radix-64 block to perform a row-by-row FFT, resulting in the two-dimensional FFT. The first processor performs 64-point FFT operations, giving 4K (4096) intermediate values. The second FFT processor performs 64-point FFT operations on these outputs and gives the final 64 x 64-point FFT. The two radix-64 blocks are identical.
Input buffering: the input memory block consists of two sets of 64 RAMs; while one set is being written, the other is being read. The inputs are written to consecutive locations in RAM, i.e., RAM0 receives the first 64 inputs, RAM1 receives the next 64 inputs starting from the 65th, and so on. During a read operation, each RAM provides one input, so that 64 inputs are available to the radix-64 block. The read operation is done in parallel, and the read addresses of all RAMs are the same.
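The write and read addressing of one RAM set can be modelled in a few lines. This is a minimal sketch, assuming the input matrix A is streamed in row-major order so that RAM r holds row r+1; the class and method names are hypothetical, and the second set used for double buffering would simply be another instance.

```python
class InputBank:
    """Behavioural model of one set of n RAMs, each n words deep."""

    def __init__(self, n=64):
        self.n = n
        self.rams = [[None] * n for _ in range(n)]

    def write(self, sample_index, value):
        """Store the sample_index-th input (0-based) at consecutive locations."""
        ram, addr = divmod(sample_index, self.n)  # RAM0 gets samples 0..n-1
        self.rams[ram][addr] = value

    def read(self, addr):
        """Read the same address from all RAMs in parallel."""
        return [ram[addr] for ram in self.rams]
```

With row-major input, `read(j - 1)` returns A_{1,j}, A_{2,j}, ..., A_{64,j}, which is the set of 64 inputs the first radix-64 block consumes in the cycles that process column j.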
Inter-stage processing: the inter-stage logic consists of two sets of 64 registers each. The first set is arranged as a chain of 64 shift registers, and the output of the first radix-64 block is fed into this chain. The first set shifts on every clock cycle. Once every 64 cycles, the outputs of all 64 registers in the first set are loaded in parallel into the second set. The registers in the second set serve as the inputs to the second radix-64 block.
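The two register sets behave like a serial-in, parallel-out buffer. The following sketch models that behaviour in software; the class name, the shift direction, and the return value of `clock` are assumptions made for illustration.

```python
class InterStage:
    """Shift-register chain plus a parallel-loaded holding register set."""

    def __init__(self, n=64):
        self.n = n
        self.shift = [0] * n  # first set: chain fed by the first block
        self.hold = [0] * n   # second set: inputs of the second block
        self.count = 0

    def clock(self, value):
        """Shift in one first-block output; return True when hold is reloaded."""
        self.shift = self.shift[1:] + [value]
        self.count += 1
        if self.count == self.n:
            # Once every n cycles, copy the whole chain in parallel.
            self.hold = list(self.shift)
            self.count = 0
            return True
        return False
```

Because the second set changes only once every 64 cycles, the second radix-64 block sees a stable row of B values while it steps through its 64 mode selects, and only 2 × 64 registers sit between the blocks rather than a full 64 × 64 buffer.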
Control circuit: the control circuit consists of a 12-bit up counter. The RAMs on the input side require 6 bits for addressing, and the proposed architecture has two such memory blocks. The read addresses, write addresses, and chip select signals are generated from the counter. In addition, the mode select signals of the two radix-64 blocks are generated from the counter. At write time, a chip select signal is generated separately for each RAM. At read time, the same location of all RAMs is accessed in parallel, and the control signals are generated accordingly. All these signals are synchronized with respect to the delay of the preceding stage.
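One way to derive the control signals from the 12-bit counter is sketched below. The bit assignment is an assumption made for illustration: the lower 6 bits are taken as the shared read address and the second block's mode select, the upper 6 bits index the first block's mode sequence, and the sequence itself extrapolates the pattern 0, 16, 32, 48, 1, 17, ... from the scheduling description. The function and field names are hypothetical.

```python
# Assumed first-block mode sequence: 0, 16, 32, 48, 1, 17, 33, 49, ..., 63.
MODE_ORDER = [16 * a + b for b in range(16) for a in range(4)]

def decode(counter):
    """Map a 12-bit counter value (cycle number minus 1) to control signals."""
    if not 0 <= counter < 4096:
        raise ValueError("counter must fit in 12 bits")
    row = counter >> 6    # upper 6 bits: which row of B is being produced
    col = counter & 0x3F  # lower 6 bits: which column of A is being read
    return {
        "read_addr": col,                # same address for all 64 RAMs
        "block1_mode": MODE_ORDER[row],  # held constant for 64 cycles
        "block2_mode": col,              # steps 0..63 on the previous row
        "b_element": (row + 1, col + 1), # 1-based indices of the B output
    }
```

For example, `decode(64)` corresponds to the 65th cycle and yields first-block mode 16 and the element B_{2,1}, matching the schedule described earlier. Pipeline latency is ignored, so the second block's mode applies to the row completed in the preceding 64 cycles.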
Continuous flow FFT: the proposed FFT employs a new type of data scheduling mechanism that supports continuous streaming data. Here, the butterfly unit continuously performs calculations on the stream data. The FFT processor receives one input sample per clock cycle.
Claims (8)
1. A data rearrangement optimizing method based on a base-64 two-dimensional FFT architecture is characterized in that: the method comprises the following steps:
(1) designing a parallel pipeline architecture realized on an ASIC (application specific integrated circuit) and an FPGA (field programmable gate array) by utilizing the high regularity of an FFT (fast Fourier transform algorithm);
(2) with the parallel pipeline architecture, data reordering is performed using a six-bit mode select signal as a control signal.
2. The optimized data rearrangement method based on radix-64 two-dimensional FFT architecture of claim 1, wherein: in the step (1), the parallel pipeline architecture is a 64 × 64 two-dimensional FFT architecture built by cascading two parallel-unrolled radix-64 blocks, and the 64 × 64 FFT architecture is represented by a radix-64 algorithm based on a radix-4 butterfly unit.
3. The optimized data rearrangement method based on radix-64 two-dimensional FFT architecture of claim 2, wherein: in step (2), in the 64 × 64 two-dimensional FFT architecture, data rearrangement is performed using a six-bit mode selection signal as a control signal.
4. The optimized data rearrangement method based on radix-64 two-dimensional FFT architecture of claim 3, wherein: the two-dimensional FFT is obtained from two passes of one-dimensional N-point FFTs; an N × N two-dimensional FFT consists of N one-dimensional FFTs in the row direction and N one-dimensional FFTs in the column direction, and N^2 intermediate values are generated between the two passes and must be stored.
5. The optimized data rearrangement method based on radix-64 two-dimensional FFT architecture of claim 3, wherein: expressing the 64 × 64 FFT architecture with the novel radix-64 algorithm based on radix-4 butterfly units specifically comprises the following: a fully unrolled radix-64 architecture uses parallel radix-4 butterfly units as its basic sub-blocks; each radix-4 butterfly unit has four parallel inputs, and its output is selected by a two-bit control input called the mode select; the mode select signal determines which one of the four outputs is generated, so that outputs can be produced in an arbitrary order;
the first stage has 16 twiddle factor read-only memories (ROMs) storing W16, each containing four twiddle factor values; the second stage comprises four radix-4 engines and four ROMs storing W64, each containing 16 twiddle factor values; the mode select is a 6-bit control signal, two bits of which are assigned to each stage; one of the 64 outputs is generated according to the mode of each stage; initially, the mode selects of all radix-4 engines are configured as mode 0; the output produced by the first stage is multiplied by the corresponding twiddle factor; the second stage performs a similar operation, and four radix-4 engines are required to process the 16 outputs obtained from the first stage; likewise, the first four outputs are generated by configuring the mode selects of all four radix-4 engines as mode 0 while keeping the radix-4 engines of the first stage in mode 0; from these four outputs the output required for the third stage is generated, and a further mode select is used in the final stage.
6. The optimized data rearrangement method based on radix-64 two-dimensional FFT architecture of claim 5, wherein: the data scheduling in the 64 × 64 FFT architecture includes the following steps:
consider 64 × 64 input data stored in RAM and given by a matrix $A$, where $A_{i,j}$ is the element in row $i$ and column $j$; the data scheduling and reordering process in the first base-64 block is: with the mode select signal initially set to 0, the inputs in cycle 1 are $A_{1,1}, A_{2,1}, A_{3,1}, \ldots, A_{64,1}$; performing an FFT on these 64 inputs produces, in the first cycle, the first output of the first row of the B matrix, denoted $B_{1,1}$;
in cycle 2, the inputs $A_{1,2}, A_{2,2}, A_{3,2}, \ldots, A_{64,2}$ are supplied with the mode select signal held at 0, and the output generated is $B_{1,2}$; in a similar manner, in cycle 64 the inputs $A_{1,64}, A_{2,64}, A_{3,64}, \ldots, A_{64,64}$ with mode select 0 produce the output $B_{1,64}$; all elements of the first row have then been calculated; in the 65th cycle, all elements of the first row of the B matrix, namely $B_{1,1}, B_{1,2}, \ldots, B_{1,64}$, are input to the second radix-64 block in parallel; once the first base-64 block has calculated one row, the second base-64 block executes the next stage of the calculation; an intermediate register between the two base-64 blocks stores the output of the first block for use by the second block; in the second base-64 block, the mode select signal steps from 0 to 63, one value per cycle, producing the 64 outputs $C_{1,1}, C_{1,2}, \ldots, C_{1,64}$;
a sequentially increasing mode select signal is not suitable for the first base-64 block, because the output of the first block would then have to be reordered before being applied to the second block; instead, the mode select of the first base-64 block must be given in the order 0, 16, 32, 48, 1, 17, 33, 49, ..., 63;
in the 65th cycle, the mode select of the first base-64 block becomes 16 and the inputs $A_{1,1}, A_{2,1}, A_{3,1}, \ldots, A_{64,1}$ are supplied; performing the FFT of these 64 inputs, the first output of the second row of the B matrix, $B_{2,1}$, is calculated in the 65th cycle; in the 66th cycle, the second output of the second row of the B matrix, $B_{2,2}$, is calculated, and so on; in the 128th cycle, $B_{2,64}$ is calculated; thus all elements of the second row are computed;
in the 129th cycle, the second block starts computing on the second row of the B matrix, with its mode select set to 0, 1, 2, 3, ..., 63 over 64 cycles, yielding the second-row elements of the C matrix; this process is repeated, changing the mode select until it reaches 63, to obtain all elements of the 64 × 64 B matrix, which is the output of the first base-64 block; one B matrix output is generated in each cycle; this completes the column-wise calculation of the two-dimensional FFT.
7. The optimized data rearrangement method based on radix-64 two-dimensional FFT architecture of claim 6, wherein: the elements of the C matrix give the final two-dimensional FFT output; the second block generates one output every clock cycle; therefore, 4096 cycles are required to produce all 4096 outputs of the two-dimensional FFT.
8. The optimized data rearrangement method based on radix-64 two-dimensional FFT architecture of claim 6, wherein: the two-dimensional FFT architecture is realized by parallel unrolling of a base-64 block, and its output order is controlled by control bits; the first radix-64 block performs a one-dimensional FFT, and its output is fed to the second radix-64 block, which performs a row-by-row FFT to obtain the two-dimensional FFT; the first processor performs 64-point FFT operations, giving 4K (4096) intermediate values; the second FFT processor performs 64-point FFT operations on these values and gives the final 64 × 64-point FFT; the two base-64 blocks are identical.
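The three-stage radix-64 structure built from radix-4 butterflies, as set out in claim 5, can be illustrated with a short Python sketch. It is not the claimed circuit: the index split n = 16·n1 + 4·n2 + n3, the names `w`, `radix4`, `radix64_output` and `dft_bin`, and the way the 6-bit mode select is divided into three 2-bit fields `k1`, `k2`, `k3` are assumptions made for illustration. Under these assumptions the selected output index is k1 + 4·k2 + 16·k3, which need not match the bit ordering used in the hardware.

```python
import cmath

def w(n, e):
    """Twiddle factor W_n^e = exp(-2*pi*j*e/n)."""
    return cmath.exp(-2j * cmath.pi * e / n)

def radix4(inputs, mode):
    """Radix-4 butterfly that emits only the output chosen by the 2-bit mode."""
    return sum(v * w(4, i * mode) for i, v in enumerate(inputs))

def radix64_output(x, k1, k2, k3):
    """Return bin k1 + 4*k2 + 16*k3 of the 64-point DFT of x."""
    # Stage 1: 16 butterflies over n1, each followed by a W16 twiddle.
    # For a fixed (n2, n3) only four twiddle values occur (one per k1).
    s1 = [[radix4([x[16 * n1 + 4 * n2 + n3] for n1 in range(4)], k1)
           * w(16, n2 * k1)
           for n3 in range(4)] for n2 in range(4)]
    # Stage 2: 4 butterflies over n2, each followed by a W64 twiddle.
    # For a fixed n3, sixteen twiddle values occur (one per (k1, k2)).
    s2 = [radix4([s1[n2][n3] for n2 in range(4)], k2)
          * w(64, n3 * (k1 + 4 * k2))
          for n3 in range(4)]
    # Stage 3: a single butterfly over n3 selects the final output.
    return radix4(s2, k3)

def dft_bin(x, k):
    """Direct DFT sum for bin k, used only as a reference."""
    return sum(v * w(len(x), i * k) for i, v in enumerate(x))
```

Calling `radix64_output(x, k1, k2, k3)` for all 64 combinations of the three fields reproduces every bin of the direct 64-point DFT, one output per mode value, which is the property the mode select relies on.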
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011245309.9A CN112328958B (en) | 2020-11-10 | 2020-11-10 | Optimized data rearrangement method of two-dimensional FFT architecture based on base-64 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112328958A (en) | 2021-02-05 |
CN112328958B (en) | 2024-06-21 |
Family
ID=74317874
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011245309.9A Active CN112328958B (en) | 2020-11-10 | 2020-11-10 | Optimized data rearrangement method of two-dimensional FFT architecture based on base-64 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112328958B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN112328958B (en) | 2024-06-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||