CN118152710A - Implementation of a pipeline structure with multi-path parallel input and output for the butterfly unit of a DSP (digital signal processor) core FFT (fast Fourier transform) coprocessor

Publication number
CN118152710A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410318071.XA
Other languages
Chinese (zh)
Inventor
王玉体
张家钧
姚力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Jicui Integrated Circuit Application Technology Innovation Center Co ltd
Yangtze River Delta Integrated Circuit Industrial Application Technology Innovation Center
Jiangsu Jicui Integrated Circuit Application Technology Management Co ltd
Original Assignee
Jiangsu Jicui Integrated Circuit Application Technology Innovation Center Co ltd
Yangtze River Delta Integrated Circuit Industrial Application Technology Innovation Center
Jiangsu Jicui Integrated Circuit Application Technology Management Co ltd

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses an implementation of a pipeline structure with multi-path parallel input and output for the butterfly unit of a DSP (digital signal processor) core FFT (fast Fourier transform) coprocessor. The 2^n-way parallel results output by the butterfly operation units at any stage of a radix-2^n decimation-in-time (DIT) FFT transform synthesis are stored in parallel into the buffer that serves as the next stage's input, in the order of the addresses from which the 2^n-way input operands required by each butterfly operation unit of the next stage will be read. Throughout the FFT transform synthesis, all stages share one input buffer and one output buffer, and the two buffers swap roles between adjacent stages, so that the synthesis proceeds in a pipelined manner: the 2^n-way input operands and the 2^n-way operation results of the butterfly units in the multi-stage pipeline are read and buffered simultaneously, which markedly improves performance compared with a design that uses a single buffer as both input and output. Each shared buffer is split into 2^n sub-buffers of equal capacity, which improves the performance of the FFT transform synthesis by nearly 2^n - 1 times while keeping the ASIC area essentially unchanged.

Description

Implementation of pipeline structure of multipath parallel input and output of butterfly unit of DSP (digital Signal processor) core FFT (fast Fourier transform) coprocessor
Technical Field
The invention relates to a digital signal processor and a signal processing method, in particular to an implementation of a pipeline structure of multipath parallel input and output of a butterfly unit of a DSP core FFT coprocessor.
Background
Digital signal processors (DSPs) are widely used in the field of communications. The fast Fourier transform (FFT), as the basic operation for converting between the time domain and the frequency domain, is used extensively in digital spectrum analysis and in 5G communication channel coding.
In implementations of radix-2^n decimation-in-time (DIT) FFT algorithms such as radix-2, radix-4, radix-8 and radix-16, each butterfly operation fetches its 2^n input operands one after another from the input buffer over 2^n clock cycles and passes them to the butterfly unit after serial-to-parallel conversion. The 2^n parallel outputs of the butterfly operation then undergo parallel-to-serial conversion and take another 2^n clock cycles to be saved in the output buffer. This reduces the performance of the whole FFT transform.
Disclosure of Invention
Purpose of the invention: in view of the problems in the prior art, the invention aims to provide an implementation of a pipeline structure with multi-path parallel input and output for the butterfly unit of a DSP core FFT coprocessor. The operation results of the butterfly units of the current stage are stored into the output buffer in the order of the addresses from which the butterfly operations executed sequentially by the next stage will read their 2^n-way data from that buffer. This ensures that the 2^n-way operands required by each butterfly operation at every stage are taken from a single address of the output buffer, so the performance of the FFT transform can be markedly improved while using the same memory resources.
Technical scheme: a pipeline structure with multi-path parallel input and output for the butterfly unit of a DSP core FFT coprocessor, comprising:
an acquisition module, used for acquiring the sampling points of the N-point radix-2^n decimation-in-time (DIT) FFT transform synthesis, where N is the number of sampling points and n is less than or equal to 4;
a preprocessing module, used for performing bit-reversal processing on the sampling points and arranging them in sequence; the N-point radix-2^n DIT FFT transform synthesis comprises S stages, each stage comprising N/2^n = 2^(n(S-1)) butterfly operation units, S = log_{2^n} N; the 2^n-way operands required by the N/2^n butterfly unit operations of an even-numbered stage are stored in a first buffer unit and the operation results are stored in a second buffer unit; the 2^n-way operands required by the N/2^n butterfly unit operations of an odd-numbered stage are stored in the second buffer unit and the operation results are stored in the first buffer unit; the capacity of each of the first and second buffer units is N × 64 bits;
a sampling-point buffering module, used for sequentially writing the elements of the sequence formed by the N bit-reversed sampling points into the first buffer unit in their order of arrangement;
an operand acquisition module, used for reading the 2^n-way input operands required by the butterfly operation units of each stage;
a result buffer address conversion module, used for writing the 2^n-way operation results of the butterfly units of each stage;
and a final result reading module, used for reading the final result of the N-point radix-2^n DIT FFT transform synthesis.
An operation method based on the pipeline structure, for N-point radix-2^n decimation-in-time (DIT) FFT transform synthesis, in which each sampling point is represented by a 64-bit complex number. After bit reversal, the N sampling points form a sequence whose subscript index is m; when m is divided by 2^n, the quotient is written m//2^n and the remainder is written m%2^n, with m ∈ [0, N-1]. The elements of the sequence are arranged in the following order:
lst[0],lst[1],lst[2],lst[3],…,lst[N-2],lst[N-1];
The FFT transform synthesis comprises S stages, each stage comprising N/2^n = 2^(n(S-1)) butterfly units, where S = log_{2^n} N.
The 2^n-way operands required by the N/2^n butterfly unit operations of an even-numbered stage are stored in a first buffer unit, and the operation results are stored in a second buffer unit; the 2^n-way operands required by the N/2^n butterfly unit operations of an odd-numbered stage are stored in the second buffer unit, and the operation results are stored in the first buffer unit. The capacity of each of the first and second buffer units is N × 64 bits.
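As a quick check of the stage arithmetic above, the following minimal sketch computes S and the number of butterfly operations per stage for a given N and n. The function name `stage_plan` is hypothetical and does not come from the patent.

```python
def stage_plan(N, n):
    """Return (S, butterflies_per_stage) for an N-point radix-2**n FFT."""
    radix = 2 ** n
    if N < radix:
        raise ValueError("N must be at least 2**n")
    S = 0
    size = 1
    while size < N:
        size *= radix
        S += 1
    if size != N:
        raise ValueError("N must be an integer power of 2**n")
    butterflies = N // radix
    # The text states N / 2**n == 2**(n * (S - 1)); check it here.
    assert butterflies == 2 ** (n * (S - 1))
    return S, butterflies
```

For N = 64 and n = 2 this gives S = 3 stages with 16 butterfly operations each.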
In one embodiment, the first buffer unit and/or the second buffer unit is a true dual-port RAM. Further, the first buffer unit and/or the second buffer unit is composed of 2^n buffer subunits, each with a capacity of (N/2^n) × 64 bits.
In one embodiment, n = 2; the first buffer unit and/or the second buffer unit consists of 4 buffer subunits, each with a capacity of (N/4) × 64 bits; the buffer subunits of the first buffer unit are Grp1_ram0, Grp1_ram1, Grp1_ram2 and Grp1_ram3; the buffer subunits of the second buffer unit are Grp2_ram0, Grp2_ram1, Grp2_ram2 and Grp2_ram3.
In one embodiment, the operation method includes a step of sequentially writing the elements of the sequence formed by the N bit-reversed sampling points into the first buffer unit in their order of arrangement. The specific writing order is as follows:
S1, when the subscript index m satisfies m%4 = 0, the corresponding element is written into Grp1_ram0, and the write addresses are 0, 1, …, N/4-1 in sequence;
S2, when the subscript index m satisfies m%4 = 1, the corresponding element is written into Grp1_ram1, and the write addresses are 0, 1, …, N/4-1 in sequence;
S3, when the subscript index m satisfies m%4 = 2, the corresponding element is written into Grp1_ram2, and the write addresses are 0, 1, …, N/4-1 in sequence;
S4, when the subscript index m satisfies m%4 = 3, the corresponding element is written into Grp1_ram3, and the write addresses are 0, 1, …, N/4-1 in sequence;
The N sampling points are written into the first buffer unit on successive clka clock cycles in the order lst[0], lst[1], lst[2], lst[3], …, lst[N-2], lst[N-1].
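The write order in steps S1 to S4 amounts to sending element m to sub-RAM m%4 at address m//4. A minimal sketch of that mapping follows; the function names are hypothetical, and plain Python lists stand in for Grp1_ram0 to Grp1_ram3.

```python
def initial_write_slot(m):
    """Map sequence index m to (sub-RAM number, address) in the first buffer unit."""
    return m % 4, m // 4


def load_first_buffer(lst):
    """Model writing lst[0], lst[1], ... into four sub-RAMs, one element per cycle."""
    N = len(lst)
    rams = [[None] * (N // 4) for _ in range(4)]  # stands in for Grp1_ram0..Grp1_ram3
    for m, sample in enumerate(lst):
        ram, addr = initial_write_slot(m)
        rams[ram][addr] = sample
    return rams
```

With a 16-element sequence, sub-RAM 0 ends up holding elements 0, 4, 8 and 12 at addresses 0 to 3, so each sub-RAM sees its addresses filled in increasing order, as the steps describe.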
In one embodiment, the operation method includes a step of reading the 4-way input operands required by the butterfly operation units of each stage, specifically as follows:
The j-th input operand required by the i-th butterfly operation at each stage of the FFT transform synthesis is read from address i//4 of the corresponding buffer subunit, where i//4 denotes the integer quotient of i divided by 4, i ∈ [0, N/4-1], and j is the number of the buffer subunit that is read, j ∈ [0, 3];
The read addresses of the 4-way input operands required by the N/4 butterfly operations executed sequentially at each stage are 0, 1, …, N/4-1 in sequence in the 4 buffer subunits corresponding to that synthesis stage;
The 4 buffer subunits used when reading the 4-way input operands of the butterfly operation unit are selected as follows:
When the synthesis stage number s is even, the 4-way input operands are read from Grp1_ram0, Grp1_ram1, Grp1_ram2 and Grp1_ram3;
When the synthesis stage number s is odd, the 4-way input operands are read from Grp2_ram0, Grp2_ram1, Grp2_ram2 and Grp2_ram3;
The stages of the synthesis process are numbered 0, 1, …, S-1, i.e. s ∈ [0, S-1].
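The role switching between the two buffer groups can be sketched as a function of the stage number. The read side follows the selection rule above; the write side is taken to be the other group, as stated earlier for even-numbered and odd-numbered stages. The function name `stage_buffers` is hypothetical.

```python
def stage_buffers(s):
    """Return (group read from, group written to) for synthesis stage s."""
    if s % 2 == 0:
        return "Grp1", "Grp2"  # even stage: operands in Grp1, results to Grp2
    return "Grp2", "Grp1"      # odd stage: operands in Grp2, results to Grp1
```

For S = 3 the stages use (Grp1, Grp2), (Grp2, Grp1), (Grp1, Grp2) in turn, so each stage's output group is the next stage's input group.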
In one embodiment, the operation method includes a step of writing the 4-way operation results of the butterfly units of each stage, specifically as follows:
The address at which the j-th operation result of the i-th butterfly operation at each stage of the FFT transform synthesis is written into the corresponding buffer subunit is i//4 + j × (4**(S-2)). That is, i//4 + j × (4**(S-2)) is the address at which the j-th result output by the i-th butterfly operation is written into buffer subunit Grp1_ram(i%4) or Grp2_ram(i%4), i%4 is the number of the buffer subunit into which the butterfly unit's results are written, and ** denotes exponentiation;
The 4 buffer subunits into which the 4-way operation results of the butterfly unit are written are selected as follows:
When the synthesis stage number s is even, the 4-way operation results are written into Grp2_ram0, Grp2_ram1, Grp2_ram2 and Grp2_ram3;
When the synthesis stage number s is odd, the 4-way operation results are written into Grp1_ram0, Grp1_ram1, Grp1_ram2 and Grp1_ram3.
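A short sketch can confirm that this write-address rule fills every location of the four sub-RAMs exactly once per stage. The function names are hypothetical, and the check assumes N = 4**S with S ≥ 2 so that 4**(S-2) is an integer.

```python
def result_write_slot(i, j, S):
    """(sub-RAM number, address) for the j-th result of the i-th butterfly operation."""
    return i % 4, i // 4 + j * 4 ** (S - 2)


def covers_every_slot(S):
    """True if the N results of one stage land on N distinct (sub-RAM, address) slots."""
    N = 4 ** S
    slots = {result_write_slot(i, j, S) for i in range(N // 4) for j in range(4)}
    expected = {(ram, addr) for ram in range(4) for addr in range(N // 4)}
    return slots == expected
```

For N = 64 (S = 3), butterfly operation 5 writes its result number 2 to sub-RAM 1 at address 1 + 2 × 4 = 9.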
In one embodiment, the operation method includes a step of reading the final result of the N-point radix-4 decimation-in-time FFT transform synthesis, specifically as follows:
The final result of the FFT transform synthesis is stored in the following locations:
When the total number of FFT transform synthesis stages S is even, the final result of the N-point FFT transform synthesis is stored in the 4 buffer subunits of the first buffer unit;
When the total number of FFT transform synthesis stages S is odd, the final result of the N-point FFT transform synthesis is stored in the 4 buffer subunits of the second buffer unit;
The addresses from which the final result of the N-point FFT transform synthesis is read out of the first or second buffer unit are as follows:
The read address ram_address of Grp1/2_ram0 is 0, 1, …, N/4-1 in sequence, and the value read is stored in the sequence element with index 4 × ram_address;
The read address ram_address of Grp1/2_ram1 is 0, 1, …, N/4-1 in sequence, and the value read is stored in the sequence element with index 4 × ram_address + 1;
The read address ram_address of Grp1/2_ram2 is 0, 1, …, N/4-1 in sequence, and the value read is stored in the sequence element with index 4 × ram_address + 2;
The read address ram_address of Grp1/2_ram3 is 0, 1, …, N/4-1 in sequence, and the value read is stored in the sequence element with index 4 × ram_address + 3;
The final result of the N-point FFT transform synthesis is read from the buffer subunits of the first or second buffer unit in the following order:
The results at address ram_address are read from Grp1/2_ram0, Grp1/2_ram1, Grp1/2_ram2 and Grp1/2_ram3 in turn, then the results at address ram_address + 1 are read from Grp1/2_ram0, Grp1/2_ram1, Grp1/2_ram2 and Grp1/2_ram3 in turn, and so on until the results at address N/4-1 have been read.
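The readout rule can be sketched as a generator that walks the addresses in order and visits the four sub-RAMs at each address. The function names are hypothetical; the group selection follows the parity rule for S stated above.

```python
def final_result_group(S):
    """Buffer group holding the final result, by parity of the stage count S."""
    return "Grp1" if S % 2 == 0 else "Grp2"


def final_readout_order(N):
    """Yield (sub-RAM number, ram_address, sequence index) in read order."""
    for ram_address in range(N // 4):
        for ram in range(4):
            yield ram, ram_address, 4 * ram_address + ram
```

Read in this order, the sequence indices come out as 0, 1, 2, …, N-1, so the result sequence is filled from front to back.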
A computer-readable storage medium storing at least one executable instruction that, when executed on an electronic device, causes the electronic device to perform the operations of the computing method.
Compared with the prior art, the method has the following beneficial effects:
1. Throughout the FFT transform synthesis, all stages share one input buffer and one output buffer, and the two buffers swap roles between adjacent stages, so that the synthesis proceeds in a pipelined manner: the 2^n-way input operands and the 2^n-way operation results of the butterfly units in the multi-stage pipeline are read and buffered simultaneously, which markedly improves performance compared with a design that uses a single buffer as both input and output;
2. Each of the two shared buffers is split into 2^n sub-buffers of equal capacity, which improves the performance of the FFT transform synthesis by nearly 2^n - 1 times while keeping the ASIC area essentially unchanged.
Drawings
FIG. 1 is a diagram of a true dual port RAM interface in accordance with one embodiment of the present invention;
FIG. 2 is a schematic diagram showing a connection relationship between each sub-RAM and a butterfly unit according to an embodiment of the present invention;
FIG. 3 is a timing diagram of the stage-2 butterfly operation process of the N = 64-point radix-4 decimation-in-time FFT;
FIG. 4 is a timing diagram of the 16 reads of 4-way input operands from RAM_Grp1/2 at stage 0 of the N = 64-point radix-4 decimation-in-time FFT;
FIG. 5 is a timing diagram of writing the 4-way operation results of stage 0 of the N = 64-point radix-4 decimation-in-time FFT into RAM_Grp1/2.
Detailed Description
In a first aspect, the invention provides a DSP core FFT coprocessor based on a multistage pipeline structure, the processor comprising an acquisition module, a preprocessing module, a buffer sampling point module, an operand acquisition module, a result buffer address conversion module, and a final result reading module.
The acquisition module is used for acquiring the sampling points of the N-point radix-2^n decimation-in-time (DIT) FFT transform synthesis, where N is the number of sampling points and n is generally 1, 2, 3 or 4.
The preprocessing module is used for performing bit-reversal processing on the sampling points and arranging them in sequence. The FFT transform synthesis comprises S stages, each stage comprising N/2^n = 2^(n(S-1)) butterfly units, where S = log_{2^n} N. The 2^n-way operands required by the N/2^n butterfly unit operations of an even-numbered stage are stored in a first buffer unit, and the operation results are stored in a second buffer unit; the 2^n-way operands required by the N/2^n butterfly unit operations of an odd-numbered stage are stored in the second buffer unit, and the operation results are stored in the first buffer unit. The capacity of each of the first and second buffer units is N × 64 bits.
The sampling-point buffering module is used for sequentially writing the elements of the sequence formed by the N bit-reversed sampling points into the first buffer unit in their order of arrangement.
The operand acquisition module is used for reading the 2^n-way input operands required by the butterfly operation units of each stage.
The result buffer address conversion module is used for writing the 2^n-way operation results of the butterfly units of each stage.
The final result reading module is used for reading the final result of the N-point radix-2^n DIT FFT transform synthesis.
In a second aspect, based on the DSP core FFT coprocessor, the invention provides an operation method for N-point radix-2^n DIT FFT transform synthesis, in which each sampling point is represented by a 64-bit complex number. After bit reversal, the N sampling points form a sequence whose subscript index is m; when m is divided by 2^n, the quotient is written m//2^n and the remainder is written m%2^n, with m ∈ [0, N-1]. The elements of the sequence are arranged in the following order:
lst[0],lst[1],lst[2],lst[3],…,lst[N-2],lst[N-1];
The FFT transform synthesis comprises S stages, each stage comprising N/2^n = 2^(n(S-1)) butterfly units, where S = log_{2^n} N.
The 2^n-way operands required by the N/2^n butterfly unit operations of an even-numbered stage are stored in a first buffer unit, and the operation results are stored in a second buffer unit; the 2^n-way operands required by the N/2^n butterfly unit operations of an odd-numbered stage are stored in the second buffer unit, and the operation results are stored in the first buffer unit. The capacity of each of the first and second buffer units is N × 64 bits.
The technical scheme of the operation method of the invention is explained in detail below, taking the radix-4 decimation-in-time (DIT) FFT algorithm as an example, with reference to the drawings and specific embodiments.
For the N-point radix-4 DIT FFT transform, the following conventions apply:
each of the N sampling points is represented by a 64-bit complex number;
the N bit-reversed sampling points form a sequence (denoted as a list lst) whose elements are arranged as follows:
lst[0],lst[1],lst[2],lst[3],…,lst[N-2],lst[N-1];
In the list lst, the quotient of the subscript index m divided by 4 is written m//4 and the remainder is written m%4, with m ∈ [0, N-1].
The whole FFT transform synthesis comprises S stages, each stage comprising N/4 = 4^(S-1) butterfly operation units, where
S = log_4 N.
For stage s, 4 raised to the power s is written 4^s, and the least significant bit of the binary representation of s is written s[0], with s ∈ [0, S-1].
In the FFT transform synthesis, stages are numbered from stage 0 up to stage S-1. The four-way operands required by the N/4 butterfly operations of an even-numbered stage (i.e. s[0] = 0) are stored in the buffer RAM_Grp1 shown in FIG. 2, and the operation results are stored in the buffer RAM_Grp2; the four-way operands required by the N/4 butterfly operations of an odd-numbered stage (i.e. s[0] = 1) are stored in the buffer RAM_Grp2 shown in FIG. 2, and the operation results are stored in the buffer RAM_Grp1. RAM_Grp1 and RAM_Grp2 each have a capacity of N × 64 bits, and each is composed of 4 true dual-port RAMs with a capacity of (N/4) × 64 bits; the sub-RAMs of RAM_Grp1 are Grp1_ram0, Grp1_ram1, Grp1_ram2 and Grp1_ram3; the sub-RAMs of RAM_Grp2 are Grp2_ram0, Grp2_ram1, Grp2_ram2 and Grp2_ram3.
Port A of each sub-RAM is used by the system bus outside the FFT accelerator to write the N bit-reversed sampling points into the sub-RAMs and to read the final FFT transform result out of them; it uses an independent clock, clka. Port B of each sub-RAM is used to read the 4-way input data required by the butterfly operation unit inside the FFT accelerator and to write the transform results of each stage of the FFT transform; it also uses an independent clock, clkb.
As shown in FIG. 1, CEBA/B are active low; WEBA/B high indicates a read operation and WEBA/B low indicates a write operation. All 8 sub-RAMs in FIG. 2, Grp1_ram0/1/2/3 and Grp2_ram0/1/2/3, use the same true dual-port RAM shown in FIG. 1.
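The control-pin polarity described above can be summarised in a small decoder. This is only an illustrative sketch: the function name is hypothetical, and treating a deasserted (high) chip enable as an idle port is an assumption based on the usual meaning of an active-low enable, not something the text states.

```python
def port_operation(ceb, web):
    """Decode one RAM port's CEB (active-low enable) and WEB (high = read, low = write)."""
    if ceb != 0:
        return "idle"  # assumption: the port does nothing while CEB is high
    return "read" if web == 1 else "write"
```

The same decoding would apply to port A (CEBA, WEBA) and port B (CEBB, WEBB), since both ports share the polarity described for FIG. 1.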
FIG. 3 is a timing diagram of the stage-2 butterfly operation process of the N = 64-point radix-4 FFT. It shows only the real parts of the 4-way operands (x0/1/2/3_real[31:0]), the real parts of the 3 twiddle factors (factor1/2/3_real[31:0]) and the real parts of the results (y0/1/2/3_real[31:0]).
Specifically, the butterfly operation process comprises the following key steps:
(1) The data in the sequence (lst) formed by the N bit-reversed sampling points are written sequentially into RAM_Grp1 in FIG. 2, in the following writing order:
S1: when the subscript index m of an element in the list satisfies m%4 = 0, the corresponding element (sampling point) is written into Grp1_ram0 in FIG. 2, and the write addresses are 0, 1, …, N/4-1 in sequence;
S2: when the subscript index m satisfies m%4 = 1, the corresponding element is written into Grp1_ram1, and the write addresses are 0, 1, …, N/4-1 in sequence;
S3: when the subscript index m satisfies m%4 = 2, the corresponding element is written into Grp1_ram2, and the write addresses are 0, 1, …, N/4-1 in sequence;
S4: when the subscript index m satisfies m%4 = 3, the corresponding element is written into Grp1_ram3, and the write addresses are 0, 1, …, N/4-1 in sequence;
S5: the N sampling points are written into RAM_Grp1 of FIG. 2 on successive clka clock cycles in the order lst[0], lst[1], lst[2], lst[3], …, lst[N-2], lst[N-1].
(2) The 4-way input operands required by the butterfly operation units at each stage of the N-point FFT transform synthesis are read.
The j-th input operand required by the i-th butterfly operation unit at each stage of the FFT transform synthesis is read from address (i//4) of the corresponding sub-RAM, where i ∈ [0, N/4-1] and j is the number of the sub-RAM that is read, j ∈ [0, 3], i.e. j selects Grp1_ramj or Grp2_ramj;
For example, if the synthesis stage number s is even (i.e. s[0] = 0), then j = 0 means that input operand 0 is read from the sub-RAM Grp1_ram0; if the synthesis stage number s is odd (i.e. s[0] = 1), then j = 0 means that input operand 0 is read from the sub-RAM Grp2_ram0;
i//4 means that the j-th input operand is read from address i//4 of the corresponding sub-RAM (i.e. Grp1/2_ramj);
The read addresses, in the 4 sub-RAMs corresponding to the synthesis stage, of the 4-way input operands required by the N/4 butterfly operations executed sequentially at each stage of the N-point FFT transform synthesis are, in sequence: 0, 1, …, N/4-1.
The 4 sub-RAMs used when reading the 4-way input operands of the butterfly operation unit are selected as follows:
when the synthesis stage number s is even (i.e. s[0] = 0), the 4-way input operands are read from Grp1_ram0, Grp1_ram1, Grp1_ram2 and Grp1_ram3;
when the synthesis stage number s is odd (i.e. s[0] = 1), the 4-way input operands are read from Grp2_ram0, Grp2_ram1, Grp2_ram2 and Grp2_ram3.
The stages of the synthesis process are numbered in the order stage 0, stage 1, …, stage S-1, where S = log_4 N.
FIG. 4 is a timing diagram of the 16 reads of 4-way input operands from RAM_Grp1/2 at stage 0 of the N = 64-point radix-4 decimation-in-time FFT.
Here, dram1e0_addrb[7:0] / dram1e1_addrb[15:8] / dram1e2_addrb[23:16] / dram1e3_addrb[31:24] correspond in turn to AB[7:0] of port B of Grp1_ram0/1/2/3 or Grp2_ram0/1/2/3;
dram1e0_doutb[63:0] / dram1e1_doutb[127:64] / dram1e2_doutb[191:128] / dram1e3_doutb[255:192] correspond in turn to QB[63:0] of port B of Grp1_ram0/1/2/3 or Grp2_ram0/1/2/3;
WEBB[0]/[1]/[2]/[3] correspond in turn to WEBB of port B of Grp1_ram0/1/2/3 or Grp2_ram0/1/2/3; CEBB[0]/[1]/[2]/[3] correspond in turn to CEBB of port B of Grp1_ram0/1/2/3 or Grp2_ram0/1/2/3.
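The bit ranges listed above suggest that the four 8-bit port B addresses and the four 64-bit port B data words are slices of one 32-bit and one 256-bit bus respectively. Under that assumption (the text gives only the bit ranges, not the bus declarations), the per-sub-RAM fields can be extracted as in the sketch below; the helper names are hypothetical.

```python
ADDR_BITS = 8   # width of AB[7:0] on each sub-RAM
DATA_BITS = 64  # width of QB[63:0] on each sub-RAM


def addr_field(addrb_bus, k):
    """Address of sub-RAM k, taken from bits [8k+7 : 8k] of a 32-bit bus."""
    return (addrb_bus >> (ADDR_BITS * k)) & ((1 << ADDR_BITS) - 1)


def dout_field(doutb_bus, k):
    """Data word of sub-RAM k, taken from bits [64k+63 : 64k] of a 256-bit bus."""
    return (doutb_bus >> (DATA_BITS * k)) & ((1 << DATA_BITS) - 1)
```

For example, an address bus value of 0x04030201 would carry address 1 for sub-RAM 0 and address 4 for sub-RAM 3.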
(3) The 4-way operation results of the butterfly units at each stage of the N-point FFT transform synthesis are written.
The address at which the j-th operation result of the i-th butterfly unit operation at each stage of the FFT transform synthesis is written into the corresponding sub-RAM is i//4 + j × (4**(S-2)), where i ∈ [0, N/4-1], j ∈ [0, 3], and i%4 is the number of the sub-RAM into which the butterfly unit's results are written, i.e. it selects Grp1_ram(i%4) or Grp2_ram(i%4);
For example, if the synthesis stage number s is even (i.e. s[0] = 0), then i = 0 means that the butterfly unit's results are written into the sub-RAM Grp2_ram0; if the synthesis stage number s is odd (i.e. s[0] = 1), then i = 0 means that the butterfly unit's results are written into the sub-RAM Grp1_ram0;
i//4 + j × (4**(S-2)) is the address at which the j-th operation result output by the i-th butterfly unit is written into the sub-RAM (i.e. Grp1_ram(i%4) or Grp2_ram(i%4));
The 4 sub-RAMs into which the 4-way operation results of the butterfly unit are written are selected as follows:
When the synthesis stage number is even (i.e. s[0] = 0), the 4-way operation results are written into Grp2_ram0, Grp2_ram1, Grp2_ram2 and Grp2_ram3;
When the synthesis stage number is odd (i.e. s[0] = 1), the 4-way operation results are written into Grp1_ram0, Grp1_ram1, Grp1_ram2 and Grp1_ram3.
FIG. 5 is a timing diagram of writing the 4-way operation results of stage 0 of the N = 64-point radix-4 decimation-in-time FFT into RAM_Grp1/2.
Here, dram2e0_addrb[7:0] / dram2e1_addrb[15:8] / dram2e2_addrb[23:16] / dram2e3_addrb[31:24] correspond in turn to AB[7:0] of port B of Grp1_ram0/1/2/3 or Grp2_ram0/1/2/3;
dram2e0_doutb[63:0] / dram2e1_doutb[127:64] / dram2e2_doutb[191:128] / dram2e3_doutb[255:192] correspond in turn to QB[63:0] of port B of Grp1_ram0/1/2/3 or Grp2_ram0/1/2/3;
WEBB[0]/[1]/[2]/[3] correspond in turn to WEBB of port B of Grp1_ram0/1/2/3 or Grp2_ram0/1/2/3;
CEBB[0]/[1]/[2]/[3] correspond in turn to CEBB of port B of Grp1_ram0/1/2/3 or Grp2_ram0/1/2/3.
(4) The final result of the N-point FFT transform is read.
The final result of the N-point FFT transform is stored in the following locations:
When the total number of FFT transform synthesis stages S = log_4 N is even, the final result of the N-point FFT transform is stored in RAM_Grp1 (comprising the 4 sub-RAMs Grp1_ram0, Grp1_ram1, Grp1_ram2 and Grp1_ram3);
When the total number of FFT transform synthesis stages S = log_4 N is odd, the final result of the N-point FFT transform is stored in RAM_Grp2 (comprising the 4 sub-RAMs Grp2_ram0, Grp2_ram1, Grp2_ram2 and Grp2_ram3);
The addresses from which the final result of the N-point FFT transform is read out of the output buffer RAM_Grp1 or RAM_Grp2 are as follows:
The read address ram_address of Grp1/2_ram0 is 0, 1, …, N/4-1 in sequence, and the value read is stored in the element of lst with index 4 × ram_address.
The read address ram_address of Grp1/2_ram1 is 0, 1, …, N/4-1 in sequence, and the value read is stored in the sequence element with index 4 × ram_address + 1.
The read address ram_address of Grp1/2_ram2 is 0, 1, …, N/4-1 in sequence, and the value read is stored in the sequence element with index 4 × ram_address + 2.
The read address ram_address of Grp1/2_ram3 is 0, 1, …, N/4-1 in sequence, and the value read is stored in the sequence element with index 4 × ram_address + 3.
The final result of the N-point FFT transform is read from the sub-RAMs of the output buffer RAM_Grp1 or RAM_Grp2 in the following order:
The results at address ram_address are read from Grp1/2_ram0, Grp1/2_ram1, Grp1/2_ram2 and Grp1/2_ram3 in turn, then the results at address ram_address + 1 are read from Grp1/2_ram0, Grp1/2_ram1, Grp1/2_ram2 and Grp1/2_ram3 in turn, and so on; reading finishes once the results at address N/4-1 have been read.
In this method, the operation results of the butterfly units of the current stage are stored into the output buffer in the order of the addresses from which the butterfly unit operations executed sequentially by the next stage will read their 4-way data from that buffer. This ensures that the 4-way operands required by the k-th butterfly operation at each stage (k = 0, 1, 2, …, N/4-1) are all taken from address k of the output buffer, which improves the performance of the FFT transform by approximately three times while using the same memory resources.
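To see how these addressing rules fit together, the following software model runs a radix-4 transform through two ping-pong buffer groups and can be compared with a direct DFT. It is a sketch under stated assumptions, not the patented hardware: the reordering of the samples is taken to be base-4 digit reversal (the radix-4 counterpart of bit reversal), the operands of the k-th butterfly operation are read from address k of all four sub-RAMs as stated above, and the twiddle-factor indexing is a derivation made for this example because the text does not specify it. All function names are hypothetical.

```python
import cmath


def digit_reverse(m, digits):
    """Reverse the base-4 digits of m (assumed sample reordering)."""
    r = 0
    for _ in range(digits):
        r = r * 4 + m % 4
        m //= 4
    return r


def fft_radix4_pingpong(samples):
    """Radix-4 DIT FFT modelled on two groups of four sub-RAMs."""
    N = len(samples)
    S = 0
    while 4 ** S < N:
        S += 1
    if 4 ** S != N or S < 2:
        raise ValueError("N must be a power of 4 with N >= 16")
    # grp[0] models RAM_Grp1, grp[1] models RAM_Grp2: 4 sub-RAMs of N/4 words.
    grp = [[[0j] * (N // 4) for _ in range(4)] for _ in range(2)]
    for m in range(N):  # step (1): lst[m] goes to Grp1_ram(m % 4), address m // 4
        grp[0][m % 4][m // 4] = samples[digit_reverse(m, S)]
    for s in range(S):
        src, dst = grp[s % 2], grp[1 - s % 2]  # role switching between stages
        for i in range(N // 4):
            x = [src[j][i] for j in range(4)]  # four operands from address i
            # ASSUMPTION: twiddle exponent derived for this sketch only.
            e = i // 4 ** (S - 1 - s)
            t = [x[r] * cmath.exp(-2j * cmath.pi * r * e / 4 ** (s + 1))
                 for r in range(4)]
            for j in range(4):
                y = sum(t[r] * cmath.exp(-2j * cmath.pi * r * j / 4)
                        for r in range(4))
                # step (3): result j goes to sub-RAM i % 4 at i//4 + j * 4**(S-2)
                dst[i % 4][i // 4 + j * 4 ** (S - 2)] = y
    final = grp[S % 2]  # S even: RAM_Grp1, S odd: RAM_Grp2
    # step (4): sequence index p is read from sub-RAM p % 4 at address p // 4
    return [final[p % 4][p // 4] for p in range(N)]


def naive_dft(samples):
    """Direct O(N^2) DFT used only as a reference."""
    N = len(samples)
    return [sum(samples[n] * cmath.exp(-2j * cmath.pi * n * k / N)
                for n in range(N)) for k in range(N)]
```

In this model every butterfly operation reads one address across four sub-RAMs, and the write rule places the results so that the next stage can again read a single address, which is the property the paragraph above describes.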
The above embodiment takes the radix-4 decimation-in-time FFT algorithm as an example only to illustrate the technical concept and features of the invention, so that those skilled in the art can understand and implement it; it does not limit the scope of protection of the invention. Those skilled in the art can make modifications and variations without departing from the principles of the invention, and such modifications and variations fall within the scope of the invention. In particular, the invention applies equally to radix-2, radix-8, radix-16 and other FFT algorithms, provided that the number of sub-buffers into which the input and output buffers are split is adjusted accordingly. For example, a radix-8 FFT needs the input and output buffers each split into 8 sub-buffers of equal capacity, with the addresses at which each butterfly unit's operation results are stored adjusted accordingly.
In a third aspect, the present invention further provides a computer-readable storage medium storing at least one executable instruction which, when executed on an electronic device, causes the electronic device to perform the operations of the above operation method.
Those skilled in the art will appreciate that the present invention may be implemented as a system, a method, or a computer program product. Accordingly, the present disclosure may be embodied entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.), or in a combination of hardware and software, referred to herein generally as a "circuit", "module" or "system". Furthermore, in some embodiments, the invention may also be embodied in the form of a computer program product in one or more computer-readable media containing computer-readable program code.

Claims (10)

1. A pipeline structure with multi-way parallel input and output for a butterfly unit of a DSP core FFT coprocessor, comprising:
an acquisition module, used for acquiring the sampling points of an N-point radix-2^n decimation-in-time FFT synthesis, wherein N represents the number of sampling points and n is less than or equal to 4;
a preprocessing module, used for performing bit-reversal processing on the sampling points and arranging them in sequence; the N-point radix-2^n decimation-in-time FFT synthesis comprises S stages, each stage comprising N/2^n = 2^(n(S-1)) butterfly operation units; the 2^n-way operands required by the N/4 butterfly unit operations of an even-numbered stage are stored in a first cache unit and the operation results are stored in a second cache unit; the 2^n-way operands required by the N/2^n butterfly unit operations of an odd-numbered stage are stored in the second cache unit and the operation results are stored in the first cache unit, wherein the capacities of the first cache unit and the second cache unit are each N × 64 bits;
a sampling-point caching module, used for sequentially writing the elements of the sequence formed by the N bit-reversed sampling points into the first cache unit in their arrangement order;
an operand acquisition module, used for reading the 2^n-way input operands required by the butterfly operation units of each stage;
a result cache address conversion module, used for writing the 2^n-way operation results of the butterfly units of each stage;
and a final result reading module, used for reading the final result of the N-point radix-2^n decimation-in-time FFT synthesis.
2. An operation method based on the pipeline structure of claim 1, characterized in that:
for the N-point radix-2^n decimation-in-time FFT synthesis, each sampling point is represented by a 64-bit complex number;
after bit reversal, the N sampling points form a sequence whose subscript index is m; when m is divided by 2^n, the quotient is denoted m//2^n and the remainder is denoted m%2^n, with m ∈ [0, N-1];
the elements in the sequence are arranged in the following order:
lst[0], lst[1], lst[2], lst[3], …, lst[N-2], lst[N-1];
the FFT synthesis comprises S stages, each stage comprising N/2^n = 2^(n(S-1)) butterfly units;
the 2^n-way operands required by the N/2^n butterfly unit operations of an even-numbered stage are stored in a first cache unit and the operation results are stored in a second cache unit; the 2^n-way operands required by the N/2^n butterfly unit operations of an odd-numbered stage are stored in the second cache unit and the operation results are stored in the first cache unit, wherein the capacities of the first cache unit and the second cache unit are each N × 64 bits.
3. The operation method according to claim 2, characterized in that: the first cache unit and/or the second cache unit adopts a true dual-port RAM.
4. The operation method according to claim 3, characterized in that: the first cache unit and/or the second cache unit is composed of 2^n cache subunits, each with a capacity of (N/2^n) × 64 bits.
5. The method of operation according to claim 4, wherein: n=2;
the first cache unit and/or the second cache unit consists of 4 cache subunits, each with a capacity of (N/4) × 64 bits; the cache subunits of the first cache unit are Grp1_ram0, Grp1_ram1, Grp1_ram2 and Grp1_ram3 respectively; the cache subunits of the second cache unit are Grp2_ram0, Grp2_ram1, Grp2_ram2 and Grp2_ram3 respectively.
6. The operation method according to claim 5, comprising the step of sequentially writing the elements of the sequence formed by the N bit-reversed sampling points into the first cache unit in their arrangement order, the specific writing order being as follows:
S1, when the subscript index m satisfies m%4 = 0, the corresponding element is written into Grp1_ram0, the write addresses being 0, 1, …, N/4-1 in sequence;
S2, when the subscript index m satisfies m%4 = 1, the corresponding element is written into Grp1_ram1, the write addresses being 0, 1, …, N/4-1 in sequence;
S3, when the subscript index m satisfies m%4 = 2, the corresponding element is written into Grp1_ram2, the write addresses being 0, 1, …, N/4-1 in sequence;
S4, when the subscript index m satisfies m%4 = 3, the corresponding element is written into Grp1_ram3, the write addresses being 0, 1, …, N/4-1 in sequence;
the N sampling points are written into the first cache unit on successive clka clock cycles in the order lst[0], lst[1], lst[2], lst[3], …, lst[N-2], lst[N-1].
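The writing rule of this claim can be illustrated with a small software model. It is a sketch only: the function name `write_samples` and the list-of-lists representation of Grp1_ram0 to Grp1_ram3 are assumptions made for illustration, and the hardware clocking (clka) is not modelled.

```python
def write_samples(lst):
    """Distribute a bit-reversed sample sequence over four sub-buffers.

    Returns grp1, where grp1[b] models Grp1_ram<b> (hypothetical
    representation). Element lst[m] goes to sub-buffer m % 4 at
    address m // 4.
    """
    n_points = len(lst)
    if n_points % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")

    depth = n_points // 4  # each sub-buffer holds N/4 entries
    grp1 = [[None] * depth for _ in range(4)]

    # Samples are written in sequence order: lst[0], lst[1], ..., lst[N-1].
    for m, sample in enumerate(lst):
        grp1[m % 4][m // 4] = sample
    return grp1
```

With 16 samples numbered 0 to 15, Grp1_ram0 ends up holding samples 0, 4, 8, 12 at addresses 0 to 3, Grp1_ram1 holds 1, 5, 9, 13, and so on.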
7. The operation method according to claim 5, comprising the step of reading the 4-way input operands required by the butterfly operation units of each stage, specifically as follows:
the j-th input operand required by the i-th butterfly operation in each stage of the FFT synthesis is read from address i//4 of the corresponding cache subunit, where i//4 denotes the integer quotient of i divided by 4, i ∈ [0, N/4-1], and j denotes the number of the cache subunit that is read, j ∈ [0, 3];
the read addresses of the 4-way input operands required by the N/4 butterfly operations executed sequentially in each stage are 0, 1, …, N/4-1 in sequence in the corresponding 4 cache subunits of that synthesis stage;
the 4 cache subunits from which the 4-way input operands of the butterfly operation unit are read are selected as follows:
when the synthesis stage number s is even, the 4-way input operands are read from Grp1_ram0, Grp1_ram1, Grp1_ram2 and Grp1_ram3;
when the synthesis stage number s is odd, the 4-way input operands are read from Grp2_ram0, Grp2_ram1, Grp2_ram2 and Grp2_ram3;
the stages of the synthesis process are numbered 0, 1, …, S-1, with s ∈ [0, S-1].
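The read side can be modelled as below. This is a hedged sketch, not the claimed circuit: the names `source_group` and `read_operands` and the dictionary of buffers are hypothetical. It follows the statement that the N/4 butterflies of a stage read addresses 0, 1, …, N/4-1 in sequence, so butterfly k is assumed to read address k from each of the four subunits; that is one reading of the text.

```python
def source_group(s: int) -> str:
    """Cache unit that stage s reads its operands from (by stage parity)."""
    return "Grp1" if s % 2 == 0 else "Grp2"


def read_operands(buffers: dict, s: int, k: int) -> list:
    """Return the 4 operands of the k-th butterfly of stage s.

    buffers maps "Grp1"/"Grp2" to a list of four sub-buffers (hypothetical
    model). Assumption: operand j comes from sub-buffer j at address k.
    """
    group = buffers[source_group(s)]
    return [group[j][k] for j in range(4)]
```

Because all four operands share one address, a single address counter can drive the four subunits in parallel in this model.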
8. The operation method according to claim 7, comprising the step of writing the 4-way operation results of the butterfly units of each stage, specifically as follows:
in each stage of the FFT synthesis, the j-th operation result of the i-th butterfly operation is written into the corresponding cache subunit at address i//4 + j × 4^(S-2); that is, i//4 + j × 4^(S-2) is the address at which the j-th operation result output by the i-th butterfly operation is written into cache subunit Grp1_ram(i%4) or Grp2_ram(i%4), i%4 is the number of the cache subunit into which the operation results of the butterfly unit are written, and ^ denotes exponentiation;
the 4 cache subunits into which the 4-way operation results of the butterfly unit are written are selected as follows:
when the synthesis stage number s is even, the 4-way operation results are written into Grp2_ram0, Grp2_ram1, Grp2_ram2 and Grp2_ram3;
when the synthesis stage number s is odd, the 4-way operation results are written into Grp1_ram0, Grp1_ram1, Grp1_ram2 and Grp1_ram3.
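The write-address rule of this claim can be checked with a short model. The function names below are hypothetical, and the sketch assumes N = 4^S with S ≥ 2 so that 4^(S-2) is an integer. The check confirms that, within one stage, the N results land in N distinct (subunit, address) slots, so no result overwrites another.

```python
def dest_group(s: int) -> str:
    """Cache unit that stage s writes its results to (by stage parity)."""
    return "Grp2" if s % 2 == 0 else "Grp1"


def result_location(i: int, j: int, total_stages: int) -> tuple:
    """(sub-buffer number, address) for output j of butterfly i."""
    stride = 4 ** (total_stages - 2)       # 4^(S-2), requires S >= 2
    return i % 4, i // 4 + j * stride


def covers_every_slot_once(total_stages: int) -> bool:
    """True if the N results of one stage map to N distinct slots."""
    n_butterflies = 4 ** (total_stages - 1)  # N/4 butterflies per stage
    slots = {
        result_location(i, j, total_stages)
        for i in range(n_butterflies)
        for j in range(4)
    }
    in_range = all(0 <= addr < n_butterflies for _, addr in slots)
    return in_range and len(slots) == 4 * n_butterflies
```

For a 64-point transform (S = 3), output 2 of butterfly 5 goes to subunit 1 at address 1 + 2 × 4 = 9.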
9. The operation method according to claim 5, comprising the step of reading the final result of the N-point radix-4 decimation-in-time FFT synthesis, specifically as follows:
the final result of the FFT synthesis is stored in the following locations:
when the total number S of FFT synthesis stages is even, the final result of the N-point FFT synthesis is stored in the 4 cache subunits of the first cache unit;
when the total number S of FFT synthesis stages is odd, the final result of the N-point FFT synthesis is stored in the 4 cache subunits of the second cache unit;
the addresses at which the final result of the N-point FFT synthesis is read from the first cache unit or the second cache unit are as follows:
the read address ram_address of Grp1/2_ram0 is 0, 1, …, N/4-1 in sequence, and the read result is stored in the element with sequence index 4 × ram_address;
the read address ram_address of Grp1/2_ram1 is 0, 1, …, N/4-1 in sequence, and the read result is stored in the element with sequence index 4 × ram_address + 1;
the read address ram_address of Grp1/2_ram2 is 0, 1, …, N/4-1 in sequence, and the read result is stored in the element with sequence index 4 × ram_address + 2;
the read address ram_address of Grp1/2_ram3 is 0, 1, …, N/4-1 in sequence, and the read result is stored in the element with sequence index 4 × ram_address + 3;
the final result of the N-point FFT synthesis is read from the cache subunits of the first cache unit or the second cache unit in the following order:
the operation results at address ram_address are read out from Grp1/2_ram0, Grp1/2_ram1, Grp1/2_ram2 and Grp1/2_ram3 in sequence, then the operation results at address ram_address+1 are read out from Grp1/2_ram0, Grp1/2_ram1, Grp1/2_ram2 and Grp1/2_ram3 in sequence, until the results at address N/4-1 have been read out.
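The read-out of the final result can be sketched as follows. The function names and the list-of-lists model of the four subunits are assumptions for illustration; the index rule 4 × ram_address + subunit number and the parity rule for S are taken from the claim.

```python
def final_group(total_stages: int) -> str:
    """Cache unit that holds the final result, chosen by the parity of S."""
    return "Grp1" if total_stages % 2 == 0 else "Grp2"


def read_final_result(sub_buffers: list) -> list:
    """Interleave four sub-buffers into the output sequence.

    sub_buffers[b] models Grpx_ram<b> (hypothetical representation).
    The entry at address a of sub-buffer b becomes element 4*a + b.
    """
    depth = len(sub_buffers[0])            # N/4 entries per sub-buffer
    result = [None] * (4 * depth)
    for ram_address in range(depth):       # addresses 0 .. N/4-1
        for bank in range(4):              # ram0, ram1, ram2, ram3 in turn
            result[4 * ram_address + bank] = sub_buffers[bank][ram_address]
    return result
```

The loop nesting mirrors the stated order: all four subunits are visited at one address before the address advances.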
10. A computer readable storage medium, wherein at least one executable instruction is stored in the storage medium, which when executed on an electronic device, causes the electronic device to perform the operations of the method of any one of claims 2 to 9.
CN202410318071.XA 2024-03-20 2024-03-20 Implementation of pipeline structure of multipath parallel input and output of butterfly unit of DSP (digital Signal processor) core FFT (fast Fourier transform) coprocessor Pending CN118152710A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410318071.XA CN118152710A (en) 2024-03-20 2024-03-20 Implementation of pipeline structure of multipath parallel input and output of butterfly unit of DSP (digital Signal processor) core FFT (fast Fourier transform) coprocessor


Publications (1)

Publication Number Publication Date
CN118152710A true CN118152710A (en) 2024-06-07

Family

ID=91296450

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410318071.XA Pending CN118152710A (en) 2024-03-20 2024-03-20 Implementation of pipeline structure of multipath parallel input and output of butterfly unit of DSP (digital Signal processor) core FFT (fast Fourier transform) coprocessor

Country Status (1)

Country Link
CN (1) CN118152710A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination