WO2017000756A1 - Data processing method based on 3072-point fast Fourier transform, processor, and storage medium - Google Patents


Info

Publication number
WO2017000756A1
WO2017000756A1 PCT/CN2016/085423 CN2016085423W WO2017000756A1 WO 2017000756 A1 WO2017000756 A1 WO 2017000756A1 CN 2016085423 W CN2016085423 W CN 2016085423W WO 2017000756 A1 WO2017000756 A1 WO 2017000756A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
storage module
point
data storage
fourier transform
Application number
PCT/CN2016/085423
Other languages
English (en)
French (fr)
Inventor
刘览
程晨
崔玉姣
张炜
赵艳艳
Original Assignee
深圳市中兴微电子技术有限公司 (Shenzhen ZTE Microelectronics Technology Co., Ltd.)
Application filed by 深圳市中兴微电子技术有限公司
Priority to US15/561,980 (US10152455B2)
Publication of WO2017000756A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/14Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/141Discrete Fourier transforms
    • G06F17/142Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/14Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B3/00Line transmission systems
    • H04B3/54Systems for transmission via power distribution lines
    • H04B3/542Systems for transmission via power distribution lines the information being in digital form

Definitions

  • The present invention relates to the field of power line carrier communication, and in particular to a data processing method based on a 3072-point fast Fourier transform, a processor, and a storage medium.
  • Power Line Communication (PLC) transmits data over high-voltage power lines (in the power line carrier field, usually lines at voltage levels of 35 kV and above), medium-voltage power lines (the 10 kV level), or low-voltage distribution lines (380/220 V subscriber lines).
  • PLC: Power Line Communication
  • HomePlug is a power line communication protocol, defined mainly by a power line carrier technology standardization organization.
  • In the modem (Modem) defined by its physical (PHY) layer, the 3072-point fast Fourier transform (FFT, Fast Fourier Transformation) is one of the core technologies, used to implement the modulation function of Orthogonal Frequency Division Multiplexing (OFDM).
  • OFDM: Orthogonal Frequency Division Multiplexing
  • The main methods for implementing the FFT in the industry are the radix-2 and radix-4 algorithms, for which many mature implementations exist, from software simulation to hardware.
  • FPGA: Field Programmable Gate Array
  • However, these algorithms can only compute Fourier transforms whose length is a power of 2 or a power of 4.
  • Interpolation can be used to resample the original data to a length that is a power of 2 or a power of 4, followed by a radix-2 or radix-4 fast Fourier transform.
  • This causes two main problems: first, interpolation inevitably introduces errors; second, the change in sampling rate increases the complexity of synchronization in an OFDM system.
  • Mixed-radix FFT algorithms are widely used in the industry, mainly the Cooley-Tukey (butterfly) algorithm and the Winograd Fourier Transform Algorithm (WFTA). However, hardware implementations of such algorithms are often complicated: extra storage must be set aside for intermediate results and for data reordering, which increases random access memory (RAM) usage and aggravates congestion in the chip.
  • To raise throughput, current methods mainly increase the parallelism of the computation or use locally asynchronous structures to raise the local processing frequency.
  • Increasing parallelism is a common method, whereas local asynchrony poses great challenges for circuit design and low-power operation; chips that are sensitive to cost and power consumption therefore generally do not use the latter approach.
  • However, increasing the parallelism of the algorithm also increases the complexity of storing intermediate results.
  • A ping-pong storage method has been proposed for computing on the data and accessing intermediate results. Although it reduces the complexity of accessing intermediate results, it uses more RAM, which significantly increases chip area and power consumption for large-point fast Fourier transforms.
  • An embodiment of the present invention provides a data processing method based on a 3072-point fast Fourier transform, the method including:
  • storing 3072 points of data in a data storage module according to a predetermined mapping relationship;
  • reading 16 data in parallel from the data storage module, performing a 3-point discrete Fourier transform (DFT), and storing the result back in place in the data storage module after the operation is completed; and
  • reading 32 data in parallel from the data storage module and performing a 1024-point DFT until the fast Fourier transform of the 3072 points of data is completed.
  • Storing the 3072 points of data in the data storage module according to the predetermined mapping relationship includes:
  • sorting the 3072 points of data and storing them sequentially in the data storage module according to the sorting result;
  • wherein the data storage module consists of 32 RAMs of 96 × 36 (depth in rows × width in bits).
  • Reading 16 data in parallel from the data storage module and performing a 3-point DFT includes:
  • reading 16 data in parallel from the data storage module and performing the 3-point DFT using the Goertzel algorithm.
  • Reading 32 data in parallel from the data storage module and performing a 1024-point DFT includes:
  • reading 32 data in parallel from the data storage module and performing a 10-stage FFT.
  • Reading 32 data in parallel from the data storage module and performing a 1024-point DFT until the fast Fourier transform of the 3072 points of data is completed includes:
  • reading 32 data in parallel from the data storage module and performing the 1024-point DFT using the Cooley-Tukey algorithm until the fast Fourier transform of the 3072 points of data is completed.
  • The processor based on a 3072-point fast Fourier transform provided by the embodiment of the present invention includes:
  • a mapping unit configured to store 3072 points of data in a data storage module according to a predetermined mapping relationship;
  • a 3-point DFT operation unit configured to read 16 data in parallel from the data storage module, perform a 3-point DFT, and store the result back in place in the data storage module after the operation is completed; and
  • a 1024-point DFT operation unit configured to read 32 data in parallel from the data storage module and perform a 1024-point DFT until the fast Fourier transform of the 3072 points of data is completed.
  • The mapping unit is further configured to sort the 3072 points of data according to the Good-Thomas algorithm and store them sequentially in the data storage module according to the sorting result;
  • the data storage module consists of 32 RAMs of 96 × 36 (depth in rows × width in bits).
  • The 3-point DFT operation unit is further configured to read 16 data in parallel from the data storage module and perform the 3-point DFT using the Goertzel algorithm.
  • The 1024-point DFT operation unit is further configured to read 32 data in parallel from the data storage module and perform a 10-stage FFT.
  • The 1024-point DFT operation unit is further configured to read 32 data in parallel from the data storage module and perform the 1024-point DFT using the Cooley-Tukey algorithm until the fast Fourier transform of the 3072 points of data is completed.
  • An embodiment of the present invention also provides a storage medium storing a computer program, the computer program being configured to execute the data processing method based on the 3072-point fast Fourier transform.
  • The 3072 points of data are first input to the data access control module, which stores them in the data storage module according to the Good-Thomas algorithm. After all 3072 points are stored, the data access control module issues a read enable, reads the data from the data storage module, and sends them to the 3-point DFT operation unit. The results are written back to the data storage module by the data access control module. After all 3-point DFTs are finished and their results have been written back, the data access control module issues the read enable again, reads the data from the data storage module, and sends them to the 1024-point DFT operation unit for calculation.
  • The results are again written back to the data storage module by the data access control module.
  • Finally, the data access control module reads the data out of the data storage module in a defined order, completing one 3072-point fast Fourier transform.
  • The technical solution of the embodiments overcomes a defect of the prior art, namely the need for extra storage to cache and reorder the inputs, outputs, and intermediate results of the fast Fourier transform.
  • By considering the algorithm as a whole and combining the different characteristics of several algorithms, the solution reduces the amount of computation while avoiding the large cache that current mixed-radix algorithms often need for intermediate results, striking a balance between resources and performance.
  • The hardware implementation is simple, with small data-buffer consumption, few multipliers, high parallelism, and flexible computation precision.
  • FIG. 1 is a schematic flowchart of a data processing method based on a 3072-point fast Fourier transform according to an embodiment of the present invention.
  • FIG. 2 is a first schematic structural diagram of a processor based on a 3072-point fast Fourier transform according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of the Good-Thomas algorithm according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of the relationship between the input data and the RAM addresses according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a Goertzel operation unit according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a 3-point DFT operation unit according to an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of a butterfly operation according to an embodiment of the present invention.
  • FIG. 8 is a flowchart of a radix-2 1024-point FFT algorithm according to an embodiment of the present invention.
  • FIG. 9 is a schematic diagram of the relationship between data and addresses in the 1024-point DFT operation according to an embodiment of the present invention.
  • FIG. 10 is a schematic diagram of the relationship between data and addresses on write-back after the first stage of the 1024-point DFT according to an embodiment of the present invention.
  • FIG. 11 is a schematic structural diagram of a 1024-point DFT operation unit according to an embodiment of the present invention.
  • FIG. 12 is a schematic diagram of the relationship between the output data and the RAM addresses according to an embodiment of the present invention.
  • FIG. 13 is a second schematic structural diagram of a processor based on a 3072-point fast Fourier transform according to an embodiment of the present invention.
  • FIG. 14 is a third schematic structural diagram of a processor based on a 3072-point fast Fourier transform according to an embodiment of the present invention.
  • FIG. 1 is a schematic flowchart of a data processing method based on a 3072-point fast Fourier transform according to an embodiment of the present invention. The method is applied to a processor based on a 3072-point fast Fourier transform and, as shown in FIG. 1, includes the following steps:
  • Step 101: Store the 3072 points of data in the data storage module according to a predetermined mapping relationship.
  • Specifically, the 3072 points of data are sorted according to the Good-Thomas algorithm and stored sequentially in the data storage module according to the sorting result;
  • the data storage module consists of 32 RAMs of 96 × 36 (depth in rows × width in bits).
  • The processor based on the 3072-point fast Fourier transform has a 3-point DFT operation unit and a 1024-point DFT operation unit.
  • Using the Good-Thomas algorithm, the 3072-point FFT is decomposed into 3-point DFTs and 1024-point DFTs (3072 = 3 × 1024, and the Good-Thomas algorithm applies because 3 and 1024 are coprime).
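The decomposition above can be illustrated with a floating-point reference model. This is a minimal sketch under stated assumptions, not the patented fixed-point hardware: it uses the standard prime-factor (Good-Thomas) index maps with N1 = 3 and N2 = 1024, a textbook recursive radix-2 FFT for the 1024-point stage, and a direct 3-point DFT in place of the Goertzel units. The names `pfa_3072`, `fft_radix2`, `dft3`, and `dft_bin` are hypothetical.

```python
import cmath

N1, N2 = 3, 1024            # 3072 = 3 * 1024 with gcd(3, 1024) = 1
N = N1 * N2


def fft_radix2(x):
    """Textbook recursive radix-2 decimation-in-time FFT (len(x) must be a power of 2)."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft_radix2(x[0::2]), fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def dft3(v):
    """Direct 3-point DFT (stand-in for the Goertzel units)."""
    w = cmath.exp(-2j * cmath.pi / 3)
    return [sum(v[n] * w ** (n * k) for n in range(3)) for k in range(3)]


def pfa_3072(x):
    """Good-Thomas 3072-point DFT: 3-point DFTs, then 1024-point DFTs, no twiddles between."""
    # Input map n = (N2*n1 + N1*n2) mod N: row n1, column n2
    grid = [[x[(N2 * n1 + N1 * n2) % N] for n2 in range(N2)] for n1 in range(N1)]
    # 1024 independent 3-point DFTs along n1
    for n2 in range(N2):
        col = dft3([grid[n1][n2] for n1 in range(N1)])
        for k1 in range(N1):
            grid[k1][n2] = col[k1]
    # 3 independent 1024-point DFTs along n2
    rows = [fft_radix2(grid[k1]) for k1 in range(N1)]
    # Output (CRT) map: k = k1 (mod 3) and k = k2 (mod 1024)
    c1 = N2 * pow(N2, -1, N1)   # 1024
    c2 = N1 * pow(N1, -1, N2)   # 2049
    X = [0j] * N
    for k1 in range(N1):
        for k2 in range(N2):
            X[(c1 * k1 + c2 * k2) % N] = rows[k1][k2]
    return X


def dft_bin(x, k):
    """Direct evaluation of one DFT bin, used only to check pfa_3072."""
    return sum(v * cmath.exp(-2j * cmath.pi * k * n / len(x)) for n, v in enumerate(x))
```

Comparing a few bins of `pfa_3072(x)` against `dft_bin(x, k)` for any 3072-sample list `x` should give differences at floating-point rounding level. `pow(a, -1, m)` (modular inverse) needs Python 3.8 or later.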
  • The processor based on the 3072-point fast Fourier transform further has a data access control module and a data storage module. The data access control module centrally manages the input data, the intermediate results, and the data to be output, and is the core management module for data input, arithmetic processing, and final output.
  • The data storage module stores the input data and the intermediate results and consists of 32 RAMs of 96 × 36 (depth in rows × width in bits).
  • The data access control module maps the storage order of the input data according to the required degree of parallelism. Specifically, the data are arranged according to the Good-Thomas algorithm, and the 3072 points of data are distributed across the data storage module formed by the 32 RAMs of 96 × 36 (depth in rows × width in bits).
  • Step 102: Read 16 data in parallel from the data storage module, perform the 3-point DFT, and store the result back in place in the data storage module after the operation is completed.
  • Specifically, 16 data are read in parallel from the data storage module, and the 3-point DFT is performed using the Goertzel algorithm.
  • The 3-point DFT operation unit that performs this operation includes three complex additions and two complex multiplications.
  • Step 103: Read 32 data in parallel from the data storage module, perform the 1024-point DFT, and store the result back in place in the data storage module after the operation is completed, until the fast Fourier transform of the 3072 points of data is completed.
  • Specifically, 32 data are read in parallel from the data storage module, and a 10-stage FFT is performed.
  • The 1024-point DFT is performed using the Cooley-Tukey algorithm; after each operation the result is stored back in place in the data storage module, until the fast Fourier transform of the 3072 points of data is completed.
  • The 1024-point DFT operation unit of the embodiment includes 16 butterfly operation units and a twiddle factor generation unit, and performs the 1024-point DFT on the read data using the Cooley-Tukey algorithm.
  • Each butterfly operation unit includes one complex multiplication, and each complex multiplication can be decomposed into five real additions and three real multiplications.
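The source does not say which decomposition it uses. A common way to reach exactly three real multiplications and five real additions is the factorization sketched below; treat it as an assumed illustration, with `cmul3` a hypothetical name.

```python
def cmul3(a, b, c, d):
    """Return (a + jb) * (c + jd) as (real, imag) with 3 real multiplications and 5 real additions."""
    k1 = c * (a + b)      # 1 addition, 1 multiplication: ac + bc
    k2 = a * (d - c)      # 1 subtraction, 1 multiplication: ad - ac
    k3 = b * (c + d)      # 1 addition, 1 multiplication: bc + bd
    real = k1 - k3        # ac - bd
    imag = k1 + k2        # ad + bc
    return real, imag
```

For example, `cmul3(1, 2, 3, 4)` returns `(-5, 10)`, the product (1 + 2j)(3 + 4j). When (c + jd) is a fixed twiddle factor, `d - c` and `c + d` can be precomputed, which reduces the run-time additions.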
  • When the first FFT stage on the data is completed, the data access control module must adjust the write-back address before writing the results back to the data storage module. This adjustment is performed only once.
  • In FIG. 10, the columns marked "sequence" in the first row indicate data that, after the first stage, are still written back to their original positions; the columns marked "reverse order" indicate data whose storage order is reversed on write-back after the first stage. For example, the data at address 1 of RAM0 are written to address 1 of RAM31 after the first-stage FFT.
  • Finally, the data access control module reads the data out of the data storage module in the order given by the Good-Thomas algorithm, completing the entire 3072-point fast Fourier transform.
  • By considering the algorithm as a whole and combining the different characteristics of several algorithms, the solution reduces the amount of computation while avoiding the large cache that current mixed-radix algorithms often need for intermediate results, striking a balance between resources and performance. The hardware implementation is also simple, with small data-cache consumption, few multipliers, high parallelism, and flexible computation precision.
  • The Good-Thomas algorithm adopted in the embodiment of the present invention follows the flow shown in FIG. 3.
  • The 3072-point fast Fourier transform is decomposed into 3-point DFTs and 1024-point DFTs.
  • The input data are first rearranged, a DFT is then performed on every group of 3 points, and the results are mapped to the 1024-point DFT operation unit according to the mapping relationship shown in the figure.
  • Compared with using Cooley-Tukey as the top-level algorithm structure, using Good-Thomas as the top-level structure saves 2046 complex multiplications.
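The 2046 figure can be reproduced by counting the twiddle factors that a Cooley-Tukey split of 3072 = 3 × 1024 would insert between its two stages, counting every twiddle other than W^0 = 1 as a complex multiplication. The Good-Thomas index maps remove these twiddles entirely. The function name below is hypothetical.

```python
def nontrivial_twiddles(n1, n2):
    """Count twiddles W_N^(k1*m) != 1 between the stages of an n1 x n2 Cooley-Tukey split."""
    # k1 indexes first-stage outputs and m indexes second-stage inputs. The exponent
    # k1 * m is always below N = n1 * n2, so the twiddle is 1 only when k1 == 0 or m == 0.
    return sum(1 for k1 in range(n1) for m in range(n2) if k1 * m != 0)


# Cooley-Tukey top level for 3072 = 3 x 1024: (3 - 1) * (1024 - 1) twiddle multiplications.
cooley_tukey_extra = nontrivial_twiddles(3, 1024)
```

This simple count treats twiddles such as -j as full multiplications. Hardware could simplify those, so 2046 is an upper-bound style count rather than a measured gate saving.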
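  • The input index mapping of the Good-Thomas algorithm is
$$n = (N_2 n_1 + N_1 n_2) \bmod N, \qquad 0 \le n_1 \le N_1 - 1,\quad 0 \le n_2 \le N_2 - 1, \tag{1}$$
where $N = N_1 N_2 = 3072$, $N_1 = 3$ and $N_2 = 1024$.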
  • When n1 = 0, the first row of input data is stored at addresses 0 to 31 of the 32 RAMs of the data storage module; when n1 = 1, the second row is stored at addresses 32 to 63; and when n1 = 2, the third row is stored at addresses 64 to 95.
  • Within each row of Table 1, the order in which the data are written to the RAM addresses is also bit-reversed by the data access control module.
  • The relationship between the input data and the RAM addresses is shown in FIG. 4.
  • The three blocks in the figure represent the 32 RAMs.
  • In each block, the first row gives the RAM address (96 addresses in total) and the rightmost column gives the RAM number (32 RAMs in total).
  • The number at each address is the sequence number of the input data, with the first input sample numbered 0.
  • For example, the entry 1536 in the third row and second column of the first block indicates that the 1537th input sample is stored at address 0 of RAM1.
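One placement rule consistent with the text can be written out as follows. This is a hypothetical reconstruction because FIG. 4 itself is not available. It assumes equation (1) with N1 = 3 and N2 = 1024, bit reversal of n2 within each row, and that bit-reversed position `pos` lands in RAM `pos % 32` at address `32 * n1 + pos // 32`. This split was chosen because it reproduces the one concrete example, sample 1536 at address 0 of RAM1. The names `bitrev`, `build_input_map`, and `input_map` are illustrative.

```python
N1, N2, NUM_RAMS = 3, 1024, 32
ADDRS_PER_ROW = N2 // NUM_RAMS        # 32 addresses per Good-Thomas row, 96 in total


def bitrev(value, bits=10):
    """Reverse the lowest `bits` bits of value (10 bits for a 1024-point row)."""
    out = 0
    for _ in range(bits):
        out = (out << 1) | (value & 1)
        value >>= 1
    return out


def build_input_map():
    """Map each input sequence number n to an assumed (ram, address) location."""
    location = {}
    for n1 in range(N1):
        for n2 in range(N2):
            n = (N2 * n1 + N1 * n2) % (N1 * N2)          # equation (1)
            pos = bitrev(n2)                               # bit-reversed order within the row
            ram = pos % NUM_RAMS                           # assumed: consecutive positions, consecutive RAMs
            addr = n1 * ADDRS_PER_ROW + pos // NUM_RAMS    # row n1 uses addresses 32*n1 .. 32*n1 + 31
            location[n] = (ram, addr)
    return location


input_map = build_input_map()
```

With this rule, `input_map[1536]` is `(1, 0)`, and samples 0, 1024, and 2048 all land in RAM0 (addresses 0, 32, and 64). Those are the three operands the text later assigns to 3-point DFT unit No. 0.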
  • Once the data are stored, the 3-point DFTs are started; 1024 of them are performed in total. Because the 3-point DFT operation unit of the present invention contains 16 parallel 3-point DFT units, each round performs 16 3-point DFTs, so the 1024 3-point DFTs take 64 rounds.
  • The data access control unit fetches data alternately (ping-pong) from RAMs 0 to 15 and RAMs 16 to 31 of the 32 RAMs in the data storage unit and sends them to the 3-point DFT operation unit for calculation; the results are written back in place to the corresponding RAM addresses. Note that, for the Goertzel algorithm, the data are fetched from the highest address to the lowest.
  • The present invention computes the 3-point DFT with the Goertzel algorithm. The 3-point DFT is
$$X(k) = \sum_{n=0}^{2} x(n) W_3^{nk}, \qquad W_3 = e^{-j 2\pi/3}, \quad k = 0, 1, 2. \tag{2}$$
  • Equation (2) can be transformed into the recursive form of equation (4), where y(n) is the result obtained at each step, y(n) = x(n) + REG, x(n) and REG are the two intermediate quantities, and REG is the value held in the register.
  • The 3-point DFT operation unit uses three Goertzel units operating in parallel, which triples the processing speed; its structure is shown in FIG. 6.
  • For example, 3-point DFT unit No. 0 takes the data with sequence numbers 2048, 1024, and 0, in that order, from RAM0 of the data storage module and feeds them into the unit.
  • The results X(0), X(1), and X(2) are written back to the locations of data 0, 1024, and 2048 in RAM0, respectively, achieving in-place computation.
  • The other 3-point DFT units and rounds proceed in the same way: the data access control unit fetches the data from the data storage unit, and the results are written back in place.
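Equations (3) and (4) are not reproduced in the extracted text. The sketch below assumes the common first-order Goertzel form evaluated by Horner's rule. That form is consistent with y(n) = x(n) + REG, with the high-to-low fetch order, and with the two complex multiplications per output mentioned in the text. It models the arithmetic only, in floating point. The names `goertzel_bin` and `dft3_goertzel` are hypothetical.

```python
import cmath

W3 = cmath.exp(-2j * cmath.pi / 3)      # W_3 = e^{-j 2 pi / 3}


def goertzel_bin(samples_high_to_low, k):
    """One Goertzel unit: X(k) = x(0) + x(1) W_3^k + x(2) W_3^{2k}, by Horner's rule.

    Samples arrive highest index first (x(2), x(1), x(0)), so the three
    iterations use 2 complex multiplications and 2 complex additions.
    """
    w = W3 ** k                          # twiddle factor supplied to this unit
    y = samples_high_to_low[0]           # iteration 1: REG is zero, so y = x(2)
    for x in samples_high_to_low[1:]:
        reg = w * y                      # REG <- W_3^k * y(previous)
        y = x + reg                      # y(n) = x(n) + REG
    return y


def dft3_goertzel(x0, x1, x2):
    """A 3-point DFT unit: three Goertzel units in parallel, one per X(0), X(1), X(2)."""
    ordered = [x2, x1, x0]               # high-to-low fetch order
    return [goertzel_bin(ordered, k) for k in range(3)]
```

For k = 0 the twiddle is 1, so hardware can skip that multiplication. The other two units need W_3 and W_3^2.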
  • In addition to the input value x, each Goertzel unit must be supplied with the corresponding twiddle factor. The arithmetic unit takes 12-bit input data (1 sign bit, 2 integer bits, 9 fractional bits) and, after 3 iterations (see FIG. 5), outputs 13-bit data (1 sign bit, 3 integer bits, 9 fractional bits), performing 2 complex multiplications and 2 complex additions.
  • During the last iteration, the output data must be protected against overflow.
  • Specifically, the bits to be discarded are first checked. If they are all "0" or all "1", there is no overflow, and the truncated 13-bit result is output according to the rule above. Otherwise the value needs more than 13 bits, which means an overflow has occurred; the most significant discarded bit is then examined: if it is "0", the output is 13'b0111111111111; if it is "1", the output is 13'b1000000000000.
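The overflow rule can be sketched on plain Python integers, assuming each fixed-point value is held as a two's-complement integer scaled by 2^9 (1 sign, 3 integer, and 9 fractional bits in the 13-bit result). The check below tests the discarded high bits together with the retained sign bit. That is the condition under which dropping them is lossless, and it is assumed to be what the text's "all 0 or all 1" test intends. `saturate` and `to_bits` are hypothetical names.

```python
OUT_BITS = 13                              # 1 sign bit, 3 integer bits, 9 fractional bits
MAX_POS = (1 << (OUT_BITS - 1)) - 1        # 13'b0111111111111
MIN_NEG = -(1 << (OUT_BITS - 1))           # 13'b1000000000000


def saturate(wide):
    """Narrow a wider two's-complement integer to OUT_BITS bits, saturating on overflow."""
    # Arithmetic right shift keeps the discarded high bits plus the retained sign bit.
    top = wide >> (OUT_BITS - 1)
    if top in (0, -1):                     # all '0' or all '1': the value fits, keep it
        return wide
    # Overflow: the sign of the wide value (its highest discarded bit) selects the rail.
    return MIN_NEG if wide < 0 else MAX_POS


def to_bits(value):
    """Format a 13-bit result the way the text writes it, e.g. 13'b0111111111111."""
    return "13'b" + format(value & ((1 << OUT_BITS) - 1), f"0{OUT_BITS}b")
```

For example, `to_bits(saturate(10**6))` gives `13'b0111111111111`, and `to_bits(saturate(-10**6))` gives `13'b1000000000000`.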
  • Following the Good-Thomas algorithm, a 1024-point DFT is then performed on each row of Table 1, three rounds in total.
  • The present invention uses the Cooley-Tukey decimation-in-time radix-2 FFT algorithm, whose flow is shown in FIG. 8.
  • The 1024-point DFT requires 10 stages of operation, with 512 butterfly operations per stage and 5120 butterfly operations in total.
  • The part enclosed by the dotted line in FIG. 7 is one butterfly operation unit, which performs the butterfly operation shown in FIG. 8.
  • The butterfly adds the product of the data on input port 2 and the twiddle factor to the data on input port 1, and also subtracts that product from it.
  • Each butterfly operation contains one complex multiplication, so the whole 1024-point DFT requires 5120 complex multiplications. The embodiment uses a hardware structure of 16 butterflies computing in parallel, so each stage of the 1024-point DFT takes 32 operations and the entire 1024-point DFT takes 320 operations.
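The butterfly and the stage structure can be modelled in floating point as below. This is a sketch of the standard in-place radix-2 decimation-in-time FFT with bit-reversed input, not the fixed-point pipelined hardware. Twiddles come from `cmath.exp` rather than a twiddle factor generation unit, and the names are hypothetical. The constants at the end reproduce the operation counts quoted in the text.

```python
import cmath


def butterfly(a, b, tw):
    """Radix-2 DIT butterfly: port 1 (a) plus/minus port 2 (b) times the twiddle factor."""
    t = tw * b                       # the single complex multiplication
    return a + t, a - t


def fft_dit_inplace(data):
    """In-place radix-2 DIT FFT; `data` must already be in bit-reversed order."""
    n = len(data)
    half = 1
    while half < n:                  # one pass per stage: 10 stages when n = 1024
        span = 2 * half
        for start in range(0, n, span):
            for j in range(half):
                tw = cmath.exp(-2j * cmath.pi * j / span)
                p, q = start + j, start + j + half
                data[p], data[q] = butterfly(data[p], data[q], tw)
        half = span
    return data


# Operation counts for n = 1024 with 16 parallel butterfly units
STAGES = (1024).bit_length() - 1                       # 10
BUTTERFLIES_PER_STAGE = 1024 // 2                      # 512
TOTAL_BUTTERFLIES = STAGES * BUTTERFLIES_PER_STAGE     # 5120, one complex multiplication each
ROUNDS_PER_STAGE = BUTTERFLIES_PER_STAGE // 16         # 32
TOTAL_ROUNDS = STAGES * ROUNDS_PER_STAGE               # 320
```

Because each butterfly writes its two results back to the positions it read from, every stage after the first can run in place. This matches the in-place storage scheme described in the text.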
  • The 1024-point DFT is performed by the 1024-point DFT operation unit, which contains 16 butterfly operation units and a twiddle factor generation unit that produces the twiddle factors needed by the operation (as shown in FIG. 11). Each butterfly operation unit performs the butterfly operation shown in FIG. 8, where the inputs X1(k) and X2(k) enter ports a and b of the butterfly operation unit (see the butterfly operation unit in FIG. 11), and port tw receives the twiddle factor required for each butterfly operation, which is produced by the twiddle factor generation unit.
  • The butterfly operation unit is fully pipelined: a butterfly operation completes in 4 clock cycles (clk), and one stage of the operation takes 32 clock cycles. Because the 1024-point DFT requires 10 stages, the result of each stage must be truncated in order to save as much hardware storage as possible without compromising performance. The truncation and fixed-point format of each stage are shown in Table 3.
  • The butterfly unit must also apply overflow protection to its results, in a way similar to the Goertzel operation unit: it checks whether the discarded bits are all "0" or all "1" and adjusts the output data according to the result.
  • The 1024-point fast Fourier transform is performed on the 3072 points of data in three rounds.
  • Taking the first round of 1024-point DFT as an example, the following describes how the data access control module accesses the intermediate results.
  • The data for the first round of 1024-point DFT are stored at addresses 0 to 31 of RAM0 to RAM31 in the data storage unit, 1024 data in total. For convenience of description, the relationship between data and addresses is shown in FIG. 9, where the data stored at each address are the results written back to that address after the 3-point DFT.
  • In each clock cycle, the data access control unit reads data from the 32 RAMs of the data storage unit and sends them to the 1024-point DFT operation unit for calculation.
  • The 1024-point DFT requires 10 stages in total, and each stage requires 32 operations.
  • Each operation is performed in parallel by 16 butterfly units; that is, 32 data are read from the RAMs and processed in each clock cycle. As long as no address conflict occurs, the whole computation can be pipelined.
  • After the first stage of the first 1024-point fast Fourier transform is completed, the storage addresses must be adjusted when the data are written back.
  • In FIG. 10, the columns marked "sequence" in the first row indicate data that, after the first stage, are still written back to their original positions.
  • The columns marked "reverse order" indicate data whose storage order is reversed after the first-stage operation.
  • For example, the data at address 1 of RAM0 are written to address 1 of RAM31 after the first-stage FFT.
  • This write-back address adjustment is performed only once, after the first stage of the 1024-point DFT; all subsequent stages read and write the RAMs strictly in place. The adjustment is needed because the written-back data would otherwise remain stored according to the original correspondence between input data and RAM addresses.
  • That layout would cause address conflicts and reduce the parallelism of the operation. Because the adjustment happens only during write-back, it does not interrupt the pipelined processing of the algorithm.
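Why an adjustment is needed, and one rule that would work, can be checked with a small model. Several parts of it are assumptions. FIG. 10 is not available, so the rule in `skewed_ram` (mirror r to 31 - r on addresses with an odd number of 1 bits) is hypothetical. It was chosen only because it reproduces the single example given, RAM0 address 1 to RAM31 address 1. The model also assumes that position `pos` of a row sits in RAM `pos % 32` at address `pos // 32`, and it assumes the per-clock grouping of 16 butterflies in `clock_groups`.

```python
NUM_RAMS, N = 32, 1024          # one Good-Thomas row: 1024 positions in 32 RAMs x 32 addresses


def plain_ram(pos):
    """RAM index before the adjustment (assumed: position pos lives in RAM pos % 32)."""
    return pos % NUM_RAMS


def skewed_ram(pos):
    """Hypothetical adjusted layout: mirror the RAM index (r -> 31 - r) on odd-parity addresses."""
    r, addr = pos % NUM_RAMS, pos // NUM_RAMS
    return NUM_RAMS - 1 - r if bin(addr).count("1") % 2 else r


def clock_groups(stage):
    """Operand positions read in one clock: 16 butterflies (32 operands), assumed schedule."""
    half = 1 << stage                       # butterfly span at this DIT stage
    groups = {}
    for p in range(N):
        if p & half:                        # second operand of a pair; added with its partner
            continue
        if half < NUM_RAMS:                 # short span: both operands share an address
            key = p // NUM_RAMS
        else:                               # long span: split each address pair over two clocks
            key = (p // NUM_RAMS, (p % NUM_RAMS) // 16)
        groups.setdefault(key, []).extend([p, p + half])
    return list(groups.values())


def conflict_free(ram_of, stage):
    """True if every clock's 32 operands sit in 32 distinct RAMs."""
    return all(len({ram_of(p) for p in g}) == len(g) for g in clock_groups(stage))
```

In this model the unadjusted layout is conflict-free only for the first five stages (spans 1 to 16). In the last five stages, both operands of every butterfly fall in the same RAM. After the data are rewritten once with the mirrored layout, all ten stages can read 32 operands from 32 different RAMs and write them back in place.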
  • In each stage, the data are read from the RAMs of the data storage unit, fed into the 1024-point DFT module, and written back in the order shown in the figure.
  • All subsequent stages can be computed in place. This keeps the data storage of the entire operation simple and saves a great deal of buffer space: the whole 3072-point fast Fourier transform uses only 3072 storage locations, minimizing RAM usage.
  • In FIG. 12, the first row of each block gives the RAM address (96 addresses in total) and the rightmost column gives the RAM number (32 RAMs in total).
  • The number at each address is the sequence number of the output data, with the first output numbered 0.
  • For example, the entry 2049 in the third row and second column of the first block indicates that the 2050th output is stored at address 0 of RAM1.
  • The data access control module reads the data out of the data storage unit in order according to the address-to-data mapping shown in the figure, completing the 3072-point fast Fourier transform.
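The output order follows from the Chinese Remainder Theorem map of the Good-Thomas algorithm: output k satisfies k ≡ k1 (mod 3) and k ≡ k2 (mod 1024). The sketch assumes that each 1024-point DIT FFT, fed bit-reversed input, leaves X(k2) at natural-order position k2. It also assumes position `pos` of row k1 sits in RAM `pos % 32` at address `32 * k1 + pos // 32`. It ignores any RAM reordering introduced by the first-stage write-back adjustment, so it describes only the logical placement. The names are hypothetical.

```python
N1, N2, NUM_RAMS = 3, 1024, 32
ADDRS_PER_ROW = N2 // NUM_RAMS          # 32


def output_index(k1, k2):
    """CRT output map of the Good-Thomas algorithm: k = k1 (mod 3) and k = k2 (mod 1024)."""
    c1 = N2 * pow(N2, -1, N1)           # 1024: 1 (mod 3), 0 (mod 1024)
    c2 = N1 * pow(N1, -1, N2)           # 2049: 0 (mod 3), 1 (mod 1024)
    return (c1 * k1 + c2 * k2) % (N1 * N2)


def build_output_map():
    """Map each assumed (ram, address) location to the output sequence number stored there."""
    table = {}
    for k1 in range(N1):                # one Good-Thomas row per 1024-point DFT
        for k2 in range(N2):            # assumed: natural-order result at position k2
            table[(k2 % NUM_RAMS, k1 * ADDRS_PER_ROW + k2 // NUM_RAMS)] = output_index(k1, k2)
    return table


output_map = build_output_map()
```

Under these assumptions, `output_map[(1, 0)]` is 2049, which matches the FIG. 12 example of output 2049 at address 0 of RAM1. Every output number from 0 to 3071 appears exactly once.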
  • The processor based on the 3072-point fast Fourier transform provided by the embodiment of the present invention is configured as shown in FIG. 13 and includes:
  • a data storage module consisting of 32 RAMs of 96 × 36 (depth in rows × width in bits);
  • a data access control module responsible for reading and writing data;
  • a 3-point DFT unit consisting of 16 groups of 3 parallel Goertzel units; and
  • a 1024-point DFT operation unit consisting of 16 butterfly units operating in parallel and a twiddle factor generation unit.
  • FIG. 14 is a third structural diagram of a processor based on a 3072-point fast Fourier transform according to an embodiment of the present invention. As shown in FIG. 14, the processor includes:
  • the mapping unit 11 is configured to store 3072 points of data in a data storage module according to a predetermined mapping relationship
  • the 3-point DFT operation unit 12 is configured to read 16 data in parallel from the storage module, perform a 3-point DFT operation, and store the result in the data storage module in place after the operation is completed;
  • the 1024-point DFT operation unit 13 is configured to read 32 data in parallel from the data storage module and perform a 1024-point DFT operation until the fast Fourier transform of the 3072 point data is completed.
  • the mapping unit 11 is further configured to sort 3072 points of data according to the Good-Thomas algorithm, and sequentially store 3072 points of data into the data storage module according to the sorting result;
  • the data storage module consists of 32 96 ⁇ 36 (depth line number ⁇ bit width bit) RAM composition.
  • the 3-point DFT operation unit 12 is further configured to be from the storage module.
  • the block reads 16 data in parallel and uses the Goertzel algorithm to perform a 3-point DFT operation.
  • the 1024-point DFT operation unit 13 is further configured to read 32 data from the data storage module in parallel and perform a 10-level FFT operation.
  • the 1024-point DFT operation unit 13 is further configured to read 32 data in parallel from the data storage module, and perform a 1024-point DFT operation using the Cooley-Tukey algorithm until the 3072 point data is completed. Fourier transform.
  • each unit in the 3072-point fast Fourier transform-based processor shown in FIG. 14 can be understood by referring to the foregoing description of the data processing method based on 3072-point fast Fourier transform.
  • the functions of the units in the 3072-point fast Fourier transform-based processor shown in FIG. 14 can be implemented by a program running on a processor, or can be realized by a specific logic circuit.
  • the mapping unit 11, the 3-point DFT operation unit 12 and the 1024-point DFT operation unit 13 may be implemented by a central processing unit (CPU), a digital signal processor (DSP) or a field-programmable gate array (FPGA).
  • the embodiment of the invention further describes a storage medium in which a computer program is stored, the computer program being configured to execute the data processing method based on the 3072 point fast Fourier transform of the foregoing embodiments.
  • the disclosed method and smart device may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of the unit is only a logical function division.
  • other divisions are possible: for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed.
  • the coupling, direct coupling or communication connection between the components shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
  • the units described above as separate components may or may not be physically separated, and the components displayed as the unit may or may not be physical units, that is, may be located in one place or distributed to multiple network units; Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one second processing unit, or each unit may be separately used as one unit, or two or more units may be integrated into one unit;
  • the above integrated unit can be implemented in the form of hardware or in the form of hardware plus software functional units.
  • the invention optimizes the design as a whole by selecting multiple algorithms and combining their different characteristics, which both reduces the amount of computation and avoids the large buffers for intermediate results that current hybrid algorithms often require.
  • a balance of resources and performance has been achieved.
  • the hardware implementation is simple, the data buffer consumption is small, the multiplier unit is small, the operation parallelism is high, and the operation precision is flexible.

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Discrete Mathematics (AREA)
  • Power Engineering (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Complex Calculations (AREA)

Abstract

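A data processing method and processor based on a 3072-point fast Fourier transform, comprising: storing 3072 points of data in a data storage module according to a predetermined mapping relationship (101); reading 16 data in parallel from the storage module, performing a 3-point DFT operation, and, after the operation is completed, storing the results in place in the data storage module (102); reading 32 data in parallel from the data storage module, performing a 1024-point DFT operation, and, after the operation is completed, storing the results in place in the data storage module, until the fast Fourier transform of the 3072 points of data is completed (103).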

Description

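Data processing method, processor and storage medium based on 3072-point fast Fourier transform
Technical Field
The present invention relates to the field of power line carrier communication, and in particular to a data processing method, a processor and a storage medium based on a 3072-point fast Fourier transform.
Background
Power line carrier communication (PLC, Power Line Communication) is a special communication mode that uses high-voltage power lines (in the power line carrier field, usually lines at 35 kV and above), medium-voltage power lines (lines at 10 kV) or low-voltage distribution lines (380/220 V subscriber lines) as the transmission medium for voice or data. Its greatest advantage is that no new network has to be built: wherever there are power wires, data can be transmitted.
HomePlug is a protocol for power line communication promoted mainly by power line carrier standardization organizations. The modem defined in its physical (PHY) layer uses a 3072-point Fast Fourier Transform (FFT) to implement Orthogonal Frequency Division Multiplexing (OFDM) modulation, which is one of its core technologies.
At present, the main methods used in industry to implement the FFT are the radix-2 and radix-4 algorithms. Many mature methods exist for them, from software simulation to hardware implementation, and engineering practice offers many corresponding processors and Field Programmable Gate Array (FPGA) IP cores. However, these algorithms can only compute Fourier transforms whose length is a power of 2 or a power of 4. Data whose length is not a power of 2 or 4 can be interpolated to such a length before radix-2 or radix-4 FFT processing, but this causes two main problems. First, interpolation inevitably introduces error. Second, because the sampling rate changes, the synchronization complexity of an OFDM system increases. For the FFT of data whose length is not a power of 2 or 4, if the length is a composite number, the industry generally uses mixed-radix FFT algorithms, mainly the Cooley-Tukey (butterfly) algorithm and the Winograd Fourier Transform Algorithm (WFTA). The hardware implementation of such algorithms, however, is often complex. More storage must be allocated for intermediate results and for data reordering, which increases Random Access Memory (RAM) usage and makes routing congestion in the chip more severe. To improve FFT performance, the main approaches at present are to increase the parallelism of the computation and to use a locally asynchronous structure, i.e. to raise the local clock frequency. Increasing parallelism is the more common approach. Local asynchrony poses considerable challenges for circuit design and low-power operation, so chips that are sensitive to cost and power generally avoid it. Increasing algorithmic parallelism, in turn, makes the storage of intermediate results more complex. Data access conflicts during computation can cause pipeline bubbles and the resulting performance loss. To avoid them, a ping-pong storage scheme has been proposed for computing and accessing intermediate results. Although this reduces the complexity of intermediate-result access, it uses more RAM, which significantly increases chip area and power consumption for large FFTs.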
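Summary
To solve the above technical problems, an embodiment of the present invention provides a data processing method based on a 3072-point fast Fourier transform, the method including:
storing 3072 points of data in a data storage module according to a predetermined mapping relationship;
reading 16 data in parallel from the storage module, performing a 3-point Discrete Fourier Transform (DFT) operation, and, after the operation is completed, storing the results in place in the data storage module;
reading 32 data in parallel from the data storage module and performing a 1024-point DFT operation, until the fast Fourier transform of the 3072 points of data is completed.
In an embodiment of the present invention, storing the 3072 points of data in the data storage module according to the predetermined mapping relationship includes:
sorting the 3072 points of data according to the Good-Thomas algorithm and storing them sequentially in the data storage module according to the sorting result, where the data storage module consists of 32 RAMs of 96 × 36 (depth in rows × width in bits).
In an embodiment of the present invention, reading 16 data in parallel from the storage module and performing the 3-point DFT operation includes:
reading 16 data in parallel from the storage module and performing the 3-point DFT operation using the Goertzel algorithm.
In an embodiment of the present invention, reading 32 data in parallel from the data storage module and performing the 1024-point DFT operation includes:
reading 32 data in parallel from the data storage module and performing a 10-stage FFT operation.
In an embodiment of the present invention, reading 32 data in parallel from the data storage module and performing the 1024-point DFT operation until the fast Fourier transform of the 3072 points of data is completed includes:
reading 32 data in parallel from the data storage module and performing the 1024-point DFT operation using the Cooley-Tukey algorithm, until the fast Fourier transform of the 3072 points of data is completed.
The processor based on a 3072-point fast Fourier transform provided by an embodiment of the present invention includes:
a mapping unit, configured to store 3072 points of data in a data storage module according to a predetermined mapping relationship;
a 3-point DFT operation unit, configured to read 16 data in parallel from the storage module, perform a 3-point DFT operation, and, after the operation is completed, store the results in place in the data storage module;
a 1024-point DFT operation unit, configured to read 32 data in parallel from the data storage module and perform a 1024-point DFT operation, until the fast Fourier transform of the 3072 points of data is completed.
In an embodiment of the present invention, the mapping unit is further configured to sort the 3072 points of data according to the Good-Thomas algorithm and store them sequentially in the data storage module according to the sorting result, where the data storage module consists of 32 RAMs of 96 × 36 (depth in rows × width in bits).
In an embodiment of the present invention, the 3-point DFT operation unit is further configured to read 16 data in parallel from the storage module and perform the 3-point DFT operation using the Goertzel algorithm.
In an embodiment of the present invention, the 1024-point DFT operation unit is further configured to read 32 data in parallel from the data storage module and perform a 10-stage FFT operation.
In an embodiment of the present invention, the 1024-point DFT operation unit is further configured to read 32 data in parallel from the data storage module and perform the 1024-point DFT operation using the Cooley-Tukey algorithm, until the fast Fourier transform of the 3072 points of data is completed.
An embodiment of the present invention provides a storage medium storing a computer program, the computer program being configured to execute the above data processing method based on a 3072-point fast Fourier transform.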
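In the technical solution of the embodiments of the present invention, the 3072 points of data are first input in sequence to the data access control module. According to the Good-Thomas algorithm, this module distributes them across the data storage module in a specific bit-order relationship. Once all 3072 points have been stored, the data access control module issues a read enable, fetches data from the data storage module and feeds them to the 3-point DFT operation unit. It then writes the results back to the data storage module. When all 3-point DFT operations are complete and all data have been written back, the data access control module again issues a read enable and reads data from the data storage module into the 1024-point DFT operation unit. It again writes the results back to the data storage module. After the 1024-point DFT operations are complete, the data access control module reads the data stored in the data storage module in a specific order and outputs them, completing one full 3072-point fast Fourier transform. The technical solution of the embodiments overcomes a problem in the prior art: a fast Fourier transform needs a large amount of storage to buffer and reorder the input, the output and the intermediate results. The solution selects multiple algorithms and combines their different characteristics to optimize the design as a whole. This both reduces the amount of computation and avoids the large buffers for intermediate results that current mixed algorithms often require, achieving a balance between resources and performance. In addition, the hardware implementation is simple, data buffer consumption is small, few multiplier units are needed, the degree of parallelism is high, and the computation precision is flexible.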
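Brief Description of the Drawings
FIG. 1 is a schematic flowchart of the data processing method based on a 3072-point fast Fourier transform according to an embodiment of the present invention;
FIG. 2 is a first schematic structural diagram of the processor based on a 3072-point fast Fourier transform according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the Good-Thomas algorithm according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the relationship between input data and RAM addresses according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the Goertzel algorithm operation unit according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the 3-point DFT operation unit according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of the butterfly operation according to an embodiment of the present invention;
FIG. 8 is a flow graph of the radix-2 1024-point FFT algorithm according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of the relationship between data and addresses for the 1024-point DFT operation according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of the relationship between data written back after the first stage of the 1024-point DFT and addresses according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of the 1024-point DFT operation unit according to an embodiment of the present invention;
FIG. 12 shows the relationship between output data and RAM addresses according to an embodiment of the present invention;
FIG. 13 is a second schematic structural diagram of the processor based on a 3072-point fast Fourier transform according to an embodiment of the present invention;
FIG. 14 is a third schematic structural diagram of the processor based on a 3072-point fast Fourier transform according to an embodiment of the present invention.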
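Detailed Description
The implementation of the embodiments of the present invention is described in detail below with reference to the accompanying drawings, to give a fuller understanding of their features and technical content. The drawings are provided for reference and illustration only and are not intended to limit the embodiments of the present invention.
FIG. 1 is a schematic flowchart of the data processing method based on a 3072-point fast Fourier transform according to an embodiment of the present invention. The method is applied to a processor based on a 3072-point fast Fourier transform. As shown in FIG. 1, the method includes the following steps:
Step 101: store 3072 points of data in a data storage module according to a predetermined mapping relationship.
In an embodiment of the present invention, the 3072 points of data are sorted according to the Good-Thomas algorithm and stored sequentially in the data storage module according to the sorting result. The data storage module consists of 32 RAMs of 96 × 36 (depth in rows × width in bits).
In an embodiment of the present invention, the processor based on a 3072-point fast Fourier transform has a 3-point DFT operation unit and a 1024-point DFT operation unit.
The embodiment of the present invention uses the Good-Thomas algorithm to decompose the 3072-point FFT into 3-point DFT operations and 1024-point DFT operations.
In an embodiment of the present invention, the processor based on a 3072-point fast Fourier transform further has a data access control module and a data storage module. The main function of the data access control module is the overall management of the input data, the intermediate results and the data to be output. It is the core management module for data input, computation and final output. The main function of the data storage module is to store the input data and the intermediate results. It consists of 32 RAMs of 96 × 36 (depth in rows × width in bits).
In an embodiment of the present invention, the data access control module maps the storage order of all input data according to the parallelism requirements. Specifically, the data are arranged according to the Good-Thomas algorithm, and the 3072 points of data are distributed across the data storage module composed of 32 RAMs of 96 × 36 (depth in rows × width in bits).
Step 102: read 16 data in parallel from the storage module, perform a 3-point DFT operation, and, after the operation is completed, store the results in place in the data storage module.
In an embodiment of the present invention, 16 data are read in parallel from the storage module and the 3-point DFT operation is performed using the Goertzel algorithm.
The formula used in the embodiment of the present invention for the 3-point DFT operation with the Goertzel algorithm is as follows: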
$$X(k)=\sum_{n=0}^{2} x(n)\,W_3^{nk},\qquad k=0,1,2,\qquad W_3=e^{-j2\pi/3}$$
where x(n) is the input data, X(k) is the result of the operation, and W is the butterfly twiddle factor.
Transforming the above formula gives:
Figure PCTCN2016085423-appb-000002
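The formula shows that the 3-point DFT operation unit requires 3 complex additions and 2 complex multiplications.
Step 103: read 32 data in parallel from the data storage module, perform a 1024-point DFT operation, and, after the operation is completed, store the results in place in the data storage module, until the fast Fourier transform of the 3072 points of data is completed.
In an embodiment of the present invention, 32 data are read in parallel from the data storage module and a 10-stage FFT operation is performed. Specifically, the 1024-point DFT is computed using the Cooley-Tukey algorithm. After the operation, the results are stored in place in the data storage module, until the fast Fourier transform of the 3072 points of data is completed.
The 1024-point DFT operation unit of the embodiment of the present invention contains 16 butterfly operation units and a twiddle factor generation unit. It uses the Cooley-Tukey algorithm to perform the 1024-point DFT on the data read.
In an embodiment of the present invention, each butterfly operation unit contains one complex multiplication, and each complex multiplication can in turn be decomposed into 5 real additions and 3 real multiplications.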
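The sketch below shows one well-known way to reach the count of 3 real multiplications and 5 real additions for the complex product (a + jb)(c + jd), written in Python. The patent does not give its exact factorization, so this arrangement and the function name `cmul_3mult` are assumptions for illustration:

```python
from __future__ import annotations


def cmul_3mult(a: float, b: float, c: float, d: float) -> tuple[float, float]:
    """Return (a + jb) * (c + jd) as (real, imag) using 3 real multiplications
    and 5 real additions/subtractions."""
    k1 = c * (a + b)  # multiplication 1, addition 1
    k2 = a * (d - c)  # multiplication 2, addition 2
    k3 = b * (c + d)  # multiplication 3, addition 3
    real = k1 - k3    # addition 4: (ac + bc) - (bc + bd) = ac - bd
    imag = k1 + k2    # addition 5: (ac + bc) + (ad - ac) = ad + bc
    return real, imag


print(cmul_3mult(1, 2, 3, 4))  # (1 + 2j)(3 + 4j) = -5 + 10j -> (-5, 10)
```

Twiddle factors are constants, so c + d and d - c could also be precomputed by the twiddle factor generator, leaving 3 additions at run time. The text does not say whether the hardware described here does this.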
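In an embodiment of the present invention, data address conflicts would impair parallelism. To avoid them, the data access control module adjusts the write-back addresses when it writes data back to the data storage module after the first FFT stage; this adjustment is needed only once. FIG. 10 illustrates the adjustment. In the first row, the columns marked "in order" indicate that data written back after the first stage stay in their original positions. The columns marked "reversed" indicate that the storage order of the written-back data is reversed; for example, the datum at address 1 of RAM0 is written to address 1 of RAM31 after the first FFT stage.
In an embodiment of the present invention, after the 3 rounds of 1024-point DFT operations, the data access control module again follows the Good-Thomas algorithm to read the data out of the data storage module in sequence, completing the whole 3072-point fast Fourier transform.
In the technical solution of the embodiments of the present invention, the 3072 points of data are first input in sequence to the data access control module. According to the Good-Thomas algorithm, this module distributes them across the data storage module in a specific bit-order relationship. Once all 3072 points have been stored, the data access control module issues a read enable, fetches data from the data storage module and feeds them to the 3-point DFT operation unit. It then writes the results back to the data storage module. When all 3-point DFT operations are complete and all data have been written back, the data access control module again issues a read enable and reads data from the data storage module into the 1024-point DFT operation unit. It again writes the results back to the data storage module. After the 1024-point DFT operations are complete, the data access control module outputs the data stored in the data storage module in sequence, in the order determined by the Good-Thomas algorithm, thereby completing one full 3072-point fast Fourier transform. The technical solution of the embodiments overcomes a problem in the prior art: a fast Fourier transform needs a large amount of storage to buffer and reorder the input, the output and the intermediate results. The solution selects multiple algorithms and combines their different characteristics to optimize the design as a whole. This both reduces the amount of computation and avoids the large buffers for intermediate results that current mixed algorithms often require, achieving a balance between resources and performance. In addition, the hardware implementation is simple, data buffer consumption is small, few multiplier units are needed, the degree of parallelism is high, and the computation precision is flexible.
The data processing method based on a 3072-point fast Fourier transform according to the embodiments of the present invention is described in further detail below with reference to the drawings.
The flow of the Good-Thomas algorithm used in the embodiment of the present invention is shown in FIG. 3. The 3072-point fast Fourier transform is decomposed into 3-point DFT operations and 1024-point DFT operations.
The input data are first rearranged, and a DFT is then performed on each group of 3 points. The results are mapped to the 1024-point DFT operation unit according to the mapping relationship shown in FIG. 2. Using Good-Thomas rather than Cooley-Tukey as the top-level algorithm structure saves 2046 complex multiplications.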
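The figure of 2046 saved complex multiplications can be checked by counting the inter-stage twiddle factors W_N^(n1·k2) (N = 3072) that a Cooley-Tukey 3 × 1024 split would need. The count includes every factor other than W^0 = 1, so trivial rotations such as -j are still counted; this is how the figure of 2046 arises. Good-Thomas needs none of these factors because gcd(3, 1024) = 1. The short Python check below uses a function name of our own choosing:

```python
from math import gcd


def cooley_tukey_twiddles(n1_len: int, n2_len: int) -> int:
    """Count the inter-stage twiddles W_N^(n1*k2) != 1 of a Cooley-Tukey split
    N = n1_len * n2_len, where n1 runs over one dimension's inputs and k2 over
    the other dimension's outputs."""
    n = n1_len * n2_len
    return sum(
        1
        for n1 in range(n1_len)
        for k2 in range(n2_len)
        if (n1 * k2) % n != 0  # exponent 0 means W^0 = 1: no multiplier needed
    )


# Good-Thomas requires coprime factors and then needs no inter-stage twiddles.
assert gcd(3, 1024) == 1
print(cooley_tukey_twiddles(3, 1024))  # 2 * 1023 = 2046
```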
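First, the input data x(n) are divided into 3 groups according to the Good-Thomas algorithm, as given by formula (1):
$$n=(N_2 n_1+N_1 n_2) \bmod N,\qquad 0\le n_1\le N_1-1,\quad 0\le n_2\le N_2-1 \tag{1}$$
where N1 = 3, N2 = 1024, n is the input data index, mod denotes the remainder (modulo) operation, and N is the length of the DFT interval (N = N1 × N2 = 3072).
Therefore, the input data are divided into 3 groups in the order of Table 1: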
Figure PCTCN2016085423-appb-000003
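Table 1
The three groups are stored as follows:
- n1 = 0: the first row of input data goes to addresses 0–31 of the 32 RAMs of the data storage module.
- n1 = 1: the second row goes to addresses 32–63.
- n1 = 2: the third row goes to addresses 64–95.

The input to each 1024-point FFT must arrive in bit-reversed order. For this reason, the data access control module also bit-reverses in advance the order of the RAM addresses to which each row of Table 1 is written. FIG. 4 shows the relationship between input data and RAM addresses; its three blocks represent the address spaces 0–31, 32–63 and 64–95 of the 32 RAMs. In each block, the numbers in the first row are the RAM addresses (96 addresses in total), and the first column at the right of each block gives the RAM number (32 RAMs in total). The number at each address is the sequence number of the input datum, the first input datum being numbered 0. For example, the value 1536 in row 3, column 2 of the first block means that the 1537th input datum is stored at address 0 of RAM1.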
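The Python sketch below models the grouping of formula (1) together with the bit-reversed placement. The text does not state the placement rule directly. We assume that within each 32-address block, position p = 32 × address + RAM number, and that p is the 10-bit reversal of n2. This is our reading of the FIG. 4 example, and it reproduces that example: datum 1536 at address 0 of RAM1. All function names are hypothetical:

```python
from __future__ import annotations

N1, N2, N = 3, 1024, 3072
NUM_RAMS = 32      # 32 RAM banks, 96 words deep
BLOCK_DEPTH = 32   # one 32-address block per value of n1


def bitrev(value: int, bits: int = 10) -> int:
    """Reverse the low `bits` bits of value (10 bits for a 1024-point FFT)."""
    return int(format(value, f"0{bits}b")[::-1], 2)


def table1_row(n1: int) -> list[int]:
    """Input indices of one row of Table 1: n = (N2*n1 + N1*n2) mod N, formula (1)."""
    return [(N2 * n1 + N1 * n2) % N for n2 in range(N2)]


def input_layout() -> dict[int, tuple[int, int]]:
    """Map input index n -> (ram, address), assuming bit-reversed placement per block."""
    layout = {}
    for n1 in range(N1):
        for n2, n in enumerate(table1_row(n1)):
            p = bitrev(n2)  # bit-reversed position so the 1024-point FFT sees DIT input order
            layout[n] = (p % NUM_RAMS, BLOCK_DEPTH * n1 + p // NUM_RAMS)
    return layout


layout = input_layout()
print(table1_row(1)[:4])  # [1024, 1027, 1030, 1033]
print(layout[1536])       # (1, 0): datum 1536 sits at address 0 of RAM1
```

Under this layout, the three inputs of each 3-point DFT (same n2, n1 = 0, 1, 2) land in the same RAM at addresses a, a + 32 and a + 64. Datum 0, datum 1024 and datum 2048, for instance, sit at addresses 0, 32 and 64 of RAM0.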
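After the input data have been stored in the 32 RAMs of the data storage unit according to the mapping in FIG. 4, the 3-point DFT operations begin; 1024 of them are performed in total. The 3-point DFT operation unit of the present invention contains 16 parallel 3-point DFT units, so one round processes 16 3-point DFTs. The 1024 3-point DFTs therefore require 64 rounds (1024 / 16) in total. The data access control unit fetches data alternately, in ping-pong fashion, from RAMs 0–15 and RAMs 16–31 of the 32 RAMs of the data storage unit and feeds them to the 3-point DFT operation unit. The results are written back in place to the corresponding RAM addresses. Note that, as required by the Goertzel algorithm, data are fetched in order from the highest address to the lowest.
The present invention uses the Goertzel algorithm, and the formula for the 3-point DFT is as follows: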
$$X(k)=\sum_{n=0}^{2} x(n)\,W_3^{nk},\qquad k=0,1,2 \tag{2}$$
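where x(n) is the input data, X(k) is the result of the operation, and W is the butterfly twiddle factor.
Formula (2) can be transformed into the form of formula (4):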
Figure PCTCN2016085423-appb-000005
The above formula shows that X(k) can be computed recursively, as shown in FIG. 5. The results of the Goertzel algorithm are listed in Table 2.
Figure PCTCN2016085423-appb-000006
Figure PCTCN2016085423-appb-000007
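Table 2
where y(n) is the result obtained at each step, y(n) = x(n) + REG. x(n) and REG are the two intermediate terms, and REG denotes the data held in the register.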
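The Python sketch below illustrates the recursion y(n) = x(n) + REG. The equation images are not reproduced here, so the exact form is an assumption. We take REG to hold the previous result multiplied by W_3^k, and we feed the samples from x(2) down to x(0), matching the high-to-low read order stated in the text. This is Horner's rule for X(k) = x(0) + W_3^k (x(1) + W_3^k x(2)), a first-order recursion in the spirit of Goertzel. The code checks it against the direct 3-point DFT. The function names are illustrative only:

```python
from __future__ import annotations

import cmath

W3 = cmath.exp(-2j * cmath.pi / 3)  # W_3 = e^{-j*2*pi/3}


def dft3_direct(x: list[complex]) -> list[complex]:
    """Reference: X(k) = sum_n x(n) * W3^(n*k)."""
    return [sum(x[n] * W3 ** (n * k) for n in range(3)) for k in range(3)]


def dft3_recursive(x: list[complex]) -> list[complex]:
    """One register per output bin k; in hardware the three bins run in parallel."""
    out = []
    for k in range(3):
        w = W3 ** k
        reg = 0j                   # register cleared before each transform
        for n in (2, 1, 0):        # samples read from the highest address down
            reg = x[n] + w * reg   # y(n) = x(n) + W3^k * (previous y)
        out.append(reg)
    return out


x = [1 + 2j, -0.5 + 0.25j, 0.75 - 1j]
print(all(abs(a - b) < 1e-12 for a, b in zip(dft3_recursive(x), dft3_direct(x))))  # True
```

The first iteration multiplies a cleared register, so each bin needs only 2 non-trivial complex multiplications. This agrees with the count of 2 complex multiplications quoted for the 3-point DFT.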
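To increase data processing capability, each 3-point DFT operation unit of the present invention uses 3 Goertzel operation units operating in parallel, which speeds up processing by a factor of 3; the structure is shown in FIG. 6. Take 3-point DFT unit No. 0 in the first round as an example. Input data 2048, 1024 and 0 are fetched from RAM0 of the data storage module (addresses 64, 32 and 0 under the mapping of FIG. 4) and fed to the 3-point DFT unit. When the operation is complete, the results X(0), X(1) and X(2) are written back to the RAM0 locations of data 0, 1024 and 2048, respectively, so that the computation is in place. All other 3-point DFTs in the other rounds work the same way. The data access control unit fetches their data from the data storage unit in a similar order, and the results are written back in place. Besides the input x values, each Goertzel operation unit also needs the butterfly twiddle factors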
Figure PCTCN2016085423-appb-000008
and
Figure PCTCN2016085423-appb-000009
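In the operation unit, 12-bit input data (1 sign bit, 2 integer bits, 9 fractional bits) pass through the 3 iterations shown in FIG. 5 and produce 13-bit output data (1 sign bit, 3 integer bits, 9 fractional bits). This requires 2 complex multiplications and 2 complex additions in total. To prevent overflow from corrupting the sign bit of the output, overflow protection is applied to the output of the final iteration. The procedure is as follows:
- First, check whether the discarded high-order bits are all "0" or all "1". If they are, no overflow has occurred, and the truncated 13-bit value is output normally.
- If they are not, the value exceeds the 13-bit range, i.e. an overflow has occurred. Examine the most significant discarded bit: if it is "0", output 13'b0111111111111; if it is "1", output 13'b1000000000000.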
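The Python sketch below models the overflow protection on integers that carry the 9 fractional bits, i.e. values scaled by 2^9. The check "discarded bits all 0 or all 1" is phrased here as a range test. The range test is the exact form of that check once the retained sign bit is included in the comparison. The integer representation and the function names are assumptions:

```python
def saturate(value: int, out_bits: int = 13) -> int:
    """Clamp a two's-complement integer to out_bits, as in the overflow protection above."""
    lo = -(1 << (out_bits - 1))       # 13'b1000000000000 = -4096
    hi = (1 << (out_bits - 1)) - 1    # 13'b0111111111111 = +4095
    if lo <= value <= hi:
        return value                  # discarded high bits are pure sign extension: no overflow
    return hi if value > 0 else lo    # original sign bit selects the clamp direction


def to_bits(value: int, width: int = 13) -> str:
    """Two's-complement bit pattern, for comparison with the 13'b... constants."""
    return format(value & ((1 << width) - 1), f"0{width}b")


# 3.5 in the 1-3-9 output format is 3.5 * 2**9 = 1792: it fits and passes through.
print(saturate(1792))              # 1792
print(to_bits(saturate(9000)))     # 0111111111111
print(to_bits(saturate(-9000)))    # 1000000000000
```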
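After all 3-point DFT operations are complete, a 1024-point DFT can be performed on each row of Table 1 according to the Good-Thomas algorithm, in 3 rounds in total. The present invention uses the Cooley-Tukey decimation-in-time radix-2 FFT algorithm, whose flow graph is shown in FIG. 7. The 1024-point DFT requires 10 stages of 512 butterfly operations each, for a total of 5120 butterflies. The dashed box in FIG. 7 shows one butterfly unit, which mainly performs the butterfly operation shown in FIG. 8. The butterfly adds to, or subtracts from, the data at its input port 1 the product of the data at input port 2 and the twiddle factor. Each butterfly contains one complex multiplication, so the whole 1024-point DFT requires 5120 complex multiplications. The embodiment of the present invention uses a hardware structure with 16 butterflies operating in parallel. Each stage of the 1024-point DFT therefore requires 32 operations, and the whole 1024-point DFT requires 320 operations.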
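A software model of the radix-2 decimation-in-time flow can confirm the operation counts quoted above: 10 stages × 512 butterflies = 5120, and 5120 / 16 = 320 issue slots for 16 parallel butterflies. The sketch below is a plain iterative FFT, not the patent's RAM schedule. It bit-reverses the input as it loads it, whereas the text achieves the same effect by writing the data to bit-reversed RAM addresses in advance. Each butterfly computes port1 ± port2·W as described. The function name is hypothetical:

```python
from __future__ import annotations

import cmath


def fft_dit_radix2(x: list[complex]) -> tuple[list[complex], int]:
    """Iterative radix-2 DIT FFT. Returns (spectrum in natural order, butterfly count)."""
    n = len(x)
    stages = n.bit_length() - 1
    if n < 2 or n != 1 << stages:
        raise ValueError("length must be a power of two >= 2")
    # Load in bit-reversed order so that the output comes out in natural order.
    a = [x[int(format(i, f"0{stages}b")[::-1], 2)] for i in range(n)]
    butterflies = 0
    half = 1
    while half < n:  # one pass per stage (10 passes for n = 1024)
        span = 2 * half
        for start in range(0, n, span):
            for j in range(half):
                w = cmath.exp(-2j * cmath.pi * j / span)   # twiddle factor W_span^j
                top = a[start + j]                         # input port 1
                bottom = a[start + j + half] * w           # port 2 times twiddle: one complex multiply
                a[start + j] = top + bottom
                a[start + j + half] = top - bottom
                butterflies += 1
        half = span
    return a, butterflies


_, count = fft_dit_radix2([complex(i % 7, 0) for i in range(1024)])
print(count, count // 16)  # 5120 320
```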
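The 1024-point DFT operation unit performs the 1024-point DFT. To increase data processing capability and parallelism, it contains 16 butterfly operation units and a twiddle factor generation unit that produces the twiddle factors required during computation (as shown in FIG. 11). The butterfly operation units mainly perform the butterfly operation shown in FIG. 8. The inputs X1(k) and X2(k) correspond to ports a and b of the butterfly unit (see the butterfly unit in FIG. 11). Port tw (see the butterfly unit in FIG. 11) receives the twiddle factor required by each butterfly, which is generated by the twiddle factor generation unit. To improve performance, the butterfly unit is fully pipelined and completes one butterfly operation every 4 clock cycles (clk). One stage takes 32 clock cycles (clk). Because the 1024-point DFT has 10 stages, the result of each stage is truncated, to save as much hardware storage as possible without compromising performance. Table 3 gives the truncation and fixed-point format of each stage.
Stage      Sign  Integer  Fraction  Total width (bits)
Stage 1    1     3        9         13
Stage 2    1     4        9         14
Stage 3    1     4        9         14
Stage 4    1     5        9         15
Stage 5    1     5        9         15
Stage 6    1     6        9         16
Stage 7    1     6        9         16
Stage 8    1     7        9         17
Stage 9    1     7        9         17
Stage 10   1     8        9         18
Table 3
Like the Goertzel operation unit, the butterfly operation unit also applies overflow protection to its results, in a similar way. It checks whether the discarded bits are all "0" or all "1" and adjusts the output data according to the result.
The embodiment of the present invention performs the 1024-point fast Fourier transform on the 3072 points of data in 3 rounds. The first round of the 1024-point DFT is taken below as an example to describe how the data access control module accesses the intermediate results.

The data for the first round are stored at addresses 0–31 of RAM0–31 in the data storage unit, 1024 data in total. For ease of description, FIG. 9 shows the relationship between data and addresses; the data at each address are the results just written back after the 3-point DFT. In each clock cycle of the 1024-point DFT, the data access control unit reads one datum from each of the 32 RAMs of the data storage unit and sends them to the 1024-point DFT operation unit. As analysed above, the 1024-point DFT requires 10 stages, and each stage requires 32 operations. Each operation is performed in parallel by the 16 butterfly units, so 32 data are read from the RAMs in each clock cycle. As long as no address conflict occurs, the entire computation can be pipelined.

To ensure that no read/write address conflicts occur, the data storage addresses are adjusted when data are written back after the first stage of each 1024-point FFT, as shown in FIG. 10. In the first row, the columns marked "in order" indicate that data written back after stage 1 stay in their original positions. The columns marked "reversed" indicate that the storage order of the written-back data is reversed; for example, the datum at address 1 of RAM0 is written to address 1 of RAM31 after stage 1 of the FFT. This adjustment of the write-back addresses is made only after stage 1 of each 1024-point DFT; all later stages read and write the RAMs strictly in place. Without the adjustment, the written-back data would still follow the original mapping between input data and RAM addresses. An address conflict would then occur at stage 6, reducing parallelism. Because the adjustment is made only at write-back, it does not interrupt the pipelined processing.
The 1024-point DFT requires 10 stages. In each stage, data are read from the RAMs of the data storage unit in the order shown in FIG. 7 and fed to the 1024-point DFT module. Apart from the bit-order adjustment after the first stage, every subsequent stage can be computed in place. This keeps data storage relatively simple and saves a large amount of buffer space. It makes it possible for the 3072-point fast Fourier transform to use only 3072 storage locations throughout the computation, which minimizes RAM usage.
After all 3072 data have gone through the three 1024-point DFT operations, the Good-Thomas algorithm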
Figure PCTCN2016085423-appb-000010
is used to re-index the output data sequence once more, where
Figure PCTCN2016085423-appb-000011
Figure PCTCN2016085423-appb-000012
denotes the number of positive integers less than or equal to n that are coprime to n (Euler's totient function), so
Figure PCTCN2016085423-appb-000013
Figure PCTCN2016085423-appb-000014
Therefore, the mapping of the output data is as shown in Table 4:
Figure PCTCN2016085423-appb-000015
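Table 4
The value of k1 determines where the output data are stored:
- k1 = 0: addresses 0–31 of the 32 RAMs of the data storage unit.
- k1 = 1: addresses 32–63.
- k1 = 2: addresses 64–95.

The data storage addresses were adjusted once during the 1024-point DFT, so the final output order is adjusted accordingly. FIG. 12 shows the final relationship between output data and RAM addresses. As with the input layout, the three blocks in the figure represent the address spaces 0–31, 32–63 and 64–95 of the 32 RAMs. In each block, the numbers in the first row are the RAM addresses (96 addresses in total), and the first column at the right of each block gives the RAM number (32 RAMs in total). The number at each address is the sequence number of the output datum, the first output datum being numbered 0. For example, the value 2049 in row 3, column 2 of the first block means that the 2050th output datum is stored at address 0 of RAM1. The data access control module reads the data from the data storage unit in order, according to the address-to-data mapping shown in the figure, and outputs them, completing the 3072-point fast Fourier transform.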
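The output re-indexing and the whole Good-Thomas pipeline can be cross-checked in software. The Python sketch below rests on several assumptions:
- It uses the standard Chinese-remainder output map k = (N2·t1·k1 + N1·t2·k2) mod N, with t1 = N2^(φ(N1)-1) mod N1 and t2 = N1^(φ(N2)-1) mod N2 obtained from Euler's theorem.
- The inner transforms are plain reference DFTs/FFTs rather than the hardware units described above.
- The RAM placement, including the stage-1 write-back adjustment, is not modeled.

For N1 = 3 and N2 = 1024 the map reduces to k = (1024·k1 + 2049·k2) mod 3072. This is consistent with the example above, in which output datum 2049 occupies the k1 = 0, k2 = 1 slot. All function names are hypothetical:

```python
from __future__ import annotations

import cmath
from math import gcd

N1, N2, N = 3, 1024, 3072


def euler_phi(m: int) -> int:
    """Number of integers in 1..m that are coprime to m."""
    return sum(1 for i in range(1, m + 1) if gcd(i, m) == 1)


T1 = pow(N2, euler_phi(N1) - 1, N1)  # N2^-1 mod N1 = 1
T2 = pow(N1, euler_phi(N2) - 1, N2)  # N1^-1 mod N2 = 683


def output_index(k1: int, k2: int) -> int:
    """Good-Thomas (CRT) output map: (k1, k2) -> k."""
    return (N2 * T1 * k1 + N1 * T2 * k2) % N


def fft(a: list[complex]) -> list[complex]:
    """Recursive radix-2 FFT, used here only as a reference 1024-point DFT."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2]), fft(a[1::2])
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return [even[k] + tw[k] for k in range(n // 2)] + [even[k] - tw[k] for k in range(n // 2)]


def good_thomas_3072(x: list[complex]) -> list[complex]:
    # Input map, formula (1): grid[n1][n2] = x[(N2*n1 + N1*n2) mod N]
    grid = [[x[(N2 * n1 + N1 * n2) % N] for n2 in range(N2)] for n1 in range(N1)]
    w3 = cmath.exp(-2j * cmath.pi / 3)
    # 1024 three-point DFTs along n1; no twiddles are needed between the two passes.
    for n2 in range(N2):
        col = [grid[n1][n2] for n1 in range(N1)]
        for k1 in range(N1):
            grid[k1][n2] = sum(col[n1] * w3 ** (n1 * k1) for n1 in range(N1))
    rows = [fft(grid[k1]) for k1 in range(N1)]  # three 1024-point DFTs along n2
    out = [0j] * N
    for k1 in range(N1):
        for k2 in range(N2):
            out[output_index(k1, k2)] = rows[k1][k2]
    return out


print(T1, T2, output_index(0, 1))  # 1 683 2049
```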
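Based on the above specific implementation, the processor based on a 3072-point fast Fourier transform provided by the embodiment of the present invention has the structure shown in FIG. 13, including:
a data storage module, composed of 32 RAMs of 96 × 36 (depth in rows × width in bits);
a data access control module, responsible for reading and writing data;
a 3-point DFT operation unit, composed of 16 groups of 3 Goertzel operation units operating in parallel;
a 1024-point DFT operation unit, composed of 16 butterfly operation units operating in parallel and a twiddle factor generation unit.
FIG. 14 is a third schematic structural diagram of the processor based on a 3072-point fast Fourier transform according to an embodiment of the present invention. As shown in FIG. 14, the processor includes:
a mapping unit 11, configured to store 3072 points of data in a data storage module according to a predetermined mapping relationship;
a 3-point DFT operation unit 12, configured to read 16 data in parallel from the storage module, perform a 3-point DFT operation, and, after the operation is completed, store the results in place in the data storage module;
a 1024-point DFT operation unit 13, configured to read 32 data in parallel from the data storage module and perform a 1024-point DFT operation, until the fast Fourier transform of the 3072 points of data is completed.
In an embodiment of the present invention, the mapping unit 11 is further configured to sort the 3072 points of data according to the Good-Thomas algorithm and store them sequentially in the data storage module according to the sorting result. The data storage module consists of 32 RAMs of 96 × 36 (depth in rows × width in bits).
In an embodiment of the present invention, the 3-point DFT operation unit 12 is further configured to read 16 data in parallel from the storage module and perform the 3-point DFT operation using the Goertzel algorithm.
In an embodiment of the present invention, the 1024-point DFT operation unit 13 is further configured to read 32 data in parallel from the data storage module and perform a 10-stage FFT operation.
In an embodiment of the present invention, the 1024-point DFT operation unit 13 is further configured to read 32 data in parallel from the data storage module and perform the 1024-point DFT operation using the Cooley-Tukey algorithm, until the fast Fourier transform of the 3072 points of data is completed.
Those skilled in the art should understand that the functions of the units in the processor shown in FIG. 14 can be understood with reference to the foregoing description of the data processing method based on a 3072-point fast Fourier transform. The functions of these units may be implemented by a program running on a processor or by specific logic circuits.
In a specific implementation, the mapping unit 11, the 3-point DFT operation unit 12 and the 1024-point DFT operation unit 13 may be implemented by a Central Processing Unit (CPU), a Digital Signal Processor (DSP) or a Field-Programmable Gate Array (FPGA).
An embodiment of the present invention further describes a storage medium storing a computer program, the computer program being configured to execute the data processing method based on a 3072-point fast Fourier transform of the foregoing embodiments.
The technical solutions described in the embodiments of the present invention may be combined arbitrarily where they do not conflict.
In the several embodiments provided by the present invention, it should be understood that the disclosed method and smart device may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described above as separate components may or may not be physically separated, and the components shown as units may or may not be physical units. They may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present invention may all be integrated into one second processing unit, or each unit may serve separately as one unit, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of hardware plus software functional units.
The above are only specific embodiments of the present invention, and the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art can easily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention.
Industrial Applicability
The present invention selects multiple algorithms and combines their different characteristics to optimize the design as a whole. This both reduces the amount of computation and avoids the large buffers for intermediate results that current mixed algorithms often require, achieving a balance between resources and performance. In addition, the hardware implementation is simple, data buffer consumption is small, few multiplier units are needed, the degree of parallelism is high, and the computation precision is flexible.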

Claims (11)

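  1. A data processing method based on a 3072-point fast Fourier transform, the method comprising:
    storing 3072 points of data in a data storage module according to a predetermined mapping relationship;
    reading 16 data in parallel from the storage module, performing a 3-point Discrete Fourier Transform (DFT) operation, and, after the operation is completed, storing the results in place in the data storage module;
    reading 32 data in parallel from the data storage module, performing a 1024-point DFT operation, and, after the operation is completed, storing the results in place in the data storage module, until the fast Fourier transform of the 3072 points of data is completed.
  2. The data processing method based on a 3072-point fast Fourier transform according to claim 1, wherein storing the 3072 points of data in the data storage module according to the predetermined mapping relationship comprises:
    sorting the 3072 points of data according to the Good-Thomas algorithm and storing them sequentially in the data storage module according to the sorting result, wherein the data storage module consists of 32 Random Access Memories (RAMs) of 96 × 36 (depth in rows × width in bits).
  3. The data processing method based on a 3072-point fast Fourier transform according to claim 1, wherein reading 16 data in parallel from the storage module and performing the 3-point DFT operation comprises:
    reading 16 data in parallel from the storage module and performing the 3-point DFT operation using the Goertzel algorithm.
  4. The data processing method based on a 3072-point fast Fourier transform according to claim 1, wherein reading 32 data in parallel from the data storage module and performing the 1024-point DFT operation comprises:
    reading 32 data in parallel from the data storage module and performing a 10-stage Fast Fourier Transform (FFT) operation.
  5. The data processing method based on a 3072-point fast Fourier transform according to any one of claims 1 to 4, wherein reading 32 data in parallel from the data storage module, performing the 1024-point DFT operation, and, after the operation is completed, storing the results in place in the data storage module until the fast Fourier transform of the 3072 points of data is completed comprises:
    reading 32 data in parallel from the data storage module, performing the 1024-point DFT operation using the Cooley-Tukey algorithm, and, after the operation is completed, storing the results in place in the data storage module, until the fast Fourier transform of the 3072 points of data is completed.
  6. A processor based on a 3072-point fast Fourier transform, the processor comprising:
    a mapping unit, configured to store 3072 points of data in a data storage module according to a predetermined mapping relationship;
    a 3-point DFT operation unit, configured to read 16 data in parallel from the storage module, perform a 3-point DFT operation, and, after the operation is completed, store the results in place in the data storage module;
    a 1024-point DFT operation unit, configured to read 32 data in parallel from the data storage module, perform a 1024-point DFT operation, and, after the operation is completed, store the results in place in the data storage module, until the fast Fourier transform of the 3072 points of data is completed.
  7. The processor based on a 3072-point fast Fourier transform according to claim 6, wherein the mapping unit is further configured to sort the 3072 points of data according to the Good-Thomas algorithm and store them sequentially in the data storage module according to the sorting result, wherein the data storage module consists of 32 RAMs of 96 × 36 (depth in rows × width in bits).
  8. The processor based on a 3072-point fast Fourier transform according to claim 6, wherein the 3-point DFT operation unit is further configured to read 16 data in parallel from the storage module and perform the 3-point DFT operation using the Goertzel algorithm.
  9. The processor based on a 3072-point fast Fourier transform according to claim 6, wherein the 1024-point DFT operation unit is further configured to read 32 data in parallel from the data storage module and perform a 10-stage FFT operation.
  10. The processor based on a 3072-point fast Fourier transform according to any one of claims 6 to 9, wherein the 1024-point DFT operation unit is further configured to read 32 data in parallel from the data storage module, perform the 1024-point DFT operation using the Cooley-Tukey algorithm, and, after the operation is completed, store the results in place in the data storage module, until the fast Fourier transform of the 3072 points of data is completed.
  11. A storage medium storing a computer program, the computer program being configured to execute the data processing method based on a 3072-point fast Fourier transform according to any one of claims 1 to 5.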
PCT/CN2016/085423 2015-06-29 2016-06-12 Data processing method, processor and storage medium based on 3072-point fast Fourier transform WO2017000756A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/561,980 US10152455B2 (en) 2015-06-29 2016-06-12 Data processing method and processor based on 3072-point fast Fourier transformation, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510369088.9 2015-06-29
CN201510369088.9A CN105045766B (zh) 2015-06-29 2015-06-29 Data processing method and processor based on 3072-point fast Fourier transform

Publications (1)

Publication Number Publication Date
WO2017000756A1 true WO2017000756A1 (zh) 2017-01-05

Family

ID=54452324

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/085423 WO2017000756A1 (zh) 2015-06-29 2016-06-12 Data processing method, processor and storage medium based on 3072-point fast Fourier transform

Country Status (3)

Country Link
US (1) US10152455B2 (zh)
CN (1) CN105045766B (zh)
WO (1) WO2017000756A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992741A (zh) * 2019-03-15 2019-07-09 西安电子科技大学 Mixed radix-2/4 serial FFT implementation method and device
CN112163184A (zh) * 2020-09-02 2021-01-01 上海深聪半导体有限责任公司 Device and method for implementing an FFT
CN113157637A (zh) * 2021-04-27 2021-07-23 电子科技大学 Large-capacity reconfigurable FFT IP core based on FPGA

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
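CN105045766B (zh) * 2015-06-29 2019-07-19 深圳市中兴微电子技术有限公司 Data processing method and processor based on 3072-point fast Fourier transform
CN105893328A (zh) * 2016-04-19 2016-08-24 南京亚派科技股份有限公司 FFT algorithm based on Cooley-Tukey
CN107358165A (zh) * 2017-06-15 2017-11-17 深圳市泰和安科技有限公司 FFT-filtering-based method, terminal device and computer-readable storage medium
CN109144793B (zh) * 2018-09-07 2021-12-31 合肥工业大学 Fault correction device and method based on data-flow-driven computation
CN109543137B (zh) * 2018-11-20 2022-11-11 中国人民解放军国防科技大学 Parallel fast Fourier transform data processing method and device in the cloud
CN113111300B (zh) * 2020-01-13 2022-06-03 上海大学 Fixed-point FFT implementation system with optimized resource consumption
CN112800386B (zh) * 2021-01-26 2023-02-24 Oppo广东移动通信有限公司 Fourier transform processing method and processor, terminal, chip and storage medium
CN112905110B (zh) * 2021-01-29 2023-03-24 展讯半导体(成都)有限公司 Data storage method and device, storage medium, user equipment and network-side device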

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
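CN102104773A (zh) * 2009-12-18 2011-06-22 上海华虹集成电路有限责任公司 Radix-4 module for an FFT/IFFT processor supporting a variable number of data points
CN102238348A (zh) * 2010-04-20 2011-11-09 上海华虹集成电路有限责任公司 Radix-4 module of an FFT/IFFT processor with a variable number of data points
CN103020015A (zh) * 2012-11-30 2013-04-03 桂林卡尔曼通信技术有限公司 Method for fast computation of discrete Fourier transforms whose length is not a power of 2
CN105045766A (zh) * 2015-06-29 2015-11-11 深圳市中兴微电子技术有限公司 Data processing method and processor based on 3072-point fast Fourier transform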

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5293330A (en) * 1991-11-08 1994-03-08 Communications Satellite Corporation Pipeline processor for mixed-size FFTs
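KR100510551B1 (ko) * 2003-10-10 2005-08-26 삼성전자주식회사 OFDM demodulator for removing the common phase error (CPE) of OFDM signal symbols, and CPE removal method thereof
CN102831099B (zh) * 2012-07-27 2015-04-22 西安空间无线电技术研究所 Method for implementing a 3072-point FFT operation
CN103218348B (zh) * 2013-03-29 2016-01-27 北京创毅视讯科技有限公司 Fast Fourier transform processing method and system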

Also Published As

Publication number Publication date
CN105045766B (zh) 2019-07-19
US20180165250A1 (en) 2018-06-14
CN105045766A (zh) 2015-11-11
US10152455B2 (en) 2018-12-11

Similar Documents

Publication Publication Date Title
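WO2017000756A1 (zh) Data processing method, processor and storage medium based on 3072-point fast Fourier transform
CN110765709B (zh) FPGA-based radix-2² fast Fourier transform hardware design method
CN103699515B (zh) FFT parallel processing device and method
CN103488459B (zh) Improved high-radix CORDIC method and complex multiplication unit based thereon
WO2018027706A1 (zh) FFT processor and operation method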
Wang et al. Design of pipelined FFT processor based on FPGA
US20140089369A1 (en) Multi-granularity parallel fft computation device
US20140330880A1 (en) Methods and devices for multi-granularity parallel fft butterfly computation
US9268744B2 (en) Parallel bit reversal devices and methods
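CN105718424A (zh) Parallel fast Fourier transform processing method
WO2013097436A1 (zh) FFT/DFT bit-reversal ordering system and method, and operation system thereof
CN115544438A (zh) Twiddle factor generation method, device and computer equipment in a digital communication system
CN101764778B (zh) Baseband processor and baseband processing method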
US6963892B2 (en) Real-time method and apparatus for performing a large size fast fourier transform
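CN112149046A (zh) FFT processor and processing method based on parallel time-division multiplexing
CN102346728B (zh) Method and device for implementing FFT/DFT bit reversal using a vector processor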
More et al. FPGA implementation of FFT processor using vedic algorithm
Zhang et al. Small area high speed configurable FFT processor
Mohan et al. Implementation of N-Point FFT/IFFT processor based on Radix-2 Using FPGA
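CN107168928A (zh) Eight-point Winograd Fourier transformer requiring no reordering
CN104572578B (zh) Novel method for significantly improving FFT performance in microcontrollers
CN112307423B (zh) Radix-2 SDF pipelined FFT processor and its implementation method in ACO-OFDM systems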
Kumar et al. Hardware Implementation of 64-Bits Data by Radix-8 FFT/IFFT for High Speed Applications
CN102238348A (zh) Radix-4 module of an FFT/IFFT processor with a variable number of data points
Naresh et al. A Novel Architecture for Radix-4 Pipelined FFT Processor using Vedic Mathematics Algorithm

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16817120

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 15561980

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16817120

Country of ref document: EP

Kind code of ref document: A1