CN118013176A - FFT (fast Fourier transform) computing method and device of GPU (graphics processing unit) system and electronic equipment


Info

Publication number: CN118013176A
Application number: CN202410059172.XA
Authority: CN (China)
Prior art keywords: FFT, decomposition, calculation, plan, data
Legal status: Pending (the legal status is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Inventor: 李东泽
Current Assignee: FAW Beijing Software Technology Co., Ltd.; FAW Group Corp.
Original Assignee: FAW Beijing Software Technology Co., Ltd.; FAW Group Corp.
Application filed by FAW Beijing Software Technology Co., Ltd. and FAW Group Corp.
Priority to CN202410059172.XA
Publication of CN118013176A

Landscapes

  • Complex Calculations (AREA)

Abstract

The application discloses an FFT calculation method and device for a GPU system, and an electronic device. The method comprises: on the basis of an iterative Stockham FFT calculation framework, merging its sub-kernels according to the Cooley-Tukey FFT algorithm to obtain the FFT calculation framework of the GPU system; and performing complex-to-complex, power-of-two, multi-batch 1D and 2D FFT calculation based on that framework. The FFT calculation proceeds through the steps of creating a plan, transmitting data, executing sub-kernels and merging kernels, and obtaining the result. Through this scheme, a calculation framework that efficiently decomposes the problem scale is designed, so that the framework supports complex-to-complex, power-of-two, multi-batch 1D and 2D FFT calculation. The data layout in memory is redesigned, and memory and thread resources are used effectively. Through an additional stepwise calculation method, the multi-level memory of the GPU is fully exploited to relieve the memory-access pressure caused by the butterfly operations and the memory-access bottleneck of large-scale FFTs.

Description

FFT (fast Fourier transform) computing method and device of GPU (graphics processing unit) system and electronic equipment
Technical Field
The present application relates to the field of GPU computing, and in particular to an FFT calculation method and device for a GPU system, and an electronic device.
Background
With the development of intelligent vehicles, the problems that automobiles with automated and intelligent functions must handle are increasingly complex, placing higher demands on the real-time computing capability of on-board computers. Heterogeneous many-core platforms in which a CPU and a GPU compute cooperatively have become one of the preferred platforms for high-performance computing. The heterogeneous many-core architecture combines the advantages of both: the CPU's strength in logic-heavy task processing is brought to bear, while the GPU's massive parallel computing capability is fully utilized.
However, current research mainly focuses on optimizing the FFT on various accelerator platforms, and the special mathematical properties of the FFT still pose problems for its implementation and optimization on the GPU. For example, the FFT algorithm is both compute-intensive and memory-intensive: its arithmetic intensity is only moderate and its memory-access pattern is irregular, so memory easily becomes the bottleneck of the algorithm; parallelism is limited and access locality is low, making it difficult to fully use the many computing resources. Likewise, given the strong compute power of a heterogeneous many-core platform, how to design parallel algorithms and optimization strategies according to the characteristics of the processor architecture, and how to use the various memory types on a GPU effectively to support the FFT's particular operations, remain problems to be solved. Finally, different kernels are used for FFTs of different scales, kernels for large-scale FFT sequences can be more complex, and a reasonably consistent optimization effect may not be achieved across FFTs of multiple scales or across multidimensional FFTs.
Therefore, an FFT calculation scheme for a GPU system is provided: on the basis of an iterative Stockham FFT calculation framework, its sub-kernels are merged according to the Cooley-Tukey FFT algorithm, and a unified scale-decision and kernel-allocation mechanism yields a reasonably consistent performance improvement across data scales and across 1D FFT (one-dimensional FFT) and 2D FFT (two-dimensional FFT).
Disclosure of Invention
The invention aims to provide an FFT calculation method for a GPU system, an FFT calculation device for a GPU system, and an electronic device, which solve at least one of the above technical problems.
The invention provides the following scheme:
According to one aspect of the present invention, there is provided an FFT calculation method for a GPU system, the method comprising:
on the basis of an iterative Stockham FFT calculation framework, merging its sub-kernels according to the Cooley-Tukey FFT algorithm to obtain the FFT calculation framework of the GPU system;
performing complex-to-complex, power-of-two, multi-batch 1D and 2D FFT calculation based on the FFT calculation framework of the GPU system;
wherein the FFT calculation is performed through the steps of creating a plan, transmitting data, executing sub-kernels and merging kernels, and obtaining the result.
Further, the FFT calculation includes:
creating a plan handle according to the scale of the input data;
wherein creating the plan handle includes setting the basic information of the FFT calculation;
the basic information includes the data dimension, data scale, data precision, and transform type of the input sequence in the FFT calculation;
according to the information in the plan handle, performing problem decomposition on the input data to generate a plurality of decomposition trees;
evaluating each decomposition tree to generate a decomposition plan;
wherein generating the decomposition plan includes screening the decomposition trees for the optimal decomposition scheme and adding it to the plan handle.
Further, the method further comprises:
after the decomposition plan is generated, pre-computing the common twiddle factors on the CPU;
after the CPU has pre-computed the common twiddle factors, copying the information of the input sequence, of the decomposition plan, and of the pre-computed twiddle factors into the global memory of the GPU.
Further, the method further comprises:
the GPU side reading data from global memory and, according to the decomposition plan, launching a plurality of decomposition kernels to compute;
wherein each decomposition kernel comprises a plurality of sub-kernels, and the sub-kernels compute the decomposed sub-problems;
judging whether the calculation performed by a decomposition kernel has finished;
if it has finished, the decomposition kernel merging its sub-kernels and exchanging data;
iterating over multiple rounds in which the GPU side reads data from global memory and launches the decomposition kernels according to the decomposition plan, to generate the result of the FFT calculation.
Further, performing the FFT calculation through the steps of creating a plan, transmitting data, executing sub-kernels and merging kernels, and obtaining the result includes:
the step of creating a plan comprising a scale-decision step and a twiddle-factor pre-computation step;
the scale-decision step comprising analyzing the scale of the FFT input data, determining the decomposition scheme through the decomposition trees, and forming the plan handle;
the twiddle-factor pre-computation step comprising the CPU pre-analyzing the characteristics of the input data before the FFT calculation, including generating the common twiddle factors of the calculation process based on the symmetry and periodicity of the twiddle factors.
Further, performing the FFT calculation through the above steps further includes:
the step of transmitting data comprising calling an execution function, copying the information of the plan handle, of the input sequence, and of the pre-computed twiddle factors to the global memory of the GPU, and establishing a twiddle-factor lookup table in the texture memory within global memory.
Further, performing the FFT calculation through the above steps further includes:
the step of executing sub-kernels and merging kernels comprising iterating over multiple rounds of sub-kernel decomposition, sub-kernel merging, and data-exchange calculation, according to the optimal decomposition scheme screened from the decomposition trees;
the algorithm of the calculation comprises:
Y = (W ⊙ (A_{N1} · X)) · A_{N2}

wherein ⊙ is the element-wise product; A_{N1} is the N1×N1 base-N1 DFT matrix; W is the N1×N2 twiddle-factor matrix; and X is the input sequence reshaped into an N1×N2 matrix;
where, when N1 = N2 = 16 (N = 256), the thread resources of a wavefront are utilized for parallel calculation.
Further, the parallel calculation using wavefront thread resources includes measures to optimize the wavefront thread resources;
the measures to optimize the wavefront thread resources comprising the steps of thread-structure optimization, data-storage-structure optimization, row-column read-write optimization, and memory-bottleneck alleviation.
According to a second aspect of the present invention, there is provided an FFT calculation apparatus for a GPU system, the apparatus comprising:
a calculation-framework module, configured to merge sub-kernels according to the Cooley-Tukey FFT algorithm on the basis of an iterative Stockham FFT calculation framework, to obtain the FFT calculation framework of the GPU system;
an FFT calculation module, configured to perform complex-to-complex, power-of-two, multi-batch 1D and 2D FFT calculation based on the FFT calculation framework of the GPU system; and
a calculation-step module, configured to perform the FFT calculation through the steps of creating a plan, transmitting data, executing sub-kernels and merging kernels, and obtaining the result.
According to a third aspect of the present invention, there is provided an electronic device comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
the memory stores a computer program which, when executed by the processor, causes the processor to perform the steps of the FFT calculation method of the GPU system.
Through the scheme, the following beneficial technical effects are obtained:
According to the application, by designing a calculation framework that efficiently decomposes the problem scale — based on an iterative Stockham FFT calculation framework, with sub-kernels merged according to the Cooley-Tukey FFT algorithm — the framework supports complex-to-complex, power-of-two, multi-batch 1D and 2D FFT calculation.
By redesigning the data layout in memory, the application ensures coalesced reads of the data; according to the thread-execution characteristics, the FFT calculation process is simplified to reduce the number of floating-point operations, and memory and thread resources are used effectively.
Through an additional stepwise calculation method, the application fully exploits the multi-level memory of the GPU to relieve the memory-access pressure caused by the butterfly operations and the memory-access bottleneck of large-scale FFTs.
Drawings
FIG. 1 is a flowchart of a method for FFT computation of a GPU system according to one or more embodiments of the present invention.
FIG. 2 is a block diagram of an FFT computing device for a GPU system according to one or more embodiments of the present invention.
FIG. 3 is a schematic diagram of the FFT computation framework workflow of an embodiment of the invention.
Fig. 4 is a schematic diagram of an FFT computation framework in accordance with one embodiment of the present invention.
FIG. 5 is a schematic diagram of a merging process of radix-256 FFT kernels under an FFT computation framework according to an embodiment of the invention.
FIG. 6 is a diagram illustrating wavefront thread structure optimization in accordance with one embodiment of the present invention.
FIG. 7 is a schematic diagram of a data storage method and a computing process according to an embodiment of the present invention.
FIG. 8 is a schematic diagram of a Blocked Six-step FFT algorithm according to an embodiment of the invention.
Fig. 9 is a block diagram of an electronic device according to an FFT calculation method of a GPU system according to one or more embodiments of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without creative effort shall fall within the scope of the invention.
FIG. 1 is a flowchart of a method for FFT computation of a GPU system according to one or more embodiments of the present invention.
The FFT computation method of the GPU system shown in fig. 1 includes:
Step S1: on the basis of an iterative Stockham FFT calculation framework, merging its sub-kernels according to the Cooley-Tukey FFT algorithm to obtain the FFT calculation framework of the GPU system;
Step S2: performing complex-to-complex, power-of-two, multi-batch 1D and 2D FFT calculation based on the FFT calculation framework of the GPU system;
Step S3: performing the FFT calculation through the steps of creating a plan, transmitting data, executing sub-kernels and merging kernels, and obtaining the result.
Specifically, in on-board platforms, heterogeneous many-core platforms in which a CPU and a GPU compute cooperatively have become one of the preferred platforms for high-performance computing. The heterogeneous many-core architecture combines the advantages of both: the CPU's strength in logic-heavy task processing is brought to bear, while the GPU's massive parallel computing capability is fully utilized. In order to decompose the problem scale efficiently on such a platform, the sub-kernels are merged according to the Cooley-Tukey FFT algorithm on the basis of an iterative Stockham FFT calculation framework, obtaining the FFT calculation framework of the GPU system; based on this framework, complex-to-complex, power-of-two, multi-batch 1D (one-dimensional) and 2D (two-dimensional) FFT calculation is performed.
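As a concrete illustration of the iterative Stockham framework the scheme builds on, the following sketch (names and radix choice are illustrative, not taken from the application) implements a radix-2 Stockham autosort FFT. Unlike classic Cooley-Tukey, it needs no bit-reversal pass: each stage ping-pongs between two buffers and writes its outputs already sorted.

```python
import numpy as np

def stockham_fft(x):
    """Iterative radix-2 Stockham autosort FFT (decimation in frequency).

    Ping-pongs between buffers a and b each stage, so no bit-reversal
    permutation is needed at the end.
    """
    n = len(x)
    a = np.asarray(x, dtype=complex).copy()
    b = np.empty(n, dtype=complex)
    L, s = n, 1                 # L: current sub-transform length, s: stride
    while L > 1:
        m = L // 2
        w = np.exp(-2j * np.pi * np.arange(m) / L)   # stage twiddles W_L^p
        for p in range(m):
            for q in range(s):
                u = a[q + s * p]
                v = a[q + s * (p + m)]
                b[q + s * 2 * p] = u + v                  # butterfly: sum
                b[q + s * (2 * p + 1)] = (u - v) * w[p]   # difference, twiddled
        a, b = b, a
        L //= 2
        s *= 2
    return a
```

Every (p, q) butterfly within a stage is independent of the others, which is what lets a GPU assign one butterfly per thread.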
FFT computation is performed based on the steps of creating a plan, transmitting data, executing sub-kernels and merging kernels and obtaining results.
For example, in the plan-creation stage, the size of the FFT is analyzed, the final decomposition scheme is determined through the decomposition tree, and a plan handle is generated; before the FFT calculation, the host-side CPU pre-generates the common twiddle factors of the calculation process from the data characteristics, according to the symmetry and periodicity of the twiddle factors.
For example, in the data-transfer stage, the created plan calls the execution function to copy the plan handle, the original input sequence, and the pre-computed twiddle factors to the global memory of the GPU, and a twiddle-factor lookup table is established in the texture memory within global memory.
For example, in the stage of executing sub-kernels and merging kernels, the merge kernels are executed sequentially over several iterations according to the determined decomposition scheme.
For example, in the result stage, after all kernel calculations have completed, the FFT calculation result of the input sequence is obtained.
In this embodiment, the FFT calculation includes:
creating a plan handle according to the scale of the input data;
wherein creating the plan handle includes setting the basic information of the FFT calculation;
the basic information includes the data dimension, data scale, data precision, and transform type of the input sequence in the FFT calculation;
according to the information in the plan handle, performing problem decomposition on the input data to generate a plurality of decomposition trees;
evaluating each decomposition tree to generate a decomposition plan;
wherein generating the decomposition plan includes screening the decomposition trees for the optimal decomposition scheme and adding it to the plan handle.
Specifically, the handle is a generalized pointer that maintains the structure of a decomposition plan, which can optimize the efficiency with which GPU memory is used. Because the data scale strongly influences the GPU's capacity to process the data, the plan handle is created according to the scale of the input data: based on the data dimension, data scale, data precision, and transform type of the input sequence in the FFT calculation, the optimal decomposition scheme is obtained by the tree-decomposition method and added to the plan handle.
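The tree-decomposition idea can be sketched as follows (a minimal illustration under assumed radices; the application's actual screening criteria are not specified here). Each ordered factorization of N into allowed radices is one branch of the decomposition tree, i.e. one candidate plan, and screening picks a best branch:

```python
from math import prod

def decomposition_trees(n, radices=(256, 16, 8, 4, 2)):
    """Enumerate ordered factorizations of n into the allowed radices.

    Each factorization is one branch of the decomposition tree (one
    candidate FFT plan). Radices are tried largest-first, so plans that
    favor the large merge kernels are generated first.
    """
    if n == 1:
        return [[]]
    plans = []
    for r in radices:
        if n % r == 0:
            for rest in decomposition_trees(n // r, radices):
                plans.append([r] + rest)
    return plans

def best_plan(n):
    """Screen the candidates: here, simply prefer the fewest factors."""
    return min(decomposition_trees(n), key=len)
```

For example, for N = 4096 the screened plan comes out as [256, 16], matching the preference for radix-256/radix-16 merge kernels discussed later.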
In this embodiment, the method further comprises:
after the decomposition plan is generated, pre-computing the common twiddle factors on the CPU;
after the CPU has pre-computed the common twiddle factors, copying the information of the input sequence, of the decomposition plan, and of the pre-computed twiddle factors into the global memory of the GPU.
Specifically, the FFT is a fast algorithm for the discrete Fourier transform (Discrete Fourier Transform, DFT). It exploits properties of the transform such as symmetry, greatly reducing the amount of DFT computation and lowering the time complexity of the algorithm from O(N^2) to O(N log N). For a one-dimensional input sequence of data size N, the DFT calculation formula is as follows:

X[k] = Σ_{n=0}^{N-1} x[n] · W_N^{nk},  k = 0, 1, ..., N−1,

where W_N = e^{−2πi/N} is called the twiddle factor, e^{ix} = cos x + i·sin x, and i is the imaginary unit. According to the optimal decomposition scheme, before the FFT calculation the host-side CPU pre-generates the common twiddle factors of the calculation process from the data characteristics (data scale), according to the symmetry and periodicity of the twiddle factors.
After the CPU pre-computes the common twiddle factors, the information of the input sequence, of the decomposition plan, and of the pre-computed twiddle factors is copied into the global memory of the GPU.
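One way the symmetry and periodicity mentioned above can shrink the precomputed table — a sketch under assumed conventions, not necessarily the application's exact scheme — is to store only the first quadrant of the unit circle, W_N^0 through W_N^{N/4}, and recover every other twiddle from it:

```python
import numpy as np

def quarter_twiddle_table(N):
    """Host-side precomputation: N/4 + 1 entries instead of N."""
    k = np.arange(N // 4 + 1)
    return np.exp(-2j * np.pi * k / N)

def twiddle(table, N, k):
    """Recover W_N^k = exp(-2*pi*i*k/N) from the quarter table."""
    k %= N                           # periodicity: W_N^{k+N} = W_N^k
    quad, r = divmod(k, N // 4)      # quadrant index and remainder
    return table[r] * (-1j) ** quad  # symmetry: W_N^{k+N/4} = -i * W_N^k
```

The lookup costs one table read plus a rotation by a power of -i, which is sign and real/imaginary swapping in practice.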
In this embodiment, the method further comprises:
the GPU side reading data from global memory and, according to the decomposition plan, launching a plurality of decomposition kernels to compute;
each decomposition kernel comprising a plurality of sub-kernels, the sub-kernels computing the decomposed sub-problems;
judging whether the calculation performed by a decomposition kernel has finished;
if it has finished, the decomposition kernel merging its sub-kernels and exchanging data;
over multiple rounds of iteration, the GPU side reads data from global memory and launches the decomposition kernels according to the decomposition plan, generating the FFT calculation result.
Specifically, after all kernel calculations have completed, the FFT result of the input sequence is obtained. To carry out the above process, several optimizations are required: thread-structure optimization, data-storage-structure optimization, row-column read-write optimization, and memory-bottleneck alleviation.
In the present embodiment, performing the FFT calculation through the steps of creating a plan, transmitting data, executing sub-kernels and merging kernels, and obtaining the result includes:
the step of creating a plan comprising a scale-decision step and a twiddle-factor pre-computation step;
the scale-decision step comprising analyzing the scale of the FFT input data, determining the decomposition scheme through the decomposition trees, and forming the plan handle;
the twiddle-factor pre-computation step comprising the CPU pre-analyzing the characteristics of the input data before the FFT calculation, including generating the common twiddle factors of the calculation process based on the symmetry and periodicity of the twiddle factors.
Specifically, in the scale-decision step, following the design idea of the decomposition framework, the host side first analyzes the scale of the FFT, determines the final decomposition scheme through the decomposition tree, and generates the plan handle.
For example, according to the Cooley-Tukey FFT algorithm, the input sequence is recursively decomposed into smaller subsequences until no further decomposition is possible; the resulting data structure is the decomposition tree. Each branch of the tree represents one factorization of the given FFT size, and each branch of the decomposition tree is also referred to as one FFT plan. The task of the plan-creation stage is to reduce the search space of the many potential FFT plans and to find the best plan for the given FFT size.
Following the design idea of the decomposition framework, for a one-dimensional or two-dimensional FFT of size N, N may not be divisible by 256 at every step, and multiple decomposition strategies may exist, but not all of them yield good computational performance. For example, the iterative Stockham FFT calculation framework can be used to try to keep the decomposed sub-merge kernels uniform in size.
After the decomposition tree is generated, a descriptor, namely the plan handle, is established according to the input data. The descriptor holds the basic information for setting up the FFT calculation, including the data dimension, data scale, data precision, and transform type of the input sequence in the FFT calculation. The data in this embodiment defaults to single-precision or double-precision complex data.
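A hypothetical shape for such a descriptor (field names here are illustrative, not taken from the application) might look like:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class PlanHandle:
    """Hypothetical plan-handle descriptor holding the basic information
    listed above; names and fields are assumptions for illustration."""
    rank: int                     # data dimension: 1 or 2
    shape: Tuple[int, ...]        # data scale of the input sequence
    precision: str                # 'single' or 'double' complex
    transform: str                # transform type, e.g. 'c2c'
    batch: int = 1                # number of batched transforms
    radices: List[int] = field(default_factory=list)  # screened decomposition

# e.g. a single-precision 4096-point complex-to-complex plan:
plan = PlanHandle(rank=1, shape=(4096,), precision='single',
                  transform='c2c', radices=[256, 16])
```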
In the pre-computation step, the twiddle factors are pre-computed according to the decomposition scheme; that is, before the FFT calculation the host-side CPU pre-generates the common twiddle factors of the calculation process from the data characteristics (such as the data scale), according to the symmetry and periodicity of the twiddle factors.
In this embodiment, performing the FFT calculation through the above steps further includes:
the step of transmitting data comprising calling an execution function, copying the information of the plan handle, of the input sequence, and of the pre-computed twiddle factors to the global memory of the GPU, and establishing a twiddle-factor lookup table in the texture memory within global memory.
Specifically, like constant memory, texture memory is cached on-chip, which in some cases can reduce memory requests and provide more effective memory bandwidth. Texture caches are designed specifically for applications, such as image data, whose memory-access patterns exhibit a great deal of spatial locality: a thread's reads tend to land very close to the read locations of neighboring threads. Establishing the twiddle-factor lookup table in the texture memory within global memory therefore helps preserve the running speed of the system.
In this embodiment, performing the FFT calculation through the above steps further includes:
the step of executing sub-kernels and merging kernels comprising iterating over multiple rounds of sub-kernel decomposition, sub-kernel merging, and data-exchange calculation, according to the optimal decomposition scheme screened from the decomposition trees;
the algorithm of the calculation comprises:
Y = (W ⊙ (A_{N1} · X)) · A_{N2}

wherein ⊙ is the element-wise product; A_{N1} is the N1×N1 base-N1 DFT matrix; W is the N1×N2 twiddle-factor matrix; and X is the input sequence reshaped into an N1×N2 matrix;
where, when N1 = N2 = 16 (N = 256), the thread resources of a wavefront are utilized for parallel calculation.
Specifically, according to the determined optimal decomposition scheme, the merge kernels are executed sequentially over several iterations; the key step is the merge kernel.
For example, the kernel of the radix-256 FFT algorithm contains two radix-16 FFT sub-kernels. The merging process can be represented by the formula:

Y = (W ⊙ (A_{N1} · X)) · A_{N2}

wherein ⊙ is the element-wise product, A_{N1} is the N1×N1 base-N1 DFT matrix, and W is the N1×N2 twiddle-factor matrix. When N1 = N2 = 16 (N = 256), the thread resources of a wavefront can be fully utilized.
Each sub-kernel, after reading its input, loads the pre-computed twiddle factors; these matrices are divided into segments and computed in parallel, e.g. across multiple wavefronts (64 threads each). After the computation completes, all sub-kernels are merged and computed in batch; for example, a size-256 FFT sequence is arranged as 16 sequences of size 16. The merge kernels are thus executed sequentially over several iterations, each iteration proceeding as follows: for each merge kernel of size 256, 16-point FFTs are calculated along the rows, and then N1 batches of 16-point FFTs are calculated along the column strides, comprising the merge of the radix-16 FFT kernels and the calculation of N1 256-point FFTs. After all iterations have been performed, the FFT result of the original sequence is written to global memory.
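The merging process described above can be checked numerically with a host-side sketch (assumed conventions: row-major reshape, with np.fft.fft standing in for the base-16 sub-kernel; the wavefront-level parallelism is not modeled):

```python
import numpy as np

def fft256_from_radix16(x):
    """256-point FFT as two passes of batched 16-point FFTs:
    Y = (W ⊙ (A16 · X)) · A16, with the output read out transposed."""
    N1 = N2 = 16
    N = N1 * N2
    X = x.reshape(N1, N2)                      # X[n1, n2] = x[n1*N2 + n2]
    B = np.fft.fft(X, axis=0)                  # 16-point FFTs down the columns
    k1 = np.arange(N1)[:, None]
    n2 = np.arange(N2)[None, :]
    B = B * np.exp(-2j * np.pi * k1 * n2 / N)  # element-wise twiddle matrix W
    C = np.fft.fft(B, axis=1)                  # 16-point FFTs along the rows
    return C.T.reshape(N)                      # result at index k1 + N1*k2
```

The column pass, twiddle multiply, and row pass correspond exactly to the three factors of the merge formula, and the final transpose expresses the digit-reversed output ordering of the Cooley-Tukey split.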
Similarly, the two-dimensional FFT is realized by batching: a one-dimensional FFT is performed along each dimension (computing a multidimensional FFT is equivalent to computing 1D FFTs along each dimension in turn).
For example, a two-dimensional FFT of an N1×N2 matrix can be viewed as a one-dimensional FFT over N1 rows with stride 1, together with an N1-point column FFT of batch size N2 with stride N1, implemented by adjusting the parameters when generating the FFT plan.
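The batched view of the 2D transform can be sketched as follows (np.fft.fft stands in for the batched 1D kernels; the stride bookkeeping is implicit in the axis argument):

```python
import numpy as np

def fft2_batched(X):
    """2D FFT of an N1 x N2 matrix as two batched 1D passes:
    a batch of row transforms (unit stride), then a batch of
    column transforms (strided access)."""
    Y = np.fft.fft(X, axis=1)     # N1 row FFTs of length N2
    return np.fft.fft(Y, axis=0)  # N2 column FFTs of length N1
```

Because the 2D DFT is separable, the two passes commute; the order is chosen for memory-access reasons, not mathematical ones.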
In this embodiment, the parallel computation using wavefront thread resources includes: optimizing wavefront thread resource measures;
The step of optimizing wavefront the thread resource measures includes the steps of thread structure optimization, data storage structure optimization, rank read-write optimization and memory bottleneck alleviation.
Specifically, in the thread-structure optimization step, the radix-16 FFT is chosen as the smallest computational kernel, and when the FFT is decomposed by scale it is preferentially decomposed into sizes of 16×16, so the number of threads per thread block is designed to be N/4. Four threads are used for the calculation of each radix-16 FFT kernel, so that one wavefront can complete the calculation of a 256-point FFT. When the FFT is decomposed by scale, it is preferentially decomposed into 16×16: N is split into N/256 parts, and each part is further decomposed into 16 kernels of 16-point FFT, so that the 16 kernels can be merged and calculated.
In the data-storage-structure optimization step, the FFT input sequence is a one-dimensional structure of type double2, containing two floating-point numbers (real and imaginary parts). Taking the 16-point FFT as an example: during the calculation, contiguous data is first moved from main memory to global memory, and then the latest 32 groups of data are read into shared memory at a time. In shared memory, the real and imaginary parts of the original data are stored separately, the distance between a real part and its imaginary part being fixed at 16 data elements. The memory accesses of the FFT calculation are then contiguous, and all 32 groups of 16-point data can be read in vectorized fashion.
After the data has been read into shared memory, the SIMD (single-instruction multiple-data) units of the GPU read it from shared memory in batches and perform the FFT butterfly calculation. The calculation pattern is fixed: the registers labeled as the B series are additionally multiplied by the twiddle factor W^k; the A-series and B-series registers are then added and the result stored to global memory, then subtracted and that result stored to global memory. At this point the data is again stored as type double2 (i.e., the real and imaginary parts of each group of data are stored in order).
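In scalar form, the fixed butterfly pattern just described (the B operand scaled by the twiddle, then a sum and a difference) is:

```python
def butterfly(a, b, w):
    """Radix-2 butterfly: the B-series register is first multiplied by
    the twiddle factor W^k, then added to and subtracted from the
    A-series register."""
    t = w * b
    return a + t, a - t
```

With w = 1 this is exactly the 2-point DFT; larger radices are built from layers of these butterflies with the appropriate twiddles.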
In the row-column read-write optimization step, the two-dimensional FFT maps naturally onto a two-dimensional matrix. For example, a one-dimensional FFT with input length 32768 can store its data in a 128×256 two-dimensional matrix and perform a one-dimensional FFT operation along each dimension. The matrix is regarded as stored row-first: the 256 data elements of the first row of the first dimension are contiguous in memory, while column data must be read with a stride, the stride being 128 data elements.
For example, let C denote a column-wise read or write, R a row-wise read or write, and T a matrix transposition operation. The FFT algorithm is performed in the following 3 steps:
1. Perform 256 128-point FFTs. The FFT is performed column by column, with data read and written by column (C-C for short).
2. Transpose (T) the matrix into a 256×128 two-dimensional matrix.
3. Perform 128 256-point FFTs. This FFT is likewise performed with column reads and writes (C-C).
The FFT algorithm described above can thus be expressed as: C-C-T-C-C;
Since the data is stored row-first, the column data cannot be read in coalesced fashion; an additional matrix-transpose operation can therefore be added so that the column reads become row reads:

(T-R)-R-R-(T-R)-R-(R-T)

The expression means: matrix transpose with row-first read-in, row FFT, row-first write-out; matrix transpose with row-first read-in, row FFT, transpose with row-first write-out. A T-R or R-T in brackets is treated as a single operation during the FFT calculation and cannot be split into two operations.
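The transpose rewrite can be checked numerically: a strided column-wise FFT equals transpose, contiguous row-wise FFT, transpose back (NumPy stands in here for the GPU kernels; the array sizes match the 128×256 example above):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((128, 256)) + 1j * rng.standard_normal((128, 256))

col_fft = np.fft.fft(A, axis=0)       # C: strided column access
via_rows = np.fft.fft(A.T, axis=1).T  # T-R ... R-T: rows are contiguous

assert np.allclose(col_fft, via_rows)
```

On a GPU the two forms differ sharply in memory behavior: the first issues strided (uncoalesced) loads, while the second pays for two transposes in exchange for fully coalesced row accesses.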
In the step of alleviating the memory bottleneck, on the basis of the original DFT algorithm, a Blocked Six-Step FFT algorithm is used in the concrete computation of each kernel; the algorithm combines multi-column FFTs with matrix transposition. Let N = N1 × N2 (N2 = 256).
When 16 rows of data are transferred from the two-dimensional array of size N1 × 256, they are first moved into an auxiliary array of size 256 × 16 for computation. Sixteen 256-point FFTs are performed on the 256 × 16 matrix held in registers. Each element of the 256 × 16 matrix remaining in the registers is multiplied by a twiddle factor, and the matrix is then restored, transposed 16 rows at a time, to its position in the original N1 × 256 matrix. Next, 256 N1-point FFTs are performed on the N1 × 256 array; each N1-point FFT can be performed in shared memory (L1 cache). Finally, 16 rows at a time of the N1 × 256 matrix are transposed and stored into the 256 × N1 matrix.
FIG. 2 is a block diagram of an FFT computing device for a GPU system according to one or more embodiments of the present invention.
As shown in fig. 2, the FFT computation device of the GPU system comprises: a computing framework module, an FFT calculation module and a calculation step module;
the computing framework module is used for merging the sub-kernels of the computation framework according to the Cooley-Tukey FFT algorithm, on the basis of the iterative Stockham FFT computation framework, to obtain the FFT computation framework of the GPU system;
The FFT calculation module is used for performing multi-batch complex-to-complex power-of-2 1D and 2D FFT computation based on the FFT computation framework of the GPU system;
And the calculation step module is used for carrying out FFT calculation based on the steps of creating a plan, transmitting data, executing the sub-kernels, combining the kernels and obtaining the result.
It should be noted that although the present system discloses only a computing framework module, an FFT calculation module and a calculation step module, the intended meaning is that, on the basis of these basic functional modules, a person skilled in the art may, in combination with the prior art, add one or more further functional modules to form countless embodiments or technical solutions; that is, the system is open rather than closed, and the protection scope of the claims of the present invention should not be limited to the basic functional modules disclosed above merely because this embodiment discloses only those modules.
Through the scheme, the following beneficial technical effects are obtained:
According to the application, a computation framework that efficiently decomposes the problem size is designed: the framework is based on the iterative Stockham FFT computation framework and merges sub-kernels according to the Cooley-Tukey FFT algorithm, so that it supports multi-batch complex-to-complex power-of-2 1D and 2D FFT computation.
The application ensures the merging and reading of the data by redesigning the data layout in the memory; according to the thread execution characteristics, the FFT calculation process is simplified to reduce floating point operation times, and memory and thread resources are effectively utilized.
According to the application, through an additional step-by-step calculation method, the characteristics of the GPU multi-stage memory are fully utilized to relieve the memory access pressure caused by butterfly operation and the memory access bottleneck of a large-scale FFT.
FIG. 3 is a schematic diagram of the FFT computation framework workflow of an embodiment of the invention.
Fig. 4 is a schematic diagram of an FFT computation framework in accordance with one embodiment of the present invention.
FIG. 5 is a schematic diagram of a merging process of radix-256 FFT kernels under an FFT computation framework according to an embodiment of the invention.
FIG. 6 is a diagram illustrating wavefront thread structure optimization in accordance with one embodiment of the present invention.
FIG. 7 is a schematic diagram of a data storage method and a computing process according to an embodiment of the present invention.
FIG. 8 is a schematic diagram of a Blocked Six-step FFT algorithm according to an embodiment of the invention.
In one embodiment, taking the one-dimensional FFT as an example, in the FFT computation framework workflow shown in fig. 3, when the algorithm starts executing, a plan handle is created according to the size of the input data; the plan handle contains the basic information for setting up the FFT computation, namely the data dimension, data size, data precision and transform type of the input sequence. Then, using the information in the plan handle, the program decomposes the input data into a number of sub-problems, trying several schemes and generating multiple decomposition trees; each decomposition tree is evaluated to obtain an optimal decomposition scheme, which is added to the plan handle. After the decomposition plan has been generated, the CPU pre-computes the common twiddle factors, and once this is complete the input sequence, the decomposition plan and the pre-computed twiddle factors are copied into the global memory of the GPU.
The GPU side first reads the data from global memory and launches several decomposition kernels according to the decomposition plan; each decomposition kernel contains multiple sub-kernels, and only the decomposed sub-problems are computed inside the sub-kernels. After all sub-kernel calculations are complete, the decomposition kernel merges all sub-kernels and performs a data exchange. After several iterations, all decomposition kernels have been computed and merged, and the transform result is finally obtained.
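The plan-then-execute flow above can be sketched in a few lines of Python. This is a minimal sketch, not the patented implementation: `make_plan` and `fft_mixed` are hypothetical names, a greedy factorization stands in for the decomposition-tree search, and a naive mixed-radix Cooley-Tukey combine stands in for the decomposition kernels.

```python
import cmath

def make_plan(n, radices=(256, 16)):
    """Greedy stand-in for the decomposition-tree search: factor n into
    the supported kernel sizes (hypothetical simplification)."""
    plan = []
    while n > 1:
        for r in radices:
            if n % r == 0:
                plan.append(r)
                n //= r
                break
        else:
            raise ValueError("size must factor into the supported radices")
    return plan

def fft_mixed(x, plan):
    """Execute the plan: split into sub-problems (the sub-kernels), then
    merge the partial results with twiddle factors W_n^{jk}."""
    n = len(x)
    if not plan:
        return list(x)                 # length-1 base case
    r, rest = plan[0], plan[1:]
    m = n // r
    subs = [fft_mixed(x[j::r], rest) for j in range(r)]   # r sub-problems
    out = []
    for k in range(n):                 # merge step (naive, O(n*r) per level)
        s = 0j
        for j in range(r):
            s += cmath.exp(-2j * cmath.pi * j * k / n) * subs[j][k % m]
        out.append(s)
    return out
```

On the host side the plan would be created once and reused for every batch; here the recursion simply walks the plan list.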
In another specific embodiment, taking an AMD GPU implementation as an example, the threads in an AMD GPU are organized and scheduled in thread groups (wavefronts), each wavefront containing 64 threads.
According to the Cooley-Tukey FFT algorithm, a sequence of input length N (N = 2^k, k an integer) can naturally be mapped into a two-dimensional matrix N = N1 × N2 (a matrix with N2 = 256 = 16 × 16 is preferably used). When the actual execution function is called, the FFT computation proceeds as in the FFT computation framework shown in fig. 4.
The algorithm under this framework mainly comprises four steps: 1. creating a plan; 2. copying the input sequence to the GPU; 3. executing the plan; 4. outputting the result.
1. Creating a plan. This step includes two sub-steps: scale decision and pre-calculating the twiddle factors.
(1) Scale decision.
The host side first analyses the FFT size, determines the final decomposition scheme through the decomposition tree, and generates the plan handle.
According to the Cooley-Tukey FFT algorithm, the input sequence is recursively decomposed into smaller sub-sequences until no further decomposition is possible; the resulting data structure is referred to herein as a decomposition tree. Each branch of the tree represents one factorization of the given FFT size, and each branch of the decomposition tree is also referred to as an FFT plan. The task of the plan-creation phase is to reduce the search space of the many potential FFT plans and to find the best plan for the given FFT size.
According to the design idea of the decomposition framework, for a one-dimensional or two-dimensional FFT of size N, N is not divisible by 256 at every step, and several decomposition strategies may exist, but not all of them achieve good computational performance. The embodiment uses the iterative Stockham FFT computation framework and tries to make the decomposed merge kernels uniform in size.
The decomposition strategy must be decided when the plan is created. Since the decomposition tree may be large, a depth-first search is employed and unnecessary branches are pruned; the principle is that the number of additional kernels of different sizes should be as small as possible, while decomposing into as many kernels of the same size as possible. If a decomposition into base-256 kernels (built from 16-point FFT sub-kernels) or base-16 kernels cannot be satisfied, a larger kernel (e.g. a base-1024 kernel) is tried, after which base-256 or base-16 kernels are tried again. The pruned tree is then evaluated with a bottom-up dynamic programming method.
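As a sketch of the pruning rule (names are hypothetical, and the real implementation uses depth-first search with pruning plus bottom-up dynamic programming rather than the full enumeration shown here), one can enumerate the small decomposition tree and apply a "fewest distinct kernel sizes, then fewest kernels" criterion:

```python
def branches(n, radices=(16, 256, 1024)):
    """Enumerate the decomposition tree: every branch is one factorization
    of n into supported kernel sizes (each branch = one candidate FFT plan)."""
    if n == 1:
        return [[]]
    out = []
    for r in radices:
        if n % r == 0:
            out.extend([r] + rest for rest in branches(n // r, radices))
    return out

def best_plan(n):
    """Evaluation rule sketched from the text: prefer the branch with the
    fewest distinct kernel sizes, then the fewest kernels overall."""
    return min(branches(n), key=lambda p: (len(set(p)), len(p)))
```

For example, 65536 = 256 × 256 is preferred over mixes of 16- and 256-kernels, since it uses a single kernel size and the fewest kernels.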
After the decomposition tree is generated, a descriptor, namely the plan handle, is established according to the input data. The descriptor contains the basic information for setting up the FFT computation, including the data dimension, data size, data precision and transform type of the input sequence. The data of the invention defaults to single-precision or double-precision complex data.
(2) Pre-calculating the twiddle factors.
The twiddle factors are pre-calculated according to the decomposition scheme: before the FFT computation, the CPU on the host side pre-generates the common twiddle factors used during computation, according to the data characteristics and the symmetry and periodicity of the twiddle factors.
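A minimal sketch of that symmetry argument (assuming 4 divides n; the actual table layout on the GPU differs): only a quarter period of the twiddle factors needs to be stored, the rest following from periodicity and quarter-wave symmetry.

```python
import cmath

def twiddle_table(n):
    """Pre-compute only a quarter period of W_n^k = exp(-2*pi*i*k/n) and
    recover the rest from periodicity (W^{k+n} = W^k) and quarter-wave
    symmetry (W^{k+n/4} = -i * W^k)."""
    quarter = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 4)]
    def w(k):
        q, r = divmod(k % n, n // 4)   # periodicity first, then symmetry
        return (-1j) ** q * quarter[r]
    return w
```

This cuts the table to a quarter of the naive size at the cost of one complex multiply per lookup, which is the kind of trade-off the host-side pre-computation exploits.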
2. The input sequence is copied to the GPU.
Using the created plan, the execution function is called to copy the plan handle, the original input sequence and the pre-computed twiddle factors into the global memory of the GPU, and a twiddle factor lookup table is established in the texture memory within the global memory.
3. The plan is executed.
The merge kernels are executed in sequence in a plurality of iterations according to the decomposition scheme determined in the previous step.
In this embodiment, the key step of the algorithm is to merge the kernels. As shown, the kernel of the radix-256 FFT algorithm contains two radix-16 FFT sub-kernels. The merging process can be represented by the formula:
where ⊙ denotes the element-wise product, A_{n1} is an n1 × n1 base-n1 DFT matrix, and W is an N1 × N2 twiddle factor matrix. When N1 = 16 and N2 = 256, the thread resources of a wavefront can be fully utilized.
After reading its input, each sub-kernel loads the pre-computed twiddle factors; these matrices are divided into segments and computed in parallel across multiple wavefronts. After the computation completes, all sub-kernels are merged and computed in batches, i.e. a 256-point FFT sequence is arranged as 16 sequences of size 16. The merge kernels are thus executed in sequence over several iterations, each iteration proceeding as follows: for each merge kernel of size 256, 16-point FFTs are computed along the rows, and then N1 batches of 16-point FFTs are computed along the columns with a stride, covering the merging of the radix-16 FFT kernels and the computation of N1 256-point FFTs. After all iterations have been performed, the FFT result of the original sequence is written into global memory.
The merging process of the radix-256 FFT kernel under the FFT computation framework shown in fig. 5 details the merging process of the radix-256 FFT kernel in fig. 4. The base-256 kernel is composed of two base-16 kernels; lines 1-11 and 12-19 are the two base-16 kernels. The merging process includes four main steps: (1) in lines 2-3 of fig. 5, the base-16 DFT matrix is loaded and the pre-computed twiddle factors are loaded while the input is read; (2) in lines 4-5 of fig. 5, these DFT matrices are divided into 16 × 16 segments and assigned to wavefronts for parallelization; (3) in lines 6-8 of fig. 5, the wavefronts perform an FFT on these segments; (4) in line 9 of fig. 5, the result is stored as intermediate data.
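The merge of two base-16 passes into one radix-256 kernel can be checked numerically with a small sketch. This is a hedged illustration only: a naive DFT stands in for the base-16 sub-kernel, and the register and wavefront layout of the real GPU kernel is omitted.

```python
import cmath

def dft(v):
    """Naive DFT standing in for the base-16 sub-kernel."""
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]

def fft256(x):
    """Radix-256 kernel assembled from two base-16 passes:
    column FFTs, twiddle multiply, row FFTs, transposed write-out."""
    assert len(x) == 256
    a = [[x[n1 * 16 + n2] for n2 in range(16)] for n1 in range(16)]  # 16x16 view
    for n2 in range(16):                      # first base-16 pass: columns
        col = dft([a[n1][n2] for n1 in range(16)])
        for k1 in range(16):
            a[k1][n2] = col[k1]
    for k1 in range(16):                      # twiddle: W_256^{k1*n2}
        for n2 in range(16):
            a[k1][n2] *= cmath.exp(-2j * cmath.pi * k1 * n2 / 256)
    out = [0j] * 256
    for k1 in range(16):                      # second base-16 pass: rows
        row = dft(a[k1])
        for k2 in range(16):
            out[k2 * 16 + k1] = row[k2]       # transposed output index
    return out
```

The transposed write-out at the end corresponds to the data exchange performed when the sub-kernels are merged.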
The procedure applies equally to the computation of a two-dimensional FFT, which is likewise implemented by batch processing, performing a one-dimensional FFT along each dimension (computing a multi-dimensional FFT is equivalent to computing a 1D FFT along each dimension in turn). For example, for a two-dimensional FFT of an N1 × N2 matrix, by adjusting parameters when generating the plan, it can be treated as N1 row FFTs with a stride of 1 and a batch of N2 column FFTs of length N1 with a stride of N1.
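The batched view of the 2D FFT can be sketched as two 1D passes; in this sketch a naive DFT stands in for the kernel and the strided column access is expressed as a transpose.

```python
import cmath

def dft(v):
    """Naive DFT standing in for the 1D FFT kernel."""
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]

def fft2d(a):
    """2D FFT as two batched 1D passes: a batch of row FFTs (stride 1),
    then a batch of column FFTs (strided access, shown via transpose)."""
    rows = [dft(r) for r in a]                   # row pass
    cols = [dft(list(c)) for c in zip(*rows)]    # column pass on the transpose
    return [list(r) for r in zip(*cols)]         # transpose back
```

Adjusting the plan parameters in the framework corresponds exactly to choosing which axis each batched pass walks and with what stride.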
4. Outputting the result.
And after all kernel calculation is completed, obtaining FFT results of the input sequence.
In another embodiment, the process shown in fig. 5 is optimized on the basis of the framework shown in fig. 4, mainly comprising: 1. thread structure optimization; 2. data storage structure optimization; 3. row-column read-write optimization; and 4. alleviating the memory bottleneck.
1. Thread structure optimization. In this embodiment, the base-16 FFT is selected as the minimum computation kernel, and when the FFT size is decomposed it is preferentially decomposed into 16 × 16, so the number of threads in each thread block is designed to be N/4, with 4 threads used for each base-16 FFT kernel; one wavefront can thus complete a 256-point FFT computation. As shown in fig. 6, N is first decomposed into N/256 parts, and each part is further decomposed into 16 kernels of 16-point FFTs, which can then be merged.
2. Data storage structure optimization. The FFT input sequence is a one-dimensional array of double2 structures, each holding two floating-point numbers (the real part and the imaginary part of one complex value). Taking the 16-point FFT as an example, during computation contiguous data is first moved from main memory to global memory, and 32 values at a time are then read into shared memory. In shared memory the real and imaginary parts of the original data are stored non-contiguously, with a fixed distance of 16 values between a value's real part and its imaginary part. The memory accesses of the FFT computation are then contiguous, and the 32 values of a 16-point group can be read in a vectorized manner.
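The interleaved-to-separated rearrangement of one 16-point group (32 scalar values) can be sketched as follows; `to_planar` is a hypothetical helper that illustrates only the index mapping, not the shared-memory mechanics.

```python
def to_planar(group):
    """Split an interleaved (re, im, re, im, ...) group of 16 complex
    values into 16 reals followed by 16 imaginaries, so each value's real
    and imaginary parts sit a fixed 16 slots apart."""
    assert len(group) == 32
    return group[0::2] + group[1::2]
```

With this layout, a vector load of 16 consecutive slots fetches all real parts (or all imaginary parts) of the group at once.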
As shown in fig. 7, after the data is read into the shared memory, the SIMD (single instruction, multiple data) units in the GPU can read the data from the shared memory in batches and perform the butterfly computation. The computation pattern is fixed: the registers marked with the B series are additionally multiplied by a twiddle factor W_k; the registers marked with the A series and the B series then first perform an addition and store the result into the global memory, and next perform a subtraction and store that result into the global memory. At this point the data is stored again in the double2 layout (i.e. the real part and the imaginary part of each group of data are stored consecutively).
3. Row-column read-write optimization. During computation the two-dimensional FFT maps naturally onto a two-dimensional matrix. Specifically, for a one-dimensional FFT with an input length of 32768, the data can be stored in a 128 × 256 two-dimensional matrix and a one-dimensional FFT operation performed along each dimension. The matrix is regarded as stored row-first: the first 256 values of the first row are contiguous in memory, while column data must be read with a stride (each column holds 128 values, spaced one full 256-element row apart). For convenience of description, let C denote a column read, c a column write, R a row read, r a row write, and T a matrix transpose operation. The FFT algorithm is performed in the following 3 steps:
(1) 256 128-point FFTs are performed. The FFT is performed column by column, with data read and written column by column (abbreviated C-c).
(2) The matrix is transposed (T) into a 256×128 two-dimensional matrix.
(3) 128 256-point FFTs are performed. This FFT is likewise performed with column reads and writes (C-c).
The FFT algorithm described above can be expressed by the expression:
C-c-T-C-c
Since the data is stored in a row-first manner, the column data cannot be read in a coalesced manner; an additional matrix transpose operation is therefore added so that each column read becomes a row read (a column read C equals T-R, and a column write c equals r-T):
C-c-T-C-c = (T-R)-(r-T)-T-(T-R)-(r-T)
= T-R-r-T-T-T-R-r-T
= T-R-r-T-R-r-T
= (T-R)-r-(T-R)-(r-T)
The meaning of the final expression is: matrix transpose, row-first read, row FFT, row-first write; matrix transpose, row-first read, row FFT, row-first write, matrix transpose. A T-R or r-T in parentheses is treated as a single operation during the FFT computation and cannot be split into two separate operations.
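The rewrite above can be checked mechanically: a column read of a row-major matrix equals a transpose followed by a row read, and a column write equals a row write followed by a transpose. A small sketch, with arbitrary per-vector functions standing in for the FFT stages (all names hypothetical):

```python
def transpose(m):
    return [list(r) for r in zip(*m)]

def col_pass(m, f):
    """C-c: read each column, apply f (the FFT stage), write back by column."""
    return transpose([f(c) for c in transpose(m)])

def row_pass(m, f):
    """R-r: read each row, apply f, write back by row."""
    return [f(r) for r in m]

def pipeline_cols(m, f1, f2):
    """C-c, T, C-c: the original 3-step algorithm."""
    return col_pass(transpose(col_pass(m, f1)), f2)

def pipeline_rows(m, f1, f2):
    """(T-R)-r-(T-R)-(r-T): the transpose-rewritten, row-only version."""
    return transpose(row_pass(transpose(row_pass(transpose(m), f1)), f2))
```

Both pipelines produce identical results for any stage functions, which is exactly what the expression rewrite asserts; the row-only version touches memory contiguously in every pass.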
4. Alleviating the memory bottleneck. On the basis of the original algorithm, the Blocked Six-Step FFT algorithm is used in the concrete computation of each kernel; the algorithm combines multi-column FFTs with matrix transposition. Let N = N1 × N2 (N2 = 256). The data first needs to be loaded into shared memory as shown in fig. 7, and the algorithm is shown in fig. 8.
In this algorithm, there are mainly the following 5 steps:
(1) When 16 rows of data are transferred from the two-dimensional array of size N1 × 256, they are first moved into an auxiliary array of size 256 × 16 for computation.
(2) Sixteen 256-point FFTs are performed on the 256 × 16 matrix held in registers.
(3) Each element of the 256 × 16 matrix remaining in the registers is multiplied by a twiddle factor, and the matrix is then restored, transposed 16 rows at a time, to its position in the original N1 × 256 matrix.
(4) 256 N1-point FFTs are performed on the N1 × 256 array; each N1-point FFT can be performed in shared memory (L1 cache).
(5) Finally, 16 rows at a time of the N1 × 256 matrix are transposed and stored into the 256 × N1 matrix.
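The blocked data movement of steps (1) and (5) can be sketched as follows; `load_block` and `store_block` are hypothetical helpers showing only the 16-row transposed tiling (the FFT and twiddle steps in between are as in fig. 8).

```python
def load_block(a, r0):
    """Step (1): lift 16 consecutive rows of an N1 x 256 array into a
    transposed 256 x 16 auxiliary block (the register tile)."""
    return [[a[r0 + j][i] for j in range(16)] for i in range(256)]

def store_block(out, c0, blk):
    """Step (5): write the 256 x 16 block into columns c0..c0+15 of a
    256 x N1 array, i.e. the transposed position."""
    for i in range(256):
        for j in range(16):
            out[i][c0 + j] = blk[i][j]
```

Tiling the transpose 16 rows at a time keeps each block resident in registers or shared memory, which is how the blocked variant relieves the strided global-memory traffic of a full-matrix transpose.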
Fig. 9 is a block diagram of an electronic device according to an FFT calculation method of a GPU system according to one or more embodiments of the present invention.
As shown in fig. 9, the present application provides an electronic apparatus including: the device comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
the memory stores a computer program that, when executed by the processor, causes the processor to perform the steps of a method for FFT computation of a GPU system.
The present application also provides a computer readable storage medium storing a computer program executable by an electronic device, which when run on the electronic device causes the electronic device to perform the steps of a FFT computation method of a GPU system.
The application also provides an on-board heterogeneous many-core platform, which comprises:
the electronic equipment is used for realizing the steps of the FFT calculation method of the GPU system;
A processor that runs a program, and performs steps of an FFT calculation method of the GPU system from data output from the electronic device when the program runs;
A storage medium storing a program that, when executed, performs steps of an FFT computation method of the GPU system on data output from the electronic device.
The communication bus mentioned for the electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one bold line is shown in the figures, but this does not mean that there is only one bus or only one type of bus.
The electronic device includes a hardware layer, an operating system layer running on top of the hardware layer, and an application layer running on top of the operating system. The hardware layer includes hardware such as a central processing unit (CPU), a memory management unit (MMU), and memory. The operating system may be any one or more computer operating systems that implement control of the electronic device via processes, such as a Linux, Unix, Android, iOS or Windows operating system. In addition, in the embodiment of the present invention, the electronic device may be a handheld device such as a smart phone or a tablet computer, or an electronic device such as a desktop computer or a portable computer, which is not particularly limited in the embodiment of the present invention.
The execution body controlled by the electronic device in the embodiment of the invention can be the electronic device or a functional module in the electronic device, which can call a program and execute the program. The electronic device may obtain firmware corresponding to the storage medium, where the firmware corresponding to the storage medium is provided by the vendor, and the firmware corresponding to different storage media may be the same or different, which is not limited herein. After the electronic device obtains the firmware corresponding to the storage medium, the firmware corresponding to the storage medium can be written into the storage medium, specifically, the firmware corresponding to the storage medium is burned into the storage medium. The process of burning the firmware into the storage medium may be implemented by using the prior art, and will not be described in detail in the embodiment of the present invention.
The electronic device may further obtain a reset command corresponding to the storage medium, where the reset command corresponding to the storage medium is provided by the provider, and the reset commands corresponding to different storage media may be the same or different, which is not limited herein.
At this time, the storage medium of the electronic device is a storage medium in which the corresponding firmware is written, and the electronic device may respond to a reset command corresponding to the storage medium in which the corresponding firmware is written, so that the electronic device resets the storage medium in which the corresponding firmware is written according to the reset command corresponding to the storage medium. The process of resetting the storage medium according to the reset command may be implemented in the prior art, and will not be described in detail in the embodiments of the present invention.
For convenience of description, the above devices are described as being functionally divided into various units and modules. Of course, the functions of the units, modules may be implemented in one or more pieces of software and/or hardware when implementing the application.
It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For simplicity of explanation, the methods are shown and described as a series of acts, but it should be understood and appreciated by those of ordinary skill in the art that the methods are not limited by the order of the acts, as some acts may occur in other orders or concurrently. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that the acts involved are not necessarily required by the embodiments of the invention.
From the above description of embodiments, it will be apparent to those skilled in the art that the present application may be implemented in software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server or a network device, etc.) to perform the method according to the embodiments or some parts of the embodiments of the present application.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (10)

1. The FFT computing method of the GPU system is characterized by comprising the following steps of:
Based on the iterative Stockham FFT computing framework, combining the sub-kernels of the Cooley-Tukey FFT algorithm to obtain an FFT computing framework of the GPU system;
Performing multi-batch complex-to-complex power-of-2 1D and 2D FFT calculation based on the FFT calculation frame of the GPU system;
wherein FFT computation is performed based on the steps of creating a plan, transmitting data, executing sub-kernels and merging kernels and obtaining results.
2. The FFT computation method of the GPU system according to claim 1, wherein the FFT computation includes:
creating a plan handle according to the scale of the input data;
The plan handle comprises setting basic information of FFT calculation;
the basic information comprises data dimension, data scale, data precision and transformation type of an input sequence in FFT calculation;
According to the information in the plan handle, carrying out problem decomposition on the input data to generate a plurality of decomposition trees;
evaluating each decomposition tree to generate a decomposition plan;
the decomposition plan comprises the step of screening the optimal decomposition scheme in the decomposition tree and adding the optimal decomposition scheme into the plan handle.
3. The FFT computation method of the GPU system according to claim 2, further comprising:
According to the generation of the decomposition plan, the CPU pre-calculates a common rotation factor;
And after the CPU pre-calculates the common twiddle factors, copying the information of the input sequence, the information of the decomposition plan and the information of the pre-calculated twiddle factors into the global memory of the GPU.
4. The FFT computation method of the GPU system according to claim 3, further comprising:
The GPU side reads data from the global memory and starts a plurality of decomposition kernels to calculate according to the decomposition plan;
each decomposition kernel comprises a plurality of sub-kernels, and the sub-kernels calculate decomposed sub-problems;
judging whether the calculation performed by the decomposition kernel is finished;
If the decomposition kernel is calculated, the decomposition kernel merges the sub-kernels and exchanges data;
And iterating the GPU side for a plurality of times, reading data from a global memory, starting calculation performed by a plurality of decomposition kernels according to the decomposition plan, and generating a result of FFT calculation.
5. The FFT computation method of a GPU system according to claim 4, wherein the steps of performing the FFT computation based on creating a plan, transmitting data, executing a sub-kernel and merging kernels and obtaining a result include:
The step of creating a plan comprises a step of scale decision and a step of pre-calculating twiddle factors;
The scale decision step comprises the steps of analyzing the scale of FFT input data, determining a decomposition scheme through a decomposition tree and forming the plan handle;
the step of pre-calculating the twiddle factors includes the CPU analysing the input data characteristics before the FFT computation, and generating the common twiddle factors used during computation based on the symmetry and periodicity of the twiddle factors.
6. The FFT computation method of a GPU system according to claim 5, wherein the steps of performing the FFT computation based on creating a plan, transmitting data, executing a sub-kernel and merging kernels and obtaining a result further comprise:
The step of transmitting data includes calling an executable function, copying information of the plan handle, information of an input sequence and information of a pre-calculated twiddle factor to a global memory of the GPU, and establishing a twiddle factor lookup table in a texture memory in the global memory.
7. The FFT computation method of a GPU system according to claim 6, wherein the steps of performing the FFT computation based on creating a plan, transmitting data, executing a sub-kernel and merging kernels and obtaining a result further comprise:
The step of executing the sub-kernels and merging the kernels comprises: according to the optimal decomposition scheme screened from the decomposition tree, iteratively decomposing into sub-kernels and merging the sub-kernels over multiple rounds, and performing the data-exchange calculation;
the algorithm of the calculation comprises:
wherein ⊙ denotes the element-wise product; A_{n1} is an n1 × n1 base-n1 DFT matrix; W is
an N1 × N2 twiddle factor matrix;
and when N1 = 16 and N2 = 256, the thread resources of a wavefront are utilized for parallel computation.
8. The FFT computation method of the GPU system of claim 7, wherein the parallel computation using the thread resources of wavefront comprises: optimizing wavefront thread resource measures;
The measures for optimizing wavefront thread resources comprise the steps of thread structure optimization, data storage structure optimization, row-column read-write optimization and memory bottleneck alleviation.
9. An FFT computation device of a GPU system, characterized in that the FFT computation device of the GPU system comprises:
the computing framework module is used for merging the sub-kernels of the computation framework according to the Cooley-Tukey FFT algorithm, on the basis of the iterative Stockham FFT computation framework, to obtain the FFT computation framework of the GPU system;
the FFT calculation module is used for performing multi-batch complex-to-complex power-of-2 1D and 2D FFT calculation based on the FFT calculation frame of the GPU system;
And a calculation step module, which is used for carrying out FFT calculation based on the steps of creating a plan, transmitting data, executing the sub-kernel and combining the kernels and obtaining the result.
10. An electronic device, comprising: the device comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
the memory has stored therein a computer program which, when executed by the processor, causes the processor to perform the steps of the FFT computation method of the GPU system of any of claims 1 to 8.
CN202410059172.XA 2024-01-16 2024-01-16 FFT (fast Fourier transform) computing method and device of GPU (graphics processing unit) system and electronic equipment Pending CN118013176A (en)

Publications (1)

Publication Number Publication Date
CN118013176A true CN118013176A (en) 2024-05-10

Family

ID=90951278


Country Status (1)

Country Link
CN (1) CN118013176A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination