CN114510268B

CN114510268B - GPU-based method for realizing single-precision floating point number accumulated error control in down-conversion

Info

Publication number: CN114510268B
Application number: CN202111601590.XA
Authority: CN
Inventors: 李超; 焦义文; 马宏; 吴涛; 高泽夫; 毛飞龙; 陈雨迪; 滕飞; 李冬; 卢志伟; 周扬
Original assignee: Peoples Liberation Army Strategic Support Force Aerospace Engineering University
Current assignee: Peoples Liberation Army Strategic Support Force Aerospace Engineering University
Priority date: 2021-12-24
Filing date: 2021-12-24
Publication date: 2022-09-20
Anticipated expiration: 2041-12-24
Also published as: CN114510268A

Abstract

The invention discloses a GPU-based method for controlling accumulated error in down-conversion, and relates to the technical field of communication. The method does not use a lookup table to compute phase values, which saves scarce on-chip memory and gives higher resource utilization; it confines the accumulated error to a limited number of data points, so error accumulation is smaller. Based on the relation between the local down-conversion frequency and the sampling frequency, the invention designs a GPU-based method for controlling error accumulation in the down-conversion computation. The method meets the accuracy requirement while keeping the down-conversion frequency flexible and variable, and calculation results show that the accumulated error can be held at the order of 1e-8.

Description

GPU-based method for realizing single-precision floating point number accumulated error control in down-conversion

Technical Field

The invention relates to the technical field of communication, and in particular to a GPU (graphics processing unit)-based method for controlling the accumulated error of single-precision floating-point numbers in down-conversion.

Background

In conventional aerospace measurement and control systems and deep space exploration systems, the receiving end generally includes a radio frequency receiving unit, an analog-to-digital converter (ADC), a digital down converter (DDC), a filtering and decimation unit, and a baseband processing unit. The radio frequency receiving unit receives electromagnetic signals and converts them into electrical signals, which are filtered and amplified to a certain amplitude range. The analog-to-digital conversion unit converts the received analog signals into digital signals. The down-conversion unit converts the radio frequency signal into a zero-intermediate-frequency baseband signal. The filtering and decimation unit decimates the high-rate code stream by a given ratio to produce a low-rate code stream. The baseband unit then performs functions such as synchronous demodulation on the rate-reduced digital signal. The down-conversion unit is an important component of a communication system, and its performance directly affects how well a task is completed.

Digital down-conversion obtains a baseband signal by multiplying the received signal point by point with the local down-conversion signal and then removing the high-frequency components with a low-pass filter. The process is shown in Fig. 1.

Assuming that the received signal is a real signal $s(t)$, its expression is:

$$s(t) = a(t)\cos[2\pi f_0 t + \varphi_0] \tag{1}$$

where $a(t)$ is the amplitude information of the received signal, $\varphi_0$ is the initial phase value of the received signal, and $f_0$ is the carrier frequency of the signal.

Sampling and digitizing formula (1) with sampling period $T_s$ gives:

$$s(nT_s) = a(nT_s)\cos[2\pi f_0 nT_s + \varphi_0] \tag{2}$$

The above equation is further simplified, with the result that:

$$s(n) = a(n)\cos[2\pi f_0 n + \varphi_0] \tag{3}$$

A schematic diagram of the direct digital down-conversion process is shown in Fig. 1. First, the signal $s(n)$ received by the receiver is multiplied by the in-phase local down-conversion signal $\cos(\omega_0 n)$ and by the quadrature local down-conversion signal $-\sin(\omega_0 n)$, and each product is passed through a low-pass filter to remove the high-frequency components, giving $I(n)$ and $Q(n)$. Finally, since the rate of the output signal is generally greater than the Nyquist sampling rate, the $I(n)$ and $Q(n)$ signals are decimated by a factor of D, and the rate-reduced signals $I(m)$ and $Q(m)$ are output.
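The mixing, low-pass filtering and decimation chain described above can be sketched in a few lines. This is a minimal double-precision illustration, not the patented implementation: the function name `ddc`, its parameter names and the moving-average filter (a stand-in for whatever low-pass filter a real receiver uses) are assumptions made for the example.

```python
import math

def ddc(samples, f_lo, f_s, decim, taps=4):
    """Mix real samples to baseband, low-pass filter, then decimate.

    f_lo / f_s is the local-oscillator frequency relative to the sampling
    frequency. The moving average is only a placeholder low-pass filter.
    """
    w0 = 2.0 * math.pi * f_lo / f_s
    # In-phase branch multiplies by cos(w0*n), quadrature branch by -sin(w0*n).
    i_mix = [s * math.cos(w0 * n) for n, s in enumerate(samples)]
    q_mix = [-s * math.sin(w0 * n) for n, s in enumerate(samples)]

    def lowpass(x):
        # Average of `taps` consecutive samples; only full windows are kept.
        return [sum(x[k:k + taps]) / taps for k in range(len(x) - taps + 1)]

    # Keep every decim-th filtered sample: I(n), Q(n) -> I(m), Q(m).
    return lowpass(i_mix)[::decim], lowpass(q_mix)[::decim]
```

For a unit-amplitude tone exactly at the local-oscillator frequency with initial phase 0.3 rad and `f_lo / f_s = 1/4`, the 4-tap average cancels the double-frequency term, so the outputs sit near I = 0.5·cos(0.3) and Q = 0.5·sin(0.3).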

The signals I (m) and Q (m) carry all the information of the signal s (n), and instantaneous amplitude, phase and frequency information can be conveniently obtained through calculation. The specific calculation formula is as follows:

instantaneous amplitude:

instantaneous phase:

instantaneous frequency:

In the above equations, $T_s$ is the sampling period of the two components $I(m)$ and $Q(m)$.
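For reference, the three quantities are commonly defined from the I/Q pair as follows. These are the standard textbook expressions, stated here as an assumption about the intended formulas rather than a reproduction of the patent's own equations:

```latex
A(m) = \sqrt{I^2(m) + Q^2(m)}, \qquad
\varphi(m) = \arctan\frac{Q(m)}{I(m)}, \qquad
f(m) = \frac{\varphi(m) - \varphi(m-1)}{2\pi T_s}
```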

As the down-conversion process shows, generating the down-conversion carrier signal is a critical step. In conventional digital down-conversion, the digital down-conversion module is generally implemented in a hardware chip, either an ASIC or an FPGA. Typical ASICs are chips from TI and ADI. Their parameters are fixed when they leave the factory, so they cannot meet requirements for multiple bandwidths and rates; when a system is upgraded or its parameters change, the hardware has to be redeveloped, with a long development cycle and high cost, and ASICs struggle to balance performance, cost and adaptability in flexible, changing communication systems. This lack of flexibility makes it hard for ASICs to complete design and tape-out promptly when technologies and protocols are upgraded, and later upgrades are costly. An FPGA is a special-purpose integrated circuit developed from programmable devices such as programmable array logic (PAL), generic array logic (GAL) and complex programmable logic devices (CPLD), and it can be programmed flexibly. Currently, mainstream FPGAs build their programmable logic units with look-up table (LUT) technology. The phase resolution of a lookup table is constrained by the on-chip storage of the FPGA and cannot be improved effectively. In recent years, as on-chip storage has grown, the lookup table method has been widely used because it needs few computing resources and is fast, but the above problem has not been fundamentally solved. Moreover, because of drawbacks such as high hardware development difficulty, a high entry threshold, long cycles, high cost, and limited gate count and functionality, this approach cannot keep up with modern communication systems, which require flexibly configurable parameters, rapid technology updates and flexible, variable functions.

With the development of high-performance computing, the GPU offers high-speed parallel processing, dynamically and flexibly configurable parameters, a short development cycle, a low entry threshold, and low later maintenance and upgrade costs, providing an effective solution to the inherent problems of hardware-based development.

Using its efficient floating-point arithmetic and multi-level memory system, CUDA can provide an efficient, high-precision sine lookup table for implementing a digital local oscillator. Schemes that implement the lookup table in GPU texture memory have appeared. In 2016, a university team in Sichuan designed a digital down-conversion signal with a lookup table method and achieved a 4-fold speedup over direct calculation; however, the frequency precision of that method is limited by the number of threads in one block and is difficult to improve. Scott C. Kim et al. used texture-memory nearest-neighbor and linear interpolation to realize output of arbitrary bandwidth. Their results show that the mean square error (MSE) between texture interpolation and traditional resampling is about 4.11e-4, that the MSE between nearest-neighbor and linear interpolation is about 1e-5, and that linear interpolation is slightly better than nearest-neighbor interpolation; but the method does not address the phase accumulation error and its precision is low. In 2020, an aerospace engineering university team proposed a GPU texture-cache lookup table to realize a numerically controlled oscillator (NCO) output, together with a comprehensive compensation algorithm based on removing whole cycles from the phase and on floating-point phase accumulation, which controls the accumulated error at the order of $10^{-5}$. The design idea of that method is the same as that of an FPGA-based lookup table, but it uses the parallel computing capability of the GPU and computes an error-compensation correction for every point, which greatly increases computational complexity and reduces efficiency. It segments the data, normalizes the phase value at the end of each segment to within 2π, and passes that value as a parameter to the initial phase of the next segment, so the accumulated error of the previous segment is still carried into the next segment through the phase value, and the propagation of accumulated error during phase hand-over is not thoroughly solved. Although the method limits error accumulation to some extent, the accumulated error is still large; the problem is not effectively solved and the method has certain limitations.

Although GPU-based digital down-conversion is flexible and efficient, the limited precision of floating-point numbers causes rounding errors during the down-conversion calculation on the GPU, and the accumulation of these errors over long computations leads to unpredictable results. A suitable algorithm therefore needs to be studied specifically to keep the accumulated error within a given precision and ensure the accuracy of the result.

When NCO down-conversion is implemented with a GPU lookup table, the phase information has to be preloaded into the GPU cache, which occupies a large amount of on-chip resources.

In the GPU-based data segmentation plus cycle elimination method, the data is segmented arbitrarily, the phase value at the end of the previous segment is normalized to within 2π and passed to the next segment as its initial phase, and the data is also corrected point by point. Although the end value of each segment is normalized to within 2π, the accumulated error still exists and is passed to the next segment; it propagates from segment to segment and is uncontrollable. In addition, the point-by-point correction increases computational complexity and reduces efficiency.

Therefore, a method for controlling the accumulated error in GPU-based down-conversion is still lacking.

Disclosure of Invention

In view of this, the invention provides a GPU-based method for controlling accumulated error in down-conversion. It controls the accumulated error of the GPU down-conversion process and confines it to a limited number of data points, so that error accumulation is smaller.

To achieve the above purpose, the technical scheme of the invention is as follows: the GPU performs down-conversion processing on the received signal, and during the down-conversion processing the following steps are executed:

Step 1: The GPU receives the signal sent by the host, i.e. the received signal, whose sampling frequency is $F_s$.

Step 2: The frequency resolution $\Delta F_{max}$ is determined according to the actual engineering; the amount of data processed by the GPU kernel at one time is then $N = F_s/\Delta F_{max}$. The amount of data read into the GPU cache at one time is calculated as needed: data_length = j × N, j = 1, 2, 3, ….

Step 3: The down-conversion frequency is selected according to the actual engineering resolution requirement as $F_L = m\Delta F_{max}$, where m is a positive integer.

Step 4: According to $F_L/F_s = m\Delta F_{max}/(N\Delta F_{max}) = m/N$, if m and N cannot be reduced, a phase zeroing operation is performed on the data every N points; when m and N have a common divisor i, m/N is reduced to L/K and K is selected as the number of zeroing points, i.e. the phase zeroing operation is performed on the data every K points.

Step 5: The GPU kernel function calculates the phase values and performs the phase zeroing operation on the received data points every K points, i.e. phase = $2\pi \times F_L/F_s \times \mathrm{mod}(n, K)$, where n is the index of the data point.

Step 6: The GPU judges whether the data processing is finished; if so, the processing result is output, and if not, the process returns to step 1.
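Steps 2 to 5 can be sketched as plain integer arithmetic. The function names `plan_zeroing` and `zeroed_phase` are hypothetical, frequencies are assumed to be given as integers in Hz, and the sketch assumes that the reduction in step 4 uses the greatest common divisor so that L/K is in lowest terms.

```python
from math import gcd, pi

def plan_zeroing(f_s_hz, delta_f_hz, f_l_hz):
    """Return (N, L, K) for integer frequencies given in Hz."""
    if f_s_hz % delta_f_hz or f_l_hz % delta_f_hz:
        raise ValueError("F_s and F_L must be integer multiples of the resolution")
    n_block = f_s_hz // delta_f_hz      # Step 2: N = F_s / dF_max
    m = f_l_hz // delta_f_hz            # Step 3: F_L = m * dF_max
    i = gcd(m, n_block)                 # Step 4: reduce m/N to L/K
    return n_block, m // i, n_block // i

def zeroed_phase(n, L, K):
    """Step 5: phase of sample n, using the index reduced modulo K."""
    return 2.0 * pi * (L / K) * (n % K)
```

With the figures used in the simulation section (56 MHz sampling, 5 kHz resolution, 14 MHz down-conversion frequency), `plan_zeroing(56_000_000, 5_000, 14_000_000)` gives N = 11200, L = 1 and K = 4, so the phase repeats every four samples.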

Further, the GPU performs down-conversion processing on the received signal, specifically:

Under the CUDA (Compute Unified Device Architecture) architecture of the GPU (graphics processing unit), the smallest unit in which the GPU executes operations is the thread. A number of threads form a block; the threads in one block access a shared memory, while threads in different blocks cannot access the same shared memory. A number of blocks form a grid. Threads, blocks and grids have different storage, and the thread is the computational core of the GPU.

The signal received by the GPU is $s(n) = a(n)\cos[2\pi f_0 n + \varphi_0]$, where $a(n)$ is the amplitude of the received signal, $f_0$ is the frequency of the received signal, $\varphi_0$ is the initial phase value of the received signal, and n is the index of the sampled data point.

Each sampling point of the received signal is sent to a corresponding thread in the GPU for down-conversion processing.

Further, the GPU employs single precision floating point arithmetic.

Furthermore, the amount of data processed by the GPU kernel at one time is $N = F_s/\Delta F_{max}$, that is, the amount of data N processed by the GPU kernel at one time is inversely proportional to the frequency resolution.

Furthermore, the amount of data read into the GPU cache at one time is calculated as needed as data_length = j × N, j = 1, 2, 3, …; that is, the data length read into the GPU cache each time is j times the number of data points N, where j is a positive integer.

Advantageous effects:

1. The invention provides a cyclic return-to-zero method that effectively controls the error accumulated by floating-point operations in GPU-based down-conversion. The method analyzes the frequency resolution required by the engineering, the actual sampling frequency and the down-conversion frequency, and from the relation among the three, $F_L/F_s = m\Delta F_{max}/(N\Delta F_{max})$, finds that the accumulated error can be strictly confined within K data points: the phase value is zeroed every K points, which prevents the accumulated error from propagating. The method is simple and easy to operate, efficient to execute, strict in accumulated-error control, and meets actual engineering requirements. Compared with traditional ASIC/FPGA-based digital down-conversion, GPU-based down-conversion is highly flexible, has a short algorithm debugging and development cycle, and offers higher precision, higher reliability and lower cost. Compared with GPU-based digital NCO methods, the method does not use a lookup table to compute phase values, which saves scarce on-chip memory and gives higher resource utilization, and it confines the accumulated error to a limited number of data points, so error accumulation is smaller. Based on the relation between the local down-conversion frequency and the sampling frequency, the invention designs a GPU-based method for controlling error accumulation in the down-conversion computation; it meets the accuracy requirement while keeping the down-conversion frequency flexible and variable, and calculation results show that the accumulated error can be held at the order of 1e-8.

2. The invention implements digital down-conversion on the GPU, which makes full use of the GPU's highly parallel processing capability and the flexibility of CUDA programming, allows resources to be configured flexibly on demand, and overcomes the shortcomings of the prior art. Compared with lookup-table-based implementations it saves scarce on-chip resources; compared with down-conversion on ASIC/FPGA hardware it is easier to develop, can be reconfigured flexibly, is easy to upgrade and extend, and has lower later maintenance and upgrade costs.

Drawings

FIG. 1 is a schematic diagram of an exemplary digital down conversion process;

FIG. 2 is a schematic diagram of a GPU-based digital down-conversion process according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the 100ms data phase accumulated error in accordance with one embodiment of the present invention;

FIG. 4 is a schematic diagram of an accumulated error of 100ms data amplitude in an embodiment of the present invention;

FIG. 5 is a flowchart of a method for controlling accumulated error of single-precision floating point numbers in down-conversion based on a GPU according to an embodiment of the present invention;

FIG. 6 is a flowchart of a method for controlling accumulated error of single-precision floating point numbers in down-conversion based on a GPU according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of the phase accumulated error after the zeroing process of the 5.6K data according to the embodiment of the present invention;

FIG. 8 is a schematic diagram of an amplitude accumulated error after zeroing processing of 5.6K data according to an embodiment of the present invention;

FIG. 9 is a schematic diagram of an amplitude accumulated error of 100ms after optimization according to an embodiment of the present invention;

FIG. 10 is a diagram illustrating the 100ms phase accumulated error after optimization according to an embodiment of the present invention.

Detailed Description

The invention is described in detail below by way of example with reference to the accompanying drawings.

The invention provides a GPU-based method for controlling error accumulation in the down-conversion process. In GPU-based down-conversion, the local down-conversion signal has to be multiplied point by point with the received signal; the multi-core, multi-thread architecture of the GPU allows the down-conversion data to be processed massively in parallel, which improves real-time performance and meets the real-time requirements of modern aerospace measurement and control and deep space interferometry. However, because of the limited precision of GPU floating-point numbers, once the time has accumulated to a certain extent the error accumulation becomes severe and the result deviates greatly. Based on the relation between the local down-conversion frequency and the sampling frequency, the invention designs a GPU-based method for controlling error accumulation in the down-conversion computation; it meets the accuracy requirement while keeping the down-conversion frequency flexible and variable, and calculation results show that the accumulated error can be held at the order of 1e-8. Compared with the GPU lookup-table method, which controls the high-precision phase error at 1e-6, this is an improvement of two orders of magnitude.

Implementing digital down-conversion on the GPU makes full use of the GPU's highly parallel processing capability and the flexibility of CUDA programming, allows resources to be configured flexibly on demand, and overcomes the shortcomings of the prior art. Compared with lookup-table-based implementations it saves scarce on-chip resources; compared with down-conversion on ASIC/FPGA hardware it is easier to develop, can be reconfigured flexibly, is easy to upgrade and extend, and has lower later maintenance and upgrade costs.

GPU-based digital down conversion

A GPU (graphics processing unit) is a microprocessor dedicated to image operations on personal computers, workstations and similar devices. Under the CUDA architecture, the smallest unit in which the GPU executes operations is the thread. A number of threads form a block, and the threads in one block can access a shared memory and synchronize quickly. Threads in different blocks cannot access the same shared memory and therefore cannot communicate or synchronize directly. A number of blocks form a grid. Threads, blocks and grids have different storage, and the thread is the computational core of the GPU.

As can be seen from equation (3), the received signal is $s(n) = a(n)\cos[2\pi f_0 n + \varphi_0]$, where $a(n)$ is the amplitude of the received signal, $f_0$ is the frequency of the received signal, and $\varphi_0$ is the initial phase value of the received signal. Unlike the traditional serial hardware implementation, GPU-based digital down-conversion makes full use of multi-core multithreading under the CUDA model and implements the down-conversion in parallel. The process is shown in Fig. 2.

For ease of analysis, a one-dimensional grid and a one-dimensional block are used in the figure. Each sampling point of the received signal is sent to a corresponding thread in the GPU for processing. Taking the down-conversion calculation of deep space interferometry as an example, the input is a signal acquired by actual equipment, with an intermediate frequency of 70 MHz, a sampling frequency of 56 MHz, a code rate of 1 Msps, a local carrier of 14 MHz and a data length of 100 ms. The difference between the down-converted signal computed by the GPU and the true value is shown in Figs. 3 and 4. The figures show that, because the GPU uses single-precision floating-point arithmetic, as time increases the value of n grows when computing the phase $\varphi = 2\pi f_L n$ and the amplitude ddc_signal(n) = $s(n)\cos(2\pi f_L n)$, and the accumulated errors of phase and amplitude grow with it because of the limited precision of single-precision floating-point data.
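The growth of the phase error with n can be reproduced on a CPU by rounding every intermediate result to single precision. The sketch below is an emulation under stated assumptions, not the measurement behind Figs. 3 and 4: the helper `f32` imitates float32 rounding with `struct`, the order of operations is one plausible choice, and a real GPU kernel may evaluate the expression differently. The defaults L = 1 and K = 4 correspond to the 14 MHz / 56 MHz ratio of the example.

```python
import math
import struct

def f32(x):
    """Round a Python float (double) to the nearest IEEE 754 single."""
    return struct.unpack("f", struct.pack("f", x))[0]

def naive_phase_error(n, L=1, K=4):
    """|error| in radians of the float32 phase 2*pi*(L/K)*n, compared modulo 2*pi."""
    step = f32(f32(2.0 * math.pi) * f32(L / K))   # 2*pi*F_L/F_s in float32
    phase32 = f32(step * f32(float(n)))           # grows without bound with n
    exact = 2.0 * math.pi * ((L * n) % K) / K     # exact phase reduced to [0, 2*pi)
    return abs(math.remainder(phase32 - exact, 2.0 * math.pi))

def worst_error(n0, span=16):
    """Largest error over `span` consecutive sample indices starting at n0."""
    return max(naive_phase_error(n) for n in range(n0, n0 + span))
```

Calling `worst_error` near n = 10 and near n = 5,000,000 (about 89 ms of data at 56 MHz) shows the error rising from the microradian range to a sizeable fraction of a radian in this emulation.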

Error analysis

The accumulated error keeps growing because the number of bits used to store a floating-point number in computer memory is limited. Following the IEEE 754 floating-point representation, the storage layout of a single-precision floating-point number (float) in memory is shown in Table 1.

TABLE 1

Bit	31	30	29-23	22-0
Field	Sign bit of real number	Sign bit of exponent	Exponent bits	Significand bits

In IEEE 754, a sign bit of 0 represents a positive real number and 1 a negative one; the exponent is stored in biased form, so the bit listed as the sign bit of the exponent in Table 1 is the most significant bit of the 8-bit biased exponent. The significand has 24 bits of precision: the 23 stored bits plus one implicit leading bit. Converted to decimal, a single-precision floating-point number has 6 to 7 significant digits. Therefore, as the amount of data keeps growing and n becomes larger, the result is limited by the significand of the single-precision floating-point number and the accumulated error grows larger and larger.
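The effect of the 24-bit significand can be made concrete by measuring the gap between adjacent single-precision values at different magnitudes. The helper name `ulp32` is hypothetical; it reinterprets the float32 bit pattern as an integer, which is a standard trick for finite positive values.

```python
import struct

def ulp32(x):
    """Gap between float32(x) and the next larger float32, for finite x > 0."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    cur = struct.unpack("<f", struct.pack("<I", bits))[0]
    nxt = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return nxt - cur
```

Near 1.0 the gap is 2^-23, about 1.2e-7. Near 8.8e6, which is roughly the unreduced phase 2π·(1/4)·n in radians after 5.6e6 samples (100 ms at 56 MHz), the gap is a full 1.0, so the phase can no longer be represented to better than about a radian.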

GPU-based method for realizing accumulated error control in down-conversion

As the above analysis shows, the accumulated error arises because the data index n keeps increasing, so the phase value $\varphi = 2\pi f_L n$ keeps growing. Since floating-point precision is limited, the rounding errors produced by exponent alignment and normalization during floating-point operations are accumulated and amplified. A dedicated method is needed to keep the accumulated error within an acceptable range.

Accumulated error control method based on constraining the data length

From the above analysis, error accumulation can be controlled effectively by keeping the amount of data n processed at one time within a certain length. Without loss of generality, the frequency resolution $\Delta F_{max}$ is first obtained from the actual engineering needs or index requirements. From the relation between resolution and number of data points, the number of points to process is $N = F_s/\Delta F_{max}$; that is, the number of processing points N is inversely proportional to the frequency resolution: the larger N, the finer the frequency resolution (the smaller $\Delta F_{max}$), and vice versa. When the down-conversion frequency is an integer multiple of the frequency resolution, i.e. $F_L = m\Delta F_{max}$, an exact down-conversion frequency can be obtained.

According to the determined value of N, the amount of data read into the GPU cache each time can be calculated as data_length = j × N; that is, the data length read into the GPU cache each time is j times the number of data points N, where j is a positive integer.

Accumulated error control method based on phase zeroing

With the data length N calculated, $F_L/F_s = m\Delta F_{max}/(N\Delta F_{max}) = m/N$. In the limiting case where m and N cannot be reduced, the phase should be zeroed every N points (when m and N have a common divisor, the number of zeroing points can be smaller than N). That is, with the operation mod(n, N), the accumulated error is strictly confined within N points, and because the phase of the second group of data starts from 0 again, the propagation of accumulated error is eliminated. When m and N have a common divisor i, $m/N = (L \cdot i)/(K \cdot i) = L/K$, so the number of zeroing points can be smaller and the accumulated error is strictly confined within K points. The process of controlling the accumulated error in GPU-based down-conversion is illustrated in Figs. 5 and 6.

The down-conversion frequency is determined by analyzing the resolution required by the engineering, which in turn determines the value of K for the return-to-zero operation. With this step, the phase is strictly limited to the K values $0, L \cdot 2\pi/K, \ldots, (K-1) \cdot L \cdot 2\pi/K$, which guarantees the phase precision. Only one extra modulo operation is needed in the calculation, so the real-time performance is hardly affected. The phase is thus computed with high precision and high efficiency while real-time performance is preserved.
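The benefit of the modulo operation can be checked with a small single-precision emulation. The sketch below compares the phase computed from the raw index n with the phase computed from mod(n, K). It is illustrative only: `f32` imitates float32 rounding on the CPU, the function name `phase_errors` is hypothetical, and the ratio L/K = 3/16 used in the usage note corresponds to an assumed 10.5 MHz down-conversion frequency at 56 MHz sampling, which is not a case taken from the patent.

```python
import math
import struct

def f32(x):
    """Round a Python float (double) to the nearest IEEE 754 single."""
    return struct.unpack("f", struct.pack("f", x))[0]

def phase_errors(n, L, K):
    """Return (naive_error, zeroed_error) in radians, both compared modulo 2*pi."""
    two_pi = 2.0 * math.pi
    step = f32(f32(two_pi) * f32(L / K))          # 2*pi*F_L/F_s in float32
    naive = f32(step * f32(float(n)))             # uses the raw index n
    zeroed = f32(step * f32(float(n % K)))        # uses mod(n, K): at most K values
    exact = two_pi * ((L * n) % K) / K            # exact phase reduced to [0, 2*pi)
    return (abs(math.remainder(naive - exact, two_pi)),
            abs(math.remainder(zeroed - exact, two_pi)))
```

For L = 3 and K = 16, the zeroed error at n = 21 and at n = 5,000,005 is identical, because both indices reduce to 5 modulo 16; the naive error at the larger index depends on n and can reach a sizeable fraction of a radian in this emulation.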

Simulation verification

Taking deep space interferometry down-conversion as an example, the bandwidth in deep space interferometry is a multiple of 0.5 MHz with a minimum of 0.5 MHz, and in actual engineering the frequency difference requirement is met if the frequency difference does not exceed 1% of the bandwidth, i.e. 5 kHz. With this index and a sampling frequency of 56 MHz, the equation $\Delta F = F_s/(2N)$ gives N = 56 MHz/(2 × 5 kHz) = 5.6K. That is, when the down-conversion frequency is an integer multiple of 5 kHz, the actual engineering requirement is met as long as the data block computed at one time does not exceed 5.6K samples.

Following the process of Fig. 6, N = 5.6K is selected for the phase zeroing process; the results are shown in Figs. 7 and 8. As Figs. 7 and 8 show, after the processing of the invention, in the worst case the accumulated amplitude error is strictly controlled within $10^{-7}$ and the phase error within $10^{-4}$, and the accumulated error does not increase over time.

In actual engineering, the down-conversion frequency is chosen as 14 MHz, i.e. 2800 times 5 kHz, which satisfies the frequency resolution.

Meanwhile, m/N = 14 MHz/56 MHz = (2800 × 5 kHz)/(11200 × 5 kHz) = 1/4. The phase is therefore zeroed every four points and is restricted to the four values [0, π/2, π, 3π/2], with no error accumulation. The simulation results are shown in Figs. 9 and 10. As Figs. 9 and 10 show, by analyzing the specific relation among the sampling frequency, the down-conversion frequency and the frequency resolution, the phase value can be zeroed cyclically at a fixed interval, which prevents the accumulated error from propagating, confines it to a limited number of data points, and greatly improves the calculation precision. The present scheme controls the error within $10^{-8}$.

In practical applications, the down-conversion frequency and the carrier frequency usually differ by no more than one order of magnitude, so the method can confine the accumulated error within a few points and prevent its propagation while maintaining precision.

In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A GPU-based method for controlling accumulated error in down-conversion, characterized in that the GPU performs down-conversion processing on a received signal, and during the down-conversion processing the following steps are executed:

Step 1: the GPU receives the signal sent by the host, i.e. the received signal, whose sampling frequency is $F_s$;

Step 2: the frequency resolution $\Delta F_{max}$ is determined according to the actual engineering, and the amount of data processed by the GPU kernel at one time is then $N = F_s/\Delta F_{max}$; the amount of data read into the GPU cache at one time is calculated as needed as data_length = j × N, j = 1, 2, 3, …;

Step 3: the down-conversion frequency is selected according to the actual engineering resolution requirement as $F_L = m\Delta F_{max}$, where m is a positive integer;

Step 4: according to $F_L/F_s = m\Delta F_{max}/(N\Delta F_{max})$, if m and N cannot be reduced, a phase zeroing operation is performed on the data every N points; when m and N have a common divisor i, m/N is reduced to L/K and K is selected as the number of zeroing points, i.e. the phase zeroing operation is performed on the data every K points;

Step 5: the GPU kernel function calculates the phase values and performs the phase zeroing operation on the received data points every K points, i.e. phase = $2\pi \times F_L/F_s \times \mathrm{mod}(n, K)$;

Step 6: the GPU judges whether the data processing is finished; if so, the processing result is output, otherwise the process returns to step 1;

the GPU performs down-conversion processing on the received signal, specifically:

under the CUDA (Compute Unified Device Architecture) architecture of the GPU (graphics processing unit), the smallest unit in which the GPU executes operations is the thread; a number of threads form a block; the threads in one block access a shared memory, while threads in different blocks cannot access the same shared memory; a number of blocks form a grid; threads, blocks and grids have different storage, and the thread is the computational core of the GPU;

the signal received by the GPU is $s(n) = a(n)\cos[2\pi f_0 n + \varphi_0]$, where $a(n)$ is the amplitude of the received signal, $f_0$ is the frequency of the received signal, $\varphi_0$ is the initial phase value of the received signal, and n is the index of the sampled data point;

2. The method of claim 1, wherein the GPU employs single precision floating point operations.

3. The method according to any one of claims 1-2, wherein the amount of data processed by the GPU kernel at one time is $N = F_s/\Delta F_{max}$, that is, the amount of data N processed by the GPU kernel at one time is inversely proportional to the frequency resolution.

4. The method according to any one of claims 1 to 2, wherein the amount of data read into the GPU cache at one time is calculated as needed as data_length = j × N, j = 1, 2, 3, …, that is, the data length read into the GPU cache each time is j times the number of data points N, where j is a positive integer.