CN112986944A - CUDA heterogeneous parallel acceleration-based radar MTI and MTD implementation method - Google Patents


Info

Publication number
CN112986944A
Authority
CN
China
Prior art keywords: mtd, gpu, mti, cuda, matrix
Prior art date
Legal status (assumed, not a legal conclusion): Granted
Application number
CN202110238579.5A
Other languages: Chinese (zh)
Other versions: CN112986944B (en)
Inventor
贾宗衡
孙子棠
Current Assignee: Xidian University
Original Assignee: Xidian University
Priority date (assumed, not a legal conclusion)
Filing date
Publication date
Application filed by Xidian University
Priority application: CN202110238579.5A
Publication of CN112986944A
Application granted; publication of CN112986944B
Legal status: Active

Classifications

    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01S: RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S7/00: Details of systems according to groups G01S13/00, G01S15/00, G01S17/00
    • G01S7/02: Details of systems according to group G01S13/00
    • G01S7/41: using analysis of echo signal for target characterisation; Target signature; Target cross-section
    • G01S7/414: Discriminating targets with respect to background clutter
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Radar Systems Or Details Thereof (AREA)

Abstract

The invention relates to the technical field of radar signal processing and provides a radar MTI and MTD implementation method based on CUDA heterogeneous parallel acceleration. The method comprises the following steps: set the radar signal-processing parameter values in the CPU and copy the pulse-compressed echo matrix data into GPU device memory; partition the kernel's thread organization and execute a second-order canceller MTI kernel on the GPU; execute a matrix-transpose kernel on the GPU, complete the parallel FFT of multiple groups of Doppler channels with the cuFFT library, execute the transpose kernel again to obtain the output of the MTD parallel algorithm, and transfer the result back to the CPU; optimize the MTI and MTD kernels with CUDA code-optimization strategies and plot the optimized speedup curve. The optimized parallel algorithm reaches a speedup of 142.66 times, readily meets the real-time requirements of radar signal processing, and, being developed on the CUDA software stack and the Visual Studio platform, is convenient to extend and port.

Description

CUDA heterogeneous parallel acceleration-based radar MTI and MTD implementation method
Technical Field
The invention belongs to the technical field of radar signal processing, and particularly relates to a radar MTI and MTD implementation method based on CUDA heterogeneous parallel acceleration, which uses GPU parallel computing power and the CUDA heterogeneous programming model to guarantee the real-time performance of the MTI and MTD algorithms when the radar processes large volumes of echo data, and which is easy to port across platforms.
Background
During signal processing, radar achieves clutter suppression by means of moving target indication (MTI) and moving target detection (MTD). MTI processing exploits the fact that, relative to the radar's objects of interest, clutter exhibits a smaller Doppler frequency in the frequency domain: a digital canceller cancels each range cell one by one, filtering out stationary clutter and improving the signal-to-clutter ratio. However, MTI cannot obtain the Doppler frequency of a moving object in advance, so MTD processing is needed to suppress clutter outside the echo band. MTD is typically implemented by cascading, after the MTI filter, a bank of adjacent narrow-band Doppler filters matched to the coherent echo burst. As the modern battlefield electromagnetic environment grows increasingly complex and echo data volumes keep increasing, serial processing on the CPU becomes very time-consuming and struggles to meet the real-time requirements of radar signal processing.
The GPU, the core component of the graphics card, has a hardware architecture with a high degree of parallelism and outperforms the CPU at parallel computation. CUDA (Compute Unified Device Architecture) is a general-purpose parallel computing platform introduced by NVIDIA; it supports heterogeneous cooperation between CPU and GPU, and its programming model combines the logic control at which the CPU excels with the parallel computation at which the GPU excels. Some research has already been directed at MTI and MTD algorithms on the CUDA platform.
A master's thesis from the University of Electronic Science and Technology of China, "Warning radar signal processing software design based on GPU", proposes an MTI algorithm implemented on the GPU. The method is theoretically related to the MTI parallelization in the present invention, and its main steps are: first, copy the pulse-compressed echo data from the CPU to the GPU in first-in-first-out storage order; second, use a first-order canceller at the GPU end to perform two-pulse cancellation within a pulse repetition period; third, return the MTI-processed result to the CPU for scheduling. The method successfully reduces the time consumed by MTI processing, but its drawbacks are that the designed MTI filter has a narrow stopband notch and a poor clutter-suppression effect, and that the lack of unified data precision causes large estimation errors.
A patent filed by the Institute of Electronics, Chinese Academy of Sciences, "A GPU-based fast implementation method for external-radiation-source radar signal processing" (application number CN201310176310.4; publication number CN103308897B), discloses a GPU-based MTD algorithm. Its main steps are: first, cross-reorganize the echo data to be processed, dividing the whole echo sequence into N equal-length data blocks and subdividing each block into L equal-length segments of M data points each; second, splice together, in order, the segments with the same segment number from different blocks, joining the tail of the i-th segment of the N-th block to the head of the (i+1)-th segment of the 1st block to form a new storage structure; third, copy the echo data in the new storage structure to the GPU and launch M × N threads; fourth, perform MTD processing on the GPU for each group of M × N spliced data points. The method effectively raises the degree of parallelism of the MTD algorithm, but it has shortcomings: for example, optimization of thread allocation and latency hiding are not considered.
Disclosure of Invention
The invention provides a radar MTI and MTD implementation method based on CUDA heterogeneous parallel acceleration, built on GPU hardware, the CUDA software stack, and its programming model; the process also includes the optimized design of code instructions and thread structures, so the radar signal-processing speed can be greatly improved.
The technical idea of the invention is to combine radar signal-processing algorithms with GPU parallel processing on the CUDA acceleration platform, adopting a CPU + GPU heterogeneous programming mode to realize efficient MTI and MTD parallel algorithms; the whole consists of a Host-side program executed by the CPU and a Device-side program executed by the GPU. The Host side is responsible for logic control and data management, specifically: setting simulation parameters, configuring the GPU thread hierarchy, allocating and releasing storage, reading radar echo data, copying data to the GPU, and invoking kernels. The Device side is responsible for the concrete execution of the kernels and CUDA library functions corresponding to the MTI and MTD parallel algorithms.
In order to achieve the purpose, the invention adopts the following technical scheme:
the method for realizing the radar MTI and MTD based on CUDA heterogeneous parallel acceleration comprises the following steps:
Step 1: set the radar signal-processing parameter values in the CPU, read the N_r × N_c echo data matrix X after pulse compression, and copy X into pre-allocated GPU device memory as the initial data for MTI processing;
Step 2: use a 2-dimensional thread index to assign the grid (Grid) and thread block (Block) sizes of the CUDA threads, execute the second-order canceller MTI kernel on the GPU, and output the echo data with stationary clutter and noise filtered out, together with the range cells in which moving targets are located;
Step 3: for the N_r × N_c result matrix X_MTI obtained in step 2, first execute a matrix-transpose kernel on the GPU, then call the cufftExecC2C function of the cuFFT library to complete the parallel FFT of multiple groups of Doppler channels, and finally execute the transpose kernel again to obtain the N_r × N_c matrix X_MTD output by the MTD parallel algorithm, which is copied from the GPU back to the CPU;
Step 4: optimize the MTI and MTD kernels implemented in steps 2 and 3 with strategies including code-instruction optimization, optimal thread allocation, and aligned and coalesced global memory access, and compute the speedup of the optimized CUDA heterogeneous parallel algorithm over the CPU serial algorithm.
In this CUDA heterogeneous parallel-accelerated radar MTI and MTD implementation method, the target echo data simultaneously carries range information (from time delay) and velocity information (from Doppler shift). First, the initial data matrix is stored in newly allocated video memory on the GPU device, and the CUDA thread model is laid out with a two-dimensional index; then the parallel MTI kernel, implemented on the second-order canceller principle, is executed on the GPU. Before MTD processing, the output matrix of the previous stage is transposed so that the Doppler-dimension data addresses are contiguous; the parallel FFT of the Doppler dimension is then completed on the GPU; finally, one more matrix transposition restores the expected target echo data.
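As a host-side illustration, the MTI then transpose/FFT/transpose data flow described above can be sketched in NumPy. This is a hypothetical sketch of the mathematics only, not the patent's CUDA kernels; the (pulses × range cells) matrix layout and function names are assumptions:

```python
import numpy as np

def mti_three_pulse(x):
    """Second-order (three-pulse) canceller along the pulse dimension:
    y[n] = x[n+2] - 2*x[n+1] + x[n] for an (Nr, Nc) pulse-by-range matrix."""
    return x[2:, :] - 2.0 * x[1:-1, :] + x[:-2, :]

def mtd(x_mti, n_fft=32):
    """MTD: batched FFT across pulses for every range cell. The transposes
    mirror the GPU transpose kernels that make the Doppler dimension
    contiguous before the batched cuFFT call."""
    xt = np.ascontiguousarray(x_mti.T)          # (Nc, Npulses), Doppler contiguous
    doppler = np.fft.fft(xt, n=n_fft, axis=1)   # one Doppler filter bank per range cell
    return doppler.T                            # back to (n_fft, Nc)

# Toy echo: stationary clutter (constant over pulses) in all range cells,
# plus a moving target in cell 3 with normalized Doppler fd*Tr = 0.25.
pulses, cells = 34, 8
n = np.arange(pulses)
x = np.ones((pulses, cells), dtype=complex)
x[:, 3] += np.exp(1j * 2 * np.pi * 0.25 * n)

y = mti_three_pulse(x)         # the stationary component cancels exactly
z = np.abs(mtd(y, n_fft=32))   # the moving target peaks in Doppler channel 8
```

In this toy run the constant (zero-Doppler) component is removed identically by the canceller, and the surviving moving-target energy concentrates in the Doppler channel matching fd*Tr = 0.25, i.e. channel 8 of 32.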
Compared with the prior art, the invention has the following advantages: first, all kernels are optimized according to CUDA optimization strategies, which fully raises the computation speed of the signal processing; second, balancing arithmetic precision against acceleration, the method uses single-precision floating-point numbers for functions of low arithmetic intensity, which suits the Turing-architecture GPU better and gives higher overall cost-effectiveness; third, the invention is developed on the CUDA software stack and the Visual Studio platform, is software-based and modular, and is convenient to extend and port.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a method for implementing a CUDA heterogeneous parallel computing radar MTI and MTD provided by the invention;
FIG. 2 is a schematic diagram of a secondary canceller implementing the MTI algorithm according to the present invention;
FIG. 3 is a schematic diagram of a narrow-band Doppler filter bank structure for implementing the MTD algorithm provided by the present invention;
FIG. 4 is a simulation diagram of the result verification of executing only MTD kernel in the GPU provided by the present invention;
FIG. 5 is a simulation diagram of result verification of executing an MTI and an MTD kernel function in sequence in the GPU provided by the present invention;
FIG. 6 is an acceleration ratio curve of the optimized CUDA heterogeneous parallel algorithm and the CPU serial algorithm provided by the invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The flow chart of the CUDA heterogeneous parallel MTI and MTD algorithm provided by the embodiment of the invention is shown in FIG. 1. Specifically, the method comprises the following steps:
Step 1: set the radar signal-processing parameter values in the CPU, read the N_r × N_c echo data matrix X after pulse compression, and copy X into pre-allocated GPU device memory as the initial data for MTI processing.
Specifically, step 1 includes the following 2 sub-steps:
Step 1.1: set the transmit-signal parameters in the CPU, obtain the N_r × N_c echo matrix X after pulse compression, and allocate GPU device memory with the cudaMalloc function.
Step 1.2: copy each pulse-compressed echo datum from the CPU to the GPU with the cudaMemcpy function and the cudaMemcpyHostToDevice parameter; each CUDA thread stores the value of the current sampling point and the value of that sampling point after it passes through the delay line.
Step 2: use a 2-dimensional thread index to assign the grid (Grid) and thread block (Block) sizes of the CUDA threads, execute the second-order canceller MTI kernel on the GPU, and output the echo data with stationary clutter and noise filtered out, together with the range cells in which moving targets are located. Fig. 2 is a schematic diagram of the second-order canceller used for MTI processing, where the output signal Y(t) equals the convolution of the impulse response H(t) with the input X(t):

Y(t) = H(t) * X(t) = X(t) − 2X(t − T_r) + X(t − 2T_r)

with transfer function

H(z) = (1 − z⁻¹)² = 1 − 2z⁻¹ + z⁻²
Specifically, step 2 includes the following 3 sub-steps:
and 2.1, dividing the Grid (Grid) and the thread Block (Block) sizes of the thread organization according to the length of the echo data copied to the GPU, wherein each thread on the GridDim.x dimension is responsible for finishing two subtraction operations of a group of three-pulse cancellation.
And 2.2, executing a secondary canceller MTI kernel function, and finishing two times of subtraction operations of sampling points of the same distance resolution unit in a pulse repetition period in the GPU by utilizing the thread index value.
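The stopband behaviour implied by these two subtractions can be checked numerically from the transfer function H(z) = (1 − z⁻¹)²: the magnitude response is 4·sin²(ω/2), with exact nulls at zero Doppler and at integer multiples of the PRF (the blind speeds). A NumPy sketch, not part of the patent:

```python
import numpy as np

# Frequency response of the second-order canceller H(z) = 1 - 2 z^-1 + z^-2,
# evaluated over one PRF interval of normalized Doppler frequency f*Tr.
f = np.linspace(0.0, 1.0, 1001)
w = 2 * np.pi * f
H = 1 - 2 * np.exp(-1j * w) + np.exp(-2j * w)
mag = np.abs(H)

# Closed form: |H(e^{jw})| = 4 sin^2(w/2). The null at f = 0 removes
# stationary clutter; the null at f*Tr = 1 is the first blind speed.
assert np.allclose(mag, 4 * np.sin(w / 2) ** 2)
```

The deep null at zero Doppler is what cancels the stationary returns; the repeat nulls at multiples of the PRF are the blind speeds inherent to any pulse canceller.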
Step 3: for the N_r × N_c result matrix X_MTI obtained in step 2, first execute a matrix-transpose kernel on the GPU, then call the cufftExecC2C function of the cuFFT library to complete the parallel FFT of multiple groups of Doppler channels, and finally execute the transpose kernel again to obtain the N_r × N_c matrix X_MTD output by the MTD parallel algorithm, which is copied from the GPU back to the CPU. Fig. 3 is a schematic diagram of the MTD filter bank, formed by cascading an FFT-based narrow-band Doppler filter bank after the MTI; the amplitude-frequency response of the k-th filter is

|H_k(f)| = |sin[Nπ(f·T_r − k/N)] / sin[π(f·T_r − k/N)]|

where N is the number of target echo pulses, k indexes the k-th filter, and T_r is the pulse repetition period. Each range cell, together with N − 1 delay elements of T_r, covers the whole Doppler band.
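Under the standard FFT filter-bank interpretation of MTD, the k-th filter is the geometric sum H_k(f) = Σₙ e^{j2πn(f·T_r − k/N)}, whose magnitude reduces to the sin-ratio form |sin(Nπx)/sin(πx)| with x = f·T_r − k/N. A numerical check of that identity (an illustration, not the patent's code):

```python
import numpy as np

def doppler_filter_mag(f_tr, k, N):
    """Magnitude response of the k-th N-point FFT Doppler filter at
    normalized Doppler frequency f*Tr, by direct summation."""
    n = np.arange(N)
    return np.abs(np.sum(np.exp(1j * 2 * np.pi * n * (f_tr - k / N))))

def sin_ratio_mag(f_tr, k, N):
    """Closed form |sin(N*pi*x) / sin(pi*x)| with x = f*Tr - k/N."""
    x = f_tr - k / N
    return abs(np.sin(N * np.pi * x) / np.sin(np.pi * x))

N, k = 32, 5
for f_tr in (0.01, 0.07, 0.21, 0.4):   # points where sin(pi*x) != 0
    assert np.isclose(doppler_filter_mag(f_tr, k, N), sin_ratio_mag(f_tr, k, N))
```

At the filter's center frequency f·T_r = k/N the sum reaches its peak value N, which is the coherent integration gain of the N-pulse burst.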
Specifically, step 3 includes the following 5 sub-steps:
and 3.1, dividing the sizes of grids (Grid) and thread blocks (Block) of the thread organization, processing distance dimensional data of a plurality of channels by using a Grid dim.x dimension, and processing Doppler dimensional data of the plurality of channels by using a Grid dim.y dimension.
Step 3.2: for the matrix data produced by the second-order cancellation of the MTI kernel, configure the kernel so that the data of each Doppler channel is mapped into one thread block, and execute the matrix-transpose kernel on the GPU so that the range-dimension and Doppler-dimension data addresses become contiguous.
Step 3.3: create a cuFFT handle, call the CUDA library function cufftPlan2d to configure a 2-dimensional cuFFT plan, and execute the complex-to-complex parallel FFT along the Doppler dimension with the library function cufftExecC2C and the CUFFT_FORWARD parameter.
Step 3.4: execute the matrix-transpose kernel once more on the matrix obtained from the parallel FFT, yielding the N_r × N_c matrix X_MTD output by the MTD parallel algorithm. The Doppler shift f_d is obtained from the Doppler channel in which the moving target falls; the radial velocity and the velocity resolution are given by

v = c·f_d / (2·f_c)

Δv = c·Δf_d / (2·f_c) = c·f_r / (2·f_c·mtd_FFT)

where c is the speed of light, f_c the carrier frequency, Δf_d the Doppler resolution, f_r the pulse repetition frequency, and mtd_FFT the number of FFT points selected for MTD.
Step 3.5: copy the MTD-processed target echo data from the GPU to the CPU with the cudaMemcpy function and the cudaMemcpyDeviceToHost parameter, call the cufftDestroy function to destroy the cuFFT handle, and call the free and cudaFree functions to release the memory resources occupied by the CPU and the GPU respectively.
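The Doppler-channel-to-velocity mapping used in step 3.4 can be sketched with the parameter values given later in the simulation section (f_c = 10 GHz, T_r = 100 μs, 32-point FFT); the helper name is illustrative, not from the patent:

```python
C = 3.0e8   # speed of light, m/s

def channel_to_velocity(k, f_r, f_c, n_fft):
    """Doppler channel index k -> Doppler shift f_d = k * f_r / n_fft,
    then radial velocity v = c * f_d / (2 * f_c)."""
    f_d = k * f_r / n_fft
    return C * f_d / (2.0 * f_c)

f_r = 1.0 / 100e-6   # PRF = 10 kHz for Tr = 100 us
f_c = 10e9           # carrier frequency, Hz
n_fft = 32

v_step = channel_to_velocity(1, f_r, f_c, n_fft)   # velocity per Doppler channel
```

With these numbers one channel spans about 4.69 m/s, matching the roughly 4.7 m/s velocity resolution quoted in the result analysis, and channel 16 maps to 75 m/s, the speed of moving target 3 in the simulation.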
Step 4: optimize the MTI and MTD kernels implemented in steps 2 and 3 with strategies including code-instruction optimization, optimal thread allocation, and aligned and coalesced global memory access, and compute the speedup of the optimized CUDA heterogeneous parallel algorithm over the CPU serial algorithm.
Specifically, step 4 includes the following 4 CUDA optimization strategies:
(1) Code-instruction optimization. Arithmetic operators in the kernels are replaced with bit operators: the left-shift operator << replaces multiplication by powers of 2, and a modulo-2^n computation is replaced by a bitwise AND with (2^n − 1). In addition, the suffix 'f' is appended to all float-type literals, eliminating the unnecessary time cost of hidden double-to-float type conversions.
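Both substitutions are exact identities for the non-negative integers used in kernel index arithmetic; they are shown here in Python for brevity (on the GPU they trade integer multiply/modulo instructions for cheaper bit operations):

```python
# '<< k' multiplies by 2**k; '& (2**n - 1)' reduces modulo 2**n.
for x in range(0, 4096, 7):
    assert (x << 1) == x * 2            # left shift replaces multiply-by-2
    assert (x << 3) == x * 8            # left shift replaces multiply-by-8
    assert (x & (2**5 - 1)) == x % 32   # bitwise AND replaces modulo 2^5
```

Note that the modulo identity holds only for non-negative operands and power-of-two divisors, which is why it applies to thread and array indices.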
(2) Optimal thread allocation. The number of threads launched per thread block is configured to be an integer multiple of 32 and no more than 1024. For int-type data occupying 4 bytes, 256 threads are always placed in one thread block; for double-type data occupying 8 bytes, 128 threads are always placed in one thread block, which lets the execution units be reused more effectively and improves the efficiency of the CUDA instruction pipeline.
(3) Aligned global memory access. The head address of a global memory transaction on the GPU device is an integer multiple of the cache granularity, with accesses served as 32-byte L2 cache transactions or 128-byte L1 cache transactions; keeping global memory accesses aligned saves part of the bandwidth.
(4) Coalesced global memory access. Warps are made to start from aligned memory addresses, and all 32 threads in each warp access one contiguous memory block, so every datum transferred is needed by the warp; the coalescing degree of the accesses is 100%, which helps maximize memory throughput.
This completes the radar MTI and MTD implementation method based on CUDA heterogeneous parallel acceleration provided by the invention.
The effect of the invention is further explained below with simulation experiments.
1. Simulation conditions are as follows:
In the simulation experiments, the computer hardware and software environment is configured as follows: the GPU device is an NVIDIA GeForce GTX 1660 Ti graphics card with 6 GB of video memory and 1536 CUDA cores; the CPU is an Intel(R) Core(TM) i7-9750H processor with 6 cores / 12 threads and a 2.6 GHz base frequency; the operating system is 64-bit Windows 10 Professional; the heterogeneous parallel platform is CUDA Toolkit 10.2; the CUDA programming environment is Microsoft Visual Studio 2019; the algorithm verification platform is MATLAB R2020a.
The simulation parameters of the invention are as follows: a linear frequency-modulated signal is used as the radar transmit signal, with bandwidth B = 20 MHz, pulse width τ = 10 μs, pulse repetition period T_r = 100 μs, sampling frequency f_s = 100 MHz, and transmitter carrier frequency f_c = 10 GHz. Three moving targets to be detected are set in the simulation: moving target 1 at a range of 3 km with speed 250 m/s; moving target 2 at 6 km with speed 25 m/s; moving target 3 at 4 km with speed 75 m/s. Finally, a stationary target is set at a range of 1 km.
In the simulation, 1000 pulse repetition periods T_r transmitted by the radar constitute one data acquisition, which takes 100 ms. From the sampling frequency f_s, the number of sampling points N_s within one pulse repetition period T_r is 10^4. Since the CUDA program uses 4-byte float single-precision numbers, the data volume to be processed in one acquisition time is 10^3 · N_s · 4 bytes / 1024^2 ≈ 38.1 MB.
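The 38.1 MB figure follows directly from the stated parameters:

```python
pulses = 1000            # pulse repetition periods per data acquisition
f_s = 100e6              # sampling frequency, Hz
t_r = 100e-6             # pulse repetition period, s
n_s = round(f_s * t_r)   # sampling points per period: 10**4
bytes_per_sample = 4     # float single precision

size_mb = pulses * n_s * bytes_per_sample / 1024**2   # ~38.15 MB
```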
2. Simulation content:
FIG. 4 is a simulation diagram of executing only MTD kernels in the GPU, and then loading the output data to the MATLAB platform for result verification.
Fig. 5 is a simulation diagram in which the MTI and MTD kernel functions are executed in sequence in the GPU, and then the output data is loaded to the MATLAB platform for result verification.
FIG. 6 is an acceleration ratio curve of the CUDA heterogeneous parallel algorithm and the CPU serial algorithm of MTI + MTD after the optimization strategy of step 4 is used.
3. Simulation result analysis:
From the simulation parameters, the maximum unambiguous range of the radar is R_max = c·T_r/2 = 15 km, and the range resolution is ΔR = c/(2B) = 7.5 m. A 32-point FFT is used for MTD processing in the GPU, so the Doppler resolution is Δf_d = 1/(32·T_r) and the velocity resolution is Δv = c·Δf_d/(2·f_c) ≈ 4.7 m/s.
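These derived quantities can be recomputed directly from the simulation parameters (a quick numerical check):

```python
C = 3.0e8        # speed of light, m/s
B = 20e6         # transmit bandwidth, Hz
t_r = 100e-6     # pulse repetition period, s
f_c = 10e9       # carrier frequency, Hz
n_fft = 32       # FFT points used for MTD

r_max = C * t_r / 2            # maximum unambiguous range: 15000 m
delta_r = C / (2 * B)          # range resolution: 7.5 m
delta_fd = 1 / (n_fft * t_r)   # Doppler resolution: 312.5 Hz
delta_v = C * delta_fd / (2 * f_c)   # velocity resolution: ~4.69 m/s
```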
In Fig. 4 and Fig. 5, the x-axis represents range in m, the y-axis represents velocity in m/s, and the z-axis represents normalized amplitude. As can be seen from Fig. 4, when only MTD processing is performed in the GPU without MTI processing, a total of 4 targets are detected, and the stationary target at 1 km is not filtered out. As can be seen from Fig. 5, when the GPU performs MTI and then MTD processing, the stationary target is successfully filtered out, and, within the allowed error range, the MTD result output by the GPU matches the simulation parameters of the 3 expected moving targets, verifying the feasibility of the radar MTI and MTD implementation method based on CUDA heterogeneous parallel acceleration provided by the invention.
In Fig. 6, the horizontal axis is the data size processed during MTI + MTD processing on the CPU or GPU, and the vertical axis is the speedup, obtained by dividing the average time consumed by the CPU serial algorithm by the average time consumed by the CUDA heterogeneous parallel algorithm. As Fig. 6 shows, the speedup saturates as the processed data volume grows large; overall, the optimized MTI and MTD parallel algorithm reaches a speedup of 142.66 times, which readily meets the real-time requirements of radar signal processing, and the development mode based on the CUDA software stack and the Visual Studio platform also facilitates porting the algorithm across platforms.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (4)

1. A radar MTI and MTD realization method based on CUDA heterogeneous parallel acceleration is characterized by comprising the following steps:
Step 1: set the radar signal-processing parameter values in the CPU, read the N_r × N_c echo data matrix X after pulse compression, and copy X into pre-allocated GPU device memory as the initial data for MTI processing;
Step 2: use a 2-dimensional thread index to assign the grid (Grid) and thread block (Block) sizes of the CUDA threads, execute the second-order canceller MTI kernel on the GPU, and output the echo data with stationary clutter and noise filtered out, together with the range cells in which moving targets are located;
Step 3: for the N_r × N_c result matrix X_MTI obtained in step 2, first execute a matrix-transpose kernel on the GPU, then call the cufftExecC2C function of the cuFFT library to complete the parallel FFT of multiple groups of Doppler channels, and finally execute the transpose kernel again to obtain the N_r × N_c matrix X_MTD output by the MTD parallel algorithm, which is copied from the GPU back to the CPU;
Step 4: optimize the MTI and MTD kernels implemented in steps 2 and 3 with strategies including code-instruction optimization, optimal thread allocation, and aligned and coalesced global memory access, and compute the speedup of the optimized CUDA heterogeneous parallel algorithm over the CPU serial algorithm.
2. The method according to claim 1, characterized in that step 1 comprises in particular the following sub-steps:
Step 1.1: set the transmit-signal parameters in the CPU, obtain the N_r × N_c echo matrix X after pulse compression, and allocate GPU device memory with the cudaMalloc function.
Step 1.2: copy each pulse-compressed echo datum from the CPU to the GPU with the cudaMemcpy function and the cudaMemcpyHostToDevice parameter; each CUDA thread stores the value of the current sampling point and the value of that sampling point after it passes through the delay line.
3. The method according to claim 1, characterized in that step 2 comprises in particular the following sub-steps:
and 2.1, dividing the Grid (Grid) and the thread Block (Block) sizes of the thread organization according to the length of the echo data copied to the GPU, wherein each thread on the GridDim.x dimension is responsible for finishing two subtraction operations of a group of three-pulse cancellation.
And 2.2, executing a secondary canceller MTI kernel function, and finishing two times of subtraction operations of sampling points of the same distance resolution unit in a pulse repetition period in the GPU by utilizing the thread index value.
4. The method according to claim 1, characterized in that step 3 comprises in particular the following sub-steps:
and 3.1, dividing the sizes of grids (Grid) and thread blocks (Block) of the thread organization, processing distance dimensional data of a plurality of channels by using a Grid dim.x dimension, and processing Doppler dimensional data of the plurality of channels by using a Grid dim.y dimension.
Step 3.2: for the matrix data produced by the double-cancellation MTI kernel function, configure the kernel so that each Doppler channel's data maps to one thread block, and execute the matrix transpose kernel function in the GPU so that the range-dimension and Doppler-dimension data addresses become contiguous.
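The transpose of step 3.2 reorders the Nr×Nc matrix so that each Doppler channel's samples become contiguous in memory before the FFT. A plain-C sketch of that reordering — the CUDA kernel assigns one Doppler channel per thread block, whereas here a nested loop stands in for the grid, and the names are illustrative:

```c
#include <assert.h>

/* Illustrative sketch of the matrix-transpose step that makes the
 * Doppler dimension contiguous before the batched FFT.  `in` is an
 * nr x nc matrix in row-major order; `out` receives its nc x nr
 * transpose.  In the CUDA version each thread block handles one
 * Doppler channel; here the double loop stands in for the grid. */
void transpose(const float *in, float *out, int nr, int nc)
{
    for (int r = 0; r < nr; ++r)
        for (int c = 0; c < nc; ++c)
            out[c * nr + r] = in[r * nc + c];
}
```

After the transpose, samples that belonged to one Doppler channel sit at consecutive addresses, so the subsequent FFT reads memory sequentially instead of with an Nc-element stride.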
Step 3.3: create a cuFFT handle, call the CUDA library function cufftPlan2d to configure a 2-D cuFFT plan, and execute the complex-to-complex parallel FFT along the Doppler dimension with the library function cufftExecC2C and the CUFFT_FORWARD parameter.
Step 3.4: execute the matrix transpose kernel function once more on the matrix produced by the parallel FFT to obtain the Nr×Nc matrix X_MTD output by the MTD parallel algorithm. The Doppler shift f_d is read from the Doppler channel in which the moving target lies; the radial velocity and the velocity resolution are then given by:

v_r = c·f_d / (2·f_c)

Δv = c·Δf_d / (2·f_c) = c·f_r / (2·f_c·N_mtdFFT)

where c is the speed of light, f_c is the carrier frequency, Δf_d = f_r / N_mtdFFT is the Doppler resolution, f_r is the pulse repetition frequency, and N_mtdFFT is the number of FFT points selected for the MTD.
Step 3.5: copy the MTD-processed target echo data from the GPU back to the CPU using the cudaMemcpy function with the cudaMemcpyDeviceToHost parameter, call the cufftDestroy function to destroy the cuFFT handle, and call the free and cudaFree functions to release the memory resources occupied by the CPU and GPU, respectively.
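The velocity readout of step 3.4 can be checked with a small plain-C sketch of the two formulas. The carrier frequency, Doppler shift, and pulse repetition frequency used in the test values are assumed examples, not parameters from the patent:

```c
#include <assert.h>

/* Illustrative sketch of the step-3.4 readout formulas.
 * radial_velocity:      v_r = c * f_d / (2 * f_c)
 * velocity_resolution:  dv  = c * (f_r / N_fft) / (2 * f_c)
 * All frequency values passed in are example assumptions. */
static const double C_LIGHT = 3.0e8;  /* speed of light, m/s */

double radial_velocity(double f_d, double f_carrier)
{
    return C_LIGHT * f_d / (2.0 * f_carrier);
}

double velocity_resolution(double f_pr, int n_fft, double f_carrier)
{
    double delta_fd = f_pr / n_fft;   /* Doppler resolution, Hz */
    return C_LIGHT * delta_fd / (2.0 * f_carrier);
}
```

For an assumed 1 GHz carrier (0.3 m wavelength), a 1 kHz Doppler shift maps to 150 m/s of radial velocity, and a 1 kHz PRF with 100 FFT points gives a 1.5 m/s velocity resolution.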
CN202110238579.5A 2021-03-04 2021-03-04 Radar MTI and MTD implementation method based on CUDA heterogeneous parallel acceleration Active CN112986944B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110238579.5A CN112986944B (en) 2021-03-04 2021-03-04 Radar MTI and MTD implementation method based on CUDA heterogeneous parallel acceleration


Publications (2)

Publication Number Publication Date
CN112986944A true CN112986944A (en) 2021-06-18
CN112986944B CN112986944B (en) 2023-09-08

Family

ID=76352588


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113704520A (en) * 2021-10-27 2021-11-26 天津(滨海)人工智能军民融合创新中心 Method and device for accelerating Anchor-based data processing by using cuda in parallel and electronic equipment
CN116502028A (en) * 2023-04-28 2023-07-28 中国科学院软件研究所 Large-scale FFT (fast Fourier transform) implementation method and device based on floating point number compression technology
CN117152259A (en) * 2023-11-01 2023-12-01 常熟理工学院 Micro-assembly positioning acceleration method and system based on multichannel microscopic vision guidance
CN117687779A (en) * 2023-11-30 2024-03-12 山东诚泉信息科技有限责任公司 Complex electric wave propagation prediction rapid calculation method based on heterogeneous multi-core calculation platform

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104849698A (en) * 2015-05-21 2015-08-19 中国人民解放军海军工程大学 Radar signal parallel processing method and system based on heterogeneous multinucleated system
WO2018045566A1 (en) * 2016-09-09 2018-03-15 深圳大学 Random pulse doppler radar angle-doppler imaging method based on compressed sensing
CN110187962A (en) * 2019-04-26 2019-08-30 中国人民解放军战略支援部队信息工程大学 A kind of Gridding algorithm optimization method and device based on CUDA
CN110208752A (en) * 2019-06-27 2019-09-06 电子科技大学 A kind of radar MTI/MTD implementation method based on GPU


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TIAN Qianyuan; XU Zhaoyang; ZHAO Quan: "GPU-based software radar signal processing", Shipboard Electronic Countermeasure, no. 01 *



Similar Documents

Publication Publication Date Title
CN112986944A (en) CUDA heterogeneous parallel acceleration-based radar MTI and MTD implementation method
KR102385349B1 (en) Neural Network Instruction Set Architecture
JP2019537793A (en) Neural network calculation tile
US20190146796A1 (en) Uniform register file for improved resource utilization
CN111289975B (en) Rapid imaging processing system for multi-GPU parallel computing
CN109993293A (en) A kind of deep learning accelerator suitable for stack hourglass network
WO2016024508A1 (en) Multiprocessor device
CN112446471B (en) Convolution acceleration method based on heterogeneous many-core processor
Verhaegh et al. Efficiency improvements for force-directed scheduling
CN113407483B (en) Dynamic reconfigurable processor for data intensive application
CN113406572A (en) Radar parallel processing system and method, storage medium and terminal
CN114005458A (en) Voice noise reduction method and system based on pipeline architecture and storage medium
CN113359134A (en) SAR data distributed real-time imaging processing system and method based on embedded GPU
CN110208753B (en) GPU-based radar target echo signal acquisition method
CN103365821A (en) Address generator of heterogeneous multi-core processor
CN115951323A (en) Radar signal self-adaptive constant false alarm rate detection optimization method based on OpenCL
CN112732638B (en) Heterogeneous acceleration system and method based on CTPN network
CN115328440A (en) General sparse matrix multiplication implementation method and device based on 2D systolic array
Abdelrazek et al. A novel architecture using NVIDIA CUDA to speed up simulation of multi-path fast fading channels
US11640302B2 (en) SIMD processing unit performing concurrent load/store and ALU operations
CN110750752A (en) Interpolation method and device for analog quantity data
CN117785480B (en) Processor, reduction calculation method and electronic equipment
CN118034785B (en) Instruction compression method, device, accelerator and storage medium
Yang et al. The distributed imaging processing method of space-borne SAR based on embedded GPU
CN115827215A (en) Empty signal processing modular design method based on GPU acceleration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant