CN112986944B - Radar MTI and MTD implementation method based on CUDA heterogeneous parallel acceleration - Google Patents

Radar MTI and MTD implementation method based on CUDA heterogeneous parallel acceleration

Info

Publication number
CN112986944B
CN112986944B
Authority
CN
China
Prior art keywords
gpu
mtd
mti
cuda
thread
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110238579.5A
Other languages
Chinese (zh)
Other versions
CN112986944A (en)
Inventor
贾宗衡
孙子棠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202110238579.5A priority Critical patent/CN112986944B/en
Publication of CN112986944A publication Critical patent/CN112986944A/en
Application granted granted Critical
Publication of CN112986944B publication Critical patent/CN112986944B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S7/00Details of systems according to groups G01S13/00, G01S15/00, G01S17/00
    • G01S7/02Details of systems according to groups G01S13/00, G01S15/00, G01S17/00 of systems according to group G01S13/00
    • G01S7/41Details of systems according to groups G01S13/00, G01S15/00, G01S17/00 of systems according to group G01S13/00 using analysis of echo signal for target characterisation; Target signature; Target cross-section
    • G01S7/414Discriminating targets with respect to background clutter
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Radar Systems Or Details Thereof (AREA)

Abstract

The invention relates to the technical field of radar signal processing and provides a radar MTI and MTD implementation method based on CUDA heterogeneous parallel acceleration. The method comprises the following steps: set the radar signal processing parameter values in the CPU and copy the pulse-compressed echo matrix data into GPU video memory; divide the kernel thread organization and execute the second-order canceller MTI kernel function on the GPU; execute the matrix transpose kernel on the GPU, complete the parallel FFT of multiple groups of Doppler channels using the cuFFT library, and finally execute the matrix transpose kernel again to obtain the output of the MTD parallel algorithm, which is transmitted back to the CPU; optimize the MTI and MTD kernels with CUDA code optimization strategies and plot the optimized speed-up ratio curve. The optimized parallelized algorithm reaches a speed-up ratio of 142.66x, which satisfies the real-time requirements of radar signal processing; being based on the CUDA software system and a Visual Studio development workflow, the method is also easy to extend and port.

Description

Radar MTI and MTD implementation method based on CUDA heterogeneous parallel acceleration
Technical Field
The invention belongs to the technical field of radar signal processing and in particular relates to a method for implementing radar MTI and MTD based on CUDA heterogeneous parallel acceleration. The method uses the parallel computing power of the GPU and the CUDA heterogeneous programming model to guarantee the real-time performance of the MTI and MTD algorithms when the radar processes large echo data volumes, and is easy to port across platforms.
Background
During signal processing the radar suppresses clutter by means of moving target indication (MTI) and moving target detection (MTD) techniques. MTI processing exploits the fact that, in the frequency domain, clutter has a lower Doppler frequency than the object the radar is detecting; a digital canceller cancels each range cell one by one, filtering out stationary clutter and improving the signal-to-clutter ratio. However, MTI cannot obtain the Doppler frequency of a moving object in advance, so MTD is further required to suppress clutter outside the echo band. The common practice in MTD processing is to cascade, after the MTI filter, a bank of adjacent narrowband Doppler filters matched to the coherent echo pulse train. As modern battlefield electromagnetic environments grow more complex, the ever-increasing echo data volume makes CPU serial processing time-consuming, and the real-time requirements of radar signal processing in the current battlefield environment are difficult to meet.
The GPU, as the core component of a graphics card, has a highly parallel hardware architecture and is particularly superior to the CPU for parallel computation. CUDA (Compute Unified Device Architecture) is a general-purpose parallel computing platform proposed by NVIDIA that supports heterogeneous cooperation between the CPU and the GPU; its programming model fully combines the logical control the CPU is good at with the parallel arithmetic the GPU is good at. There has already been some research on MTI and MTD algorithms for the CUDA platform.
Chen Dajiang of the University of Electronic Science and Technology of China presents a GPU-implemented MTI algorithm in his master's thesis, "GPU-based alert radar signal processing software design". The method is theoretically similar to the MTI parallelization in the present invention and comprises the following main steps. First: copy the pulse-compressed echo data from the CPU to the GPU in a first-in first-out storage mode. Second: use a first-order canceller at the GPU end to perform two-pulse cancellation within a pulse repetition period. Third: return the MTI-processed result to the CPU for scheduling. The method successfully reduces the time consumed by MTI processing, but the designed MTI filter has a narrow stop-band notch and a poor clutter suppression effect, and the lack of uniform data precision causes large estimation errors.
The patent "A fast implementation method for processing external radiation source radar signals based on GPU" (application number CN201310176310.4; publication number CN103308897B), filed by the Institute of Electronics, Chinese Academy of Sciences, discloses an MTD algorithm implemented on the GPU. The method mainly comprises the following steps. First: cross-reorganize the echo data to be processed, dividing the whole echo into N equal-length data blocks, and subdividing each block into L equal-length data segments of M data points each. Second: splice data with the same segment number from different blocks together in order, and splice the tail data of the i-th segment of the N-th block to the head data of the (i+1)-th segment of the 1st block, forming a new storage structure. Third: copy the echo data in the new storage structure to the GPU and launch M×N threads. Fourth: perform MTD processing on every M×N data points of the spliced data on the GPU side. The method effectively improves the parallelism of the MTD algorithm, but it does not consider optimizations such as thread allocation and latency hiding.
Disclosure of Invention
The invention provides a radar MTI and MTD implementation method based on CUDA heterogeneous parallel acceleration, using GPU hardware, the CUDA software system, and its programming model; the method also includes the optimization of code instructions and thread structures, which greatly improves the radar signal processing speed.
The technical idea of the invention is to combine the radar signal processing algorithms with GPU parallel processing on the CUDA acceleration platform, adopting a heterogeneous CPU+GPU programming mode to realize efficient MTI and MTD parallel algorithms. The whole system comprises a Host-side program executed by the CPU and a Device-side program executed by the GPU. The Host side is responsible for logical control and data management, specifically: setting simulation parameters, configuring the GPU thread hierarchy, allocating and releasing memory, reading radar echo data, copying data to the GPU, and launching kernel functions. The Device side is responsible for executing the kernel functions and CUDA library functions that implement the MTI and MTD parallel algorithms.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the method for realizing the radar MTI and MTD based on CUDA isomerism parallel acceleration comprises the following steps:
Step 1: set the radar signal processing parameter values in the CPU, read the N_r×N_c echo data matrix X obtained after pulse compression as the initial data before MTI processing, and copy it into the allocated GPU video memory;
Step 2: use a 2-dimensional thread index to divide the CUDA thread Grid and thread Block sizes, execute the second-order canceller MTI kernel function on the GPU, and output the echo data with stationary clutter filtered out together with the range cells containing moving targets;
Step 3: for the N_r×N_c result matrix X_MTI obtained in step 2, first execute the matrix transpose kernel on the GPU, then call the cufftExecC2C function of the cuFFT library to complete the parallel FFT of multiple groups of Doppler channels, and finally execute the matrix transpose kernel again to obtain the N_r×N_c matrix X_MTD output by the MTD parallel algorithm and copy it from the GPU back to the CPU.
Step 4: optimize the MTI and MTD kernel functions of steps 2 and 3 using strategies such as code instruction optimization, optimal thread allocation, and aligned and coalesced global memory access, and compute the speed-up ratio of the optimized CUDA heterogeneous parallel algorithm over the CPU serial algorithm.
In this radar MTI and MTD implementation method based on CUDA heterogeneous parallel acceleration, the target echo data simultaneously carries range information (from time delay) and velocity information (from Doppler shift). First, the initial data matrix is stored in the GPU device's video memory and the CUDA thread model is laid out with a two-dimensional index. Next, the parallelized MTI kernel based on the second-order canceller principle is executed on the GPU. Before MTD processing, the output matrix of the previous stage is transposed so that the Doppler-dimension data addresses are contiguous; the parallel Doppler FFT is then completed on the GPU; finally, the matrix is transposed again to restore the expected target echo layout.
Compared with the prior art, the invention has the following advantages. First, all kernel functions are optimized according to CUDA optimization strategies, fully improving the speed of signal processing. Second, balancing arithmetic precision against acceleration, functions of low arithmetic intensity use single-precision floating point, for which the Turing-architecture GPU has higher throughput, giving a better overall cost-performance ratio. Third, being based on the CUDA software system and the Visual Studio development workflow, the invention is modular and easy to extend and port.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the CUDA heterogeneous parallel radar MTI and MTD implementation method;
fig. 2 is a schematic diagram of a secondary canceller implementing an MTI algorithm according to the present invention;
fig. 3 is a schematic diagram of a narrowband doppler filter bank for implementing an MTD algorithm according to the present invention;
FIG. 4 is a verification simulation diagram of the result of executing only the MTD kernel function in the GPU provided by the invention;
FIG. 5 is a verification simulation diagram of the result of executing MTI and MTD kernel functions in succession in a GPU provided by the invention;
FIG. 6 is an acceleration ratio curve of the optimized CUDA heterogeneous parallel algorithm and the CPU serial algorithm provided by the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The flow chart of the CUDA heterogeneous parallel MTI and MTD algorithm provided by the embodiment of the invention is shown in FIG. 1. Specifically, the method comprises the following steps:
Step 1: set the radar signal processing parameter values in the CPU, read the N_r×N_c echo data matrix X obtained after pulse compression, and copy it, as the initial data before MTI processing, into the allocated GPU video memory.
Specifically, the step 1 includes the following 2 substeps:
Step 1.1: set the transmit signal parameters in the CPU, obtain the N_r×N_c echo matrix X after pulse compression, and allocate GPU video memory with the cudaMalloc function.
Step 1.2: copy the pulse-compressed echo data from the CPU to the GPU using the cudaMemcpy function with the cudaMemcpyHostToDevice parameter; each CUDA thread stores the current sample value and the values of that sample after the delay lines.
Step 2: use a 2-dimensional thread index to divide the CUDA thread Grid and thread Block sizes, execute the second-order canceller MTI kernel function on the GPU, and output the echo data with stationary clutter filtered out together with the range cells containing moving targets. FIG. 2 is a schematic diagram of the second-order canceller used in MTI processing; its output signal Y(t) equals the convolution of the impulse response H(t) with the input X(t):

Y(t) = H(t) * X(t) = X(t) - 2X(t - T_r) + X(t - 2T_r)

and the transfer function is

H(z) = (1 - z^(-1))^2 = 1 - 2z^(-1) + z^(-2)
Specifically, the step 2 includes the following 3 sub-steps:
Step 2.1: divide the Grid and thread Block of the thread organization according to the length of the echo data copied to the GPU; each thread along the gridDim.x dimension is responsible for the two subtractions of one three-pulse cancellation group.
Step 2.2: execute the second-order canceller MTI kernel function, using the thread index values to complete, on the GPU, the two subtractions of the samples of the same range resolution cell across pulse repetition periods.
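The cancellation arithmetic of steps 2.1-2.2 can be sketched in plain C++ (a hedged illustration of the per-range-cell computation, not the patent's actual kernel; in the CUDA kernel the loop index would come from the 2-D thread index, and the function name is ours):

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Three-pulse (second-order) cancellation for one range cell:
// y[n] = x[n] - 2*x[n-1] + x[n-2], where x[n] are the samples of the
// same range resolution cell in successive pulse repetition periods.
// A constant (stationary-clutter) sequence cancels exactly to zero.
std::vector<std::complex<float>> mti_cancel(const std::vector<std::complex<float>>& x) {
    std::vector<std::complex<float>> y;
    for (std::size_t n = 2; n < x.size(); ++n)
        y.push_back(x[n] - 2.0f * x[n - 1] + x[n - 2]);
    return y;
}
```

Because H(z) = (1 - z^(-1))^2 has a double zero at z = 1, a zero-Doppler input is suppressed completely, which is why the stationary target disappears after MTI.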
Step 3: for the N_r×N_c result matrix X_MTI obtained in step 2, first execute the matrix transpose kernel on the GPU, then call the cufftExecC2C function of the cuFFT library to complete the parallel FFT of multiple groups of Doppler channels, and finally execute the matrix transpose kernel again to obtain the N_r×N_c matrix X_MTD output by the MTD parallel algorithm and copy it from the GPU back to the CPU. A schematic diagram of the MTD filter, built as an MTI-cascaded FFT using a bank of narrowband Doppler filters, is shown in FIG. 3; the amplitude-frequency characteristic of the k-th filter is

|H_k(f)| = |sin(Nπ(f·T_r - k/N)) / sin(π(f·T_r - k/N))|

where N is the number of target echo pulses, k indexes the k-th filter, and T_r is the pulse repetition period. Each range cell, together with N-1 delay elements of T_r, covers the whole Doppler band.
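As an illustrative check (a sketch under the assumption that the filter bank is the standard N-point DFT bank; the function name is ours), the magnitude response of the k-th Doppler filter peaks at f = k/(N·T_r) and nulls at the other channel centers:

```cpp
#include <cmath>
#include <complex>

// Magnitude response of the k-th filter of an N-point DFT Doppler
// filter bank with pulse repetition period Tr: the coherent sum of
// N pulses weighted by the k-th DFT steering phases.
double mtd_filter_mag(int N, int k, double f, double Tr) {
    const double pi = std::acos(-1.0);
    std::complex<double> acc(0.0, 0.0);
    for (int n = 0; n < N; ++n)
        acc += std::polar(1.0, 2.0 * pi * n * (static_cast<double>(k) / N - f * Tr));
    return std::abs(acc);
}
```

At the channel center the N pulses add coherently (magnitude N); at a neighboring channel center the phases sweep a full turn and cancel.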
Specifically, the step 3 includes the following 5 sub-steps:
Step 3.1: divide the Grid and thread Block sizes of the thread organization; the gridDim.x dimension processes the range-dimension data of multiple channels, and the gridDim.y dimension processes the Doppler-dimension data of multiple channels.
Step 3.2: for the matrix data produced by the second-order canceller MTI kernel, configure the kernel so that each Doppler channel's data maps into one thread block, and execute the matrix transpose kernel on the GPU so that the range-dimension and Doppler-dimension data addresses become contiguous.
Step 3.3: create a cuFFT handle, call the CUDA library function cufftPlan2d to configure a 2-dimensional cuFFT plan, and execute the complex-to-complex parallel FFT along the Doppler dimension using the library function cufftExecC2C with the parameter CUFFT_FORWARD.
Step 3.4: execute the matrix transpose kernel again on the matrix obtained from the parallel FFT, obtaining the N_r×N_c matrix X_MTD output by the MTD parallel algorithm. The Doppler shift f_d follows from the Doppler channel in which a moving target lies; the radial velocity and the velocity resolution of the moving target are respectively

v = c·f_d / (2·f_c),  Δv = c·Δf_d / (2·f_c) = c·f_r / (2·f_c·N_FFT)

where c is the speed of light, f_c is the carrier frequency, Δf_d is the Doppler resolution, f_r is the pulse repetition frequency, and N_FFT is the number of FFT points selected by the MTD.
Step 3.5: copy the MTD-processed target echo data from the GPU to the CPU using the cudaMemcpy function with the cudaMemcpyDeviceToHost parameter, call the cufftDestroy function to destroy the cuFFT handle, and call the free and cudaFree functions to release the memory resources occupied by the CPU and the GPU respectively.
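The role of the transposes in steps 3.2 and 3.4 is to make each Doppler channel contiguous before the batched FFT; the index mapping can be sketched as a CPU reference implementation (not the actual GPU kernel; the function name is ours):

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Row-major transpose of an nr x nc matrix: element (r, c) of the
// input moves to (c, r) of the nc x nr output, so samples that were
// strided by nc (one per pulse) become contiguous for the Doppler FFT.
std::vector<std::complex<float>> transpose(const std::vector<std::complex<float>>& in,
                                           int nr, int nc) {
    std::vector<std::complex<float>> out(in.size());
    for (int r = 0; r < nr; ++r)
        for (int c = 0; c < nc; ++c)
            out[static_cast<std::size_t>(c) * nr + r] =
                in[static_cast<std::size_t>(r) * nc + c];
    return out;
}
```

On the GPU the same mapping is computed from the 2-D thread index; a shared-memory tile is commonly used so that both the read and the write stay coalesced.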
Step 4: optimize the MTI and MTD kernel functions of steps 2 and 3 using strategies such as code instruction optimization, optimal thread allocation, and aligned and coalesced global memory access, and compute the speed-up ratio of the optimized CUDA heterogeneous parallel algorithm over the CPU serial algorithm.
Specifically, the following 4 CUDA optimization strategies are included in step 4:
(1) Code instruction optimization. The invention replaces arithmetic operators in the kernel functions with bit operators: for example, the left-shift operator << replaces multiplication by a power of two, and a modulo operation by 2^n is replaced by a bitwise AND with (2^n - 1). Meanwhile, the suffix 'f' is appended to all float-type literals, eliminating the unnecessary cost of implicit double-to-float type conversions.
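The two bit-operator substitutions can be sketched as follows (illustrative helper names; valid for non-negative integers and power-of-two divisors, which is the case for the thread and FFT sizes used here):

```cpp
// Left shift replaces multiplication by a power of two: x << 1 == x * 2.
unsigned int times2(unsigned int x) { return x << 1; }

// Bitwise AND with (2^n - 1) replaces modulo by 2^n: x & 7 == x % 8 for n = 3.
unsigned int mod_pow2(unsigned int x, unsigned int n) {
    return x & ((1u << n) - 1u);
}
```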
(2) Optimal thread allocation. The number of threads per thread block is configured as an integer multiple of 32, not exceeding 1024. For 4-byte int-type data, 256 threads per thread block are used; for 8-byte complex (float2) data, 128 threads per thread block, so that execution units can be better reused and the efficiency of the CUDA instruction pipeline improves.
(3) Aligned global memory access. The head address of each global memory transaction on the GPU device is an integer multiple of the cache granularity; accesses are served through 32-byte L2 cache segments or 128-byte L1 cache lines, and global memory accesses are always kept aligned, saving a portion of the bandwidth.
(4) Coalesced global memory access. The invention makes warps start from aligned memory addresses, and all 32 threads of each warp access one contiguous memory block; every transferred datum is needed by the warp, so the memory-access coalescing degree is 100%, which helps maximize memory throughput.
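The coalesced pattern follows from the standard 1-D global index computation (a sketch of the indexing rule, not the patent's kernel): consecutive threadIdx.x values map to consecutive element addresses, so one warp touches one contiguous 32-element block.

```cpp
// Global element index of a thread in a 1-D CUDA launch:
// idx = blockIdx.x * blockDim.x + threadIdx.x.
// Threads 0..31 of a warp therefore access 32 consecutive elements,
// which the hardware can serve as one aligned, coalesced transaction.
int global_index(int block_idx, int block_dim, int thread_idx) {
    return block_idx * block_dim + thread_idx;
}
```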
This completes the radar MTI and MTD implementation method based on CUDA heterogeneous parallel acceleration provided by the invention.
The effects of the present invention will be further described with reference to simulation experiments.
1. Simulation conditions:
In the simulation experiments of the present invention, the computer hardware and software environment is configured as follows: the GPU is an NVIDIA GeForce GTX 1660 Ti graphics card with 6 GB of video memory and 1536 CUDA cores; the CPU is an Intel(R) Core i7-9750H processor (6 cores, 12 threads, 2.6 GHz base clock); the operating system is 64-bit Windows 10 Professional; the heterogeneous parallel platform is CUDA Toolkit 10.2; the CUDA programming environment is Microsoft Visual Studio 2019; the algorithm verification platform is MATLAB R2020a.
The simulation parameters of the invention are as follows: a linear frequency-modulated signal is used as the radar transmit signal, with modulation bandwidth B = 20 MHz, pulse width τ = 10 μs, pulse repetition period T_r = 100 μs, sampling frequency f_s = 100 MHz, and transmitter carrier frequency f_c = 10 GHz. Three moving targets to be detected are set in the simulation: moving target 1 at a range of 3 km with a velocity of 250 m/s; moving target 2 at 6 km, 25 m/s; moving target 3 at 4 km, 75 m/s. Finally, a stationary target is set at a range of 1 km.
In the simulation, each radar transmission of 1000 pulse repetition periods T_r is taken as one data acquisition, so one acquisition takes 100 ms. From the sampling frequency f_s, the number of samples per pulse repetition period is N_s = 10^4. The CUDA code uses 4-byte single-precision floats, so the data volume to process in one acquisition is 10^3 · N_s · 4 bytes / 1024^2 ≈ 38.1 MB.
2. The simulation content:
fig. 4 is a simulation diagram of performing only MTD kernel functions in the GPU and then loading output data to the MATLAB platform for result verification.
Fig. 5 is a simulation diagram of sequentially executing MTI and MTD kernel functions in the GPU, and then loading output data to the MATLAB platform for result verification.
FIG. 6 is a graph of acceleration ratio of the CUDA heterogeneous parallel algorithm to the CPU serial algorithm for MTI+MTD using the optimization strategy of step 4.
3. Simulation result analysis:
according to the simulation parameters of the invention, the maximum unambiguous distance R of the radar can be solved max = (c·pri)/2=15 km, distance resolution Δr=c/2b=7.5 m. The invention adopts 32-point FFT to process MTD in GPU, then Doppler resolution Deltaf d =1/(pri·32), velocity resolution Δv= (c·Δf d )/(2·f c )≈4.7m/s。
In FIGS. 4 and 5, the x-axis represents range in m; the y-axis represents velocity in m/s; the z-axis represents normalized amplitude. As can be seen from FIG. 4, when only MTD processing is performed in the GPU without MTI processing, 4 targets are detected in total and the stationary target at 1 km is not filtered out. As can be seen from FIG. 5, when MTI and MTD processing are executed in sequence in the GPU, the stationary target is successfully filtered out, and within the allowable error the MTD result output by the GPU matches the simulation parameters of the 3 expected moving targets set by the invention, verifying the feasibility of the radar MTI and MTD implementation method based on CUDA heterogeneous parallel acceleration provided by the invention.
In FIG. 6, the horizontal axis represents the data size processed by the CPU or GPU for MTI+MTD; the vertical axis is the speed-up ratio, i.e. the average run time of the CPU serial algorithm divided by that of the CUDA heterogeneous parallel algorithm. As can be seen from FIG. 6, when the processed data volume is large the speed-up ratio tends to saturate; the optimized MTI and MTD parallelized algorithm reaches an overall speed-up ratio of 142.66x, which satisfies the real-time requirements of radar signal processing, and the development model based on the CUDA software system and the Visual Studio platform also favors porting the algorithm to other platforms.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (4)

1. A radar MTI and MTD implementation method based on CUDA heterogeneous parallel acceleration, characterized by comprising the following steps:
Step 1: set the radar signal processing parameter values in the CPU, read the N_r×N_c echo data matrix X obtained after pulse compression as the initial data before MTI processing, and copy it into the allocated GPU video memory;
Step 2: use a 2-dimensional thread index to divide the CUDA thread Grid and thread Block sizes, execute the second-order canceller MTI kernel function on the GPU, and output the echo data with stationary clutter filtered out together with the range cells containing moving targets;
Step 3: for the N_r×N_c result matrix X_MTI obtained in step 2, first execute the matrix transpose kernel on the GPU, then call the cufftExecC2C function of the cuFFT library to complete the parallel FFT of multiple groups of Doppler channels, and finally execute the matrix transpose kernel again to obtain the N_r×N_c matrix X_MTD output by the MTD parallel algorithm and copy it from the GPU back to the CPU;
Step 4: optimize the MTI and MTD kernel functions of steps 2 and 3 using strategies such as code instruction optimization, optimal thread allocation, and aligned and coalesced global memory access, and compute the speed-up ratio of the optimized CUDA heterogeneous parallel algorithm over the CPU serial algorithm.
2. The method according to claim 1, characterized in that step 1 comprises in particular the sub-steps of:
Step 1.1: set the transmit signal parameters in the CPU, obtain the N_r×N_c echo matrix X after pulse compression, and allocate GPU video memory with the cudaMalloc function;
Step 1.2: copy the pulse-compressed echo data from the CPU to the GPU using the cudaMemcpy function with the cudaMemcpyHostToDevice parameter; each CUDA thread stores the current sample value and the values of that sample after the delay lines.
3. The method according to claim 1, characterized in that step 2 comprises in particular the sub-steps of:
Step 2.1: divide the Grid and thread Block of the thread organization according to the length of the echo data copied to the GPU; each thread along the gridDim.x dimension is responsible for the two subtractions of one three-pulse cancellation group;
Step 2.2: execute the second-order canceller MTI kernel function, using the thread index values to complete, on the GPU, the two subtractions of the samples of the same range resolution cell across pulse repetition periods.
4. The method according to claim 1, characterized in that step 3 comprises the following sub-steps:
Step 3.1: divide the Grid and thread Block sizes of the thread organization; the gridDim.x dimension processes the range-dimension data of multiple channels, and the gridDim.y dimension processes the Doppler-dimension data of multiple channels;
Step 3.2: for the matrix data produced by the second-order canceller MTI kernel, configure the kernel so that each Doppler channel's data maps into one thread block, and execute the matrix transpose kernel on the GPU so that the range-dimension and Doppler-dimension data addresses become contiguous;
Step 3.3: create a cuFFT handle, call the CUDA library function cufftPlan2d to configure a 2-dimensional cuFFT plan, and execute the complex-to-complex parallel FFT along the Doppler dimension using the library function cufftExecC2C with the parameter CUFFT_FORWARD;
step 3.4, performing the matrix transposition kernel function again on the matrix obtained after the parallel FFT calculation to obtain the N_r×N_c-dimensional matrix X_MTD output by the MTD parallel algorithm; the Doppler shift f_d is obtained from the Doppler channel in which the moving target is located, and the radial velocity and velocity resolution of the moving target are respectively
v = c·f_d / (2·f_c) and Δv = c·Δf_d / (2·f_c), with Δf_d = f_r / N_FFT,
wherein c is the speed of light, f_c is the carrier frequency, Δf_d is the Doppler resolution, f_r is the pulse repetition frequency, and N_FFT is the FFT point number selected by the MTD;
and step 3.5, copying the MTD-processed target echo data from the GPU to the CPU by using the cudaMemcpy function with the cudaMemcpyDeviceToHost parameter, calling the cufftDestroy function to destroy the cuFFT handle, and calling the free function and the cudaFree function to release the memory resources occupied by the CPU and the GPU respectively.
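Steps 3.2-3.4 (transpose, Doppler-dimension FFT, transpose back) and the velocity formulas above can be sketched as a NumPy reference; the function names `mtd` and `radial_velocity` are illustrative, and the velocity formula follows the symbols listed in the claim under the assumption f_d = channel·f_r/N_FFT.

```python
import numpy as np

def mtd(Y, n_fft):
    """MTD: FFT across pulses for each range cell (steps 3.2-3.4).
    The transposes make the Doppler dimension contiguous before the FFT,
    mirroring the transpose kernels around cufftExecC2C in the patent."""
    Z = np.fft.fft(Y.T, n=n_fft, axis=0)   # FFT along the Doppler dimension
    return Z.T                             # back to range x Doppler layout

def radial_velocity(channel, n_fft, f_r, f_c, c=3e8):
    """v = c*f_d/(2*f_c), with f_d = channel*f_r/n_fft (assumed mapping)."""
    f_d = channel * f_r / n_fft
    return c * f_d / (2.0 * f_c)

# target with Doppler shift f_d = 2 kHz, sampled at f_r = 8 kHz over 16 pulses
n_r, n_c, f_r, f_c = 4, 16, 8e3, 10e9
m = np.arange(n_c)
Y = np.ones((n_r, 1)) * np.exp(2j * np.pi * 2e3 / f_r * m)
X_mtd = mtd(Y, n_c)
ch = int(np.argmax(np.abs(X_mtd[0])))
print(ch, radial_velocity(ch, n_c, f_r, f_c))  # channel 4, v = 30.0 m/s
```

With f_d/f_r = 0.25 the target energy lands in Doppler channel 4 of 16, and the assumed mapping then recovers v = 30 m/s for a 10 GHz carrier.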
CN202110238579.5A 2021-03-04 2021-03-04 Radar MTI and MTD implementation method based on CUDA isomerism parallel acceleration Active CN112986944B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110238579.5A CN112986944B (en) 2021-03-04 2021-03-04 Radar MTI and MTD implementation method based on CUDA isomerism parallel acceleration

Publications (2)

Publication Number Publication Date
CN112986944A CN112986944A (en) 2021-06-18
CN112986944B true CN112986944B (en) 2023-09-08

Family

ID=76352588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110238579.5A Active CN112986944B (en) 2021-03-04 2021-03-04 Radar MTI and MTD implementation method based on CUDA isomerism parallel acceleration

Country Status (1)

Country Link
CN (1) CN112986944B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113704520B (en) * 2021-10-27 2022-03-08 天津(滨海)人工智能军民融合创新中心 Method and device for accelerating Anchor-based data processing by using cuda in parallel and electronic equipment
CN116502028B (en) * 2023-04-28 2023-10-20 中国科学院软件研究所 Large-scale FFT (fast Fourier transform) implementation method and device based on floating point number compression technology
CN117152259A (en) * 2023-11-01 2023-12-01 常熟理工学院 Micro-assembly positioning acceleration method and system based on multichannel microscopic vision guidance
CN117687779B (en) * 2023-11-30 2024-04-26 山东诚泉信息科技有限责任公司 Complex electric wave propagation prediction rapid calculation method based on heterogeneous multi-core calculation platform

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104849698A (en) * 2015-05-21 2015-08-19 中国人民解放军海军工程大学 Radar signal parallel processing method and system based on heterogeneous multinucleated system
WO2018045566A1 (en) * 2016-09-09 2018-03-15 深圳大学 Random pulse doppler radar angle-doppler imaging method based on compressed sensing
CN110187962A (en) * 2019-04-26 2019-08-30 中国人民解放军战略支援部队信息工程大学 A kind of Gridding algorithm optimization method and device based on CUDA
CN110208752A (en) * 2019-06-27 2019-09-06 电子科技大学 A kind of radar MTI/MTD implementation method based on GPU

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GPU-based software radar signal processing; Tian Qianyuan; Xu Chaoyang; Zhao Quan; Shipboard Electronic Countermeasure (Issue 01); full text *

Similar Documents

Publication Publication Date Title
CN112986944B (en) Radar MTI and MTD implementation method based on CUDA isomerism parallel acceleration
CN104237852B (en) For processing the methods, devices and systems of radar signal
CN105137428A (en) Dechirp signal polar format imaging algorithm FPGA (Field Programmable Gate Array) realization method
CN102435989B (en) Field programmable gate array (FPGA)-based general wave beam forming device
CN116483319A (en) Operator processing method, device, equipment and medium for software defined chip
CN109154651A (en) Ranging processing method, device and unmanned vehicle based on radar
CN111830478B (en) FPGA (field programmable Gate array) implementation method for MTD (maximum Transmission Difference) processing of LFMCW (Linear frequency modulation and continuous phase) radar
CN103956991A (en) FIR filter parallel realization method based on CPU/GPU heterogeneous platform
CN113407483B (en) Dynamic reconfigurable processor for data intensive application
CN113672380B (en) Phase interferometer direction-finding system for realizing FX cross-correlation phase discrimination by GPU and phase discrimination method thereof
CN103544111B (en) A kind of hybrid base FFT method based on real-time process
CN105116398A (en) Real time Hough transformation detection weak object method based on FPGA
Rabinovich et al. Particle swarm optimization on a GPU
CN109633613B (en) FPGA (field programmable Gate array) realization method for hypersonic platform combined pulse compression and spring speed compensation
CN108874547A (en) A kind of data processing method and device of astronomy software Gridding
CN109840306A (en) One kind being based on recursive parallel FFT communication optimization method and system
CN115951323A (en) Radar signal self-adaptive constant false alarm rate detection optimization method based on OpenCL
CN102901951A (en) GPU (graphics processing unit)-based radar signal intra-pulse characteristic real-time analysis realizing scheme
CN109239688B (en) High-efficiency Doppler filter bank based on FPGA
Faulkner et al. GPU synthesis of RF channeliser outputs for a variable bandwidth microwave digital receiver
Wang et al. The Optimization of Radar Echo Pulse Compression Algorithm Based on DSP
CN109614151B (en) Four-core parallel large-point pulse pressure data processing method
Damnjanović et al. On Hardware Implementations of Two-Dimensional Fast Fourier Transform for Radar Signal Processing
Li et al. Parallel Optimal Design of SSR Response Signal Processing Algorithms Based on GPU
Shao et al. Research and implementation of a high performance parallel computing digital down converter on graphics processing unit

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant