CN110515053B - CPU and multi-GPU based heterogeneous platform SAR echo simulation parallel method

Info

Publication number: CN110515053B
Application number: CN201910794748.6A
Authority: CN (China)
Prior art keywords: gpu, cpu, gpu equipment, data, kernel function
Legal status: Active
Other versions: CN110515053A
Other languages: Chinese (zh)
Inventors: 梁毅, 王文杰, 邢孟道, 孙昆
Current Assignee: Xidian University
Original Assignee: Xidian University
Application filed by Xidian University
Priority to CN201910794748.6A
Publication of CN110515053A
Application granted
Publication of CN110515053B

Classifications

    • G — PHYSICS
    • G01 — MEASURING; TESTING
    • G01S — RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S7/00 — Details of systems according to groups G01S13/00, G01S15/00, G01S17/00
    • G01S7/02 — Details of systems according to groups G01S13/00, G01S15/00, G01S17/00 of systems according to group G01S13/00
    • G01S7/41 — Details of systems according to groups G01S13/00, G01S15/00, G01S17/00 of systems according to group G01S13/00 using analysis of echo signal for target characterisation; Target signature; Target cross-section

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Radar Systems Or Details Thereof (AREA)

Abstract

The invention discloses a CPU and multi-GPU based heterogeneous platform SAR echo simulation parallel method, which comprises the following steps: acquiring the information of all GPU equipment at the CPU end, and distributing parallel tasks to a selected GPU equipment group capable of point-to-point transmission; transmitting the simulation parameters and the reference picture information from the CPU end to different storage areas of the GPU equipment; taking the reference picture as the target area and partitioning the target area; defining the function, variables and allocated space of the kernel function on each GPU device, and determining the dynamic variables that allow the memory to be managed dynamically; configuring the kernel-function threads and optimizing the computation inside the kernel functions; and calling the kernel functions on the multiple GPU devices, adding the corresponding asynchronous and blocking operations to complete cross-device communication and data transmission, transmitting the echo data to the CPU end and writing the echo data into a file. Echo simulation data are processed in real time through the control of the CPU and the parallel design of the multiple GPUs, and the storage space is utilized to the maximum extent according to its size.

Description

Heterogeneous platform SAR echo simulation parallel method based on CPU and multiple GPUs
Technical Field
The invention relates to the field of radar signal processing, in particular to a heterogeneous platform SAR echo simulation parallel method based on a CPU and multiple GPUs.
Background
In the development of artificial intelligence, a large amount of data needs to be processed, and efficient parallel computing power provides the basic support for this development. The development of artificial intelligence places higher demands on computing power: chips with stronger computing power are required to process more data. Comparing the computing power of various chips, the computing power of the GPU (graphics processing unit) is ahead of that of other chips. Both the GPU and the CPU (central processing unit) can perform floating-point calculation, but in general the floating-point capability of a GPU is about ten times that of a CPU. A CPU typically has a few to a dozen or so internal computing cores, whereas the computing cores of a GPU can number in the thousands. By designing algorithms in parallel with CUDA and calling thousands of GPU threads to process data in parallel, a much higher processing speed can be achieved. The GPU can process a large amount of data in parallel, while the CPU provides the more complex logic-control functions. Existing data processing is inefficient, occupies a large amount of space, and cannot meet current requirements.
Disclosure of Invention
Aiming at the problems in the prior art, the invention aims to provide a heterogeneous platform SAR echo simulation parallel method based on a CPU and multiple GPUs.
In order to achieve the purpose, the invention is realized by adopting the following technical scheme.
The heterogeneous platform SAR echo simulation parallel method based on the CPU and the multiple GPUs comprises the following steps:
step 1, acquiring information of all GPU equipment at a CPU (Central processing Unit) end, determining GPU equipment groups which are positioned on the same node and can perform point-to-point transmission, and selecting the GPU equipment groups used for SAR echo simulation; parallel task division is carried out on the SAR echo simulation overall flow, parallel tasks are distributed to a selected GPU equipment group, GPU equipment in the GPU equipment group is respectively arranged on work flows of corresponding parallel tasks through a CPU end, and task-level parallelism is achieved;
step 2, setting simulation parameters at the CPU end, and transmitting the simulation parameters from the CPU end to a constant storage area of the GPU equipment group; acquiring an original image at the CPU end, selecting a reference picture from the original image, and transmitting the information of the reference picture from the CPU end to a global memory of the GPU equipment group;
step 3, taking the reference picture as a target area, performing azimuth partitioning on the target area, and placing the azimuth partitioning in a workflow corresponding to a parallel task to realize data level parallelism;
step 4, defining the kernel function, the kernel function variable and the kernel function distribution space on each GPU device, and determining the dynamic variable capable of dynamically managing the memory;
step 5, configuring a thread for each kernel function to realize thread-level parallelism; defining the content of the index in the kernel and a data organization mode, and optimizing the internal calculation of the kernel function;
step 6, respectively calling kernel functions on corresponding GPU equipment according to the divided parallel tasks, and finishing the accumulation of SAR echo data through data transmission and communication in a GPU equipment group; and constructing a frequency domain matching function at the CPU end, transmitting the frequency domain matching function to the GPU equipment group, finishing final calculation, and finally writing the SAR echo data into a file through the CPU end.
The technical scheme of the invention has the characteristics and further improvements that:
preferably, step 1 comprises the following substeps:
substep 1.1, acquiring the number of the existing GPU equipment and the information of each GPU equipment at a CPU end;
step 1.2, GPU equipment on the same node forms a GPU equipment group, and GPU equipment in the GPU equipment group directly carries out communication and data transmission; setting a loop traversing all GPU equipment at a CPU end, judging whether the GPU equipment i and j can carry out point-to-point communication by using a CUDA function, representing the GPU equipment which can carry out point-to-point communication in the form of coordinates (i, j), and selecting a GPU equipment group used for SAR echo simulation;
substep 1.3, respectively placing the GPU equipment in the GPU equipment group on the workflow corresponding to the parallel task, and realizing synchronization and asynchronization between different GPU equipment by operating and blocking the workflow at different time; and placing the tasks which are independent and independent of each other on the corresponding GPU equipment in the GPU equipment group.
Preferably, in substep 1.1, the GPU device information includes a display card model, a device computing capability, a total amount of global memory, and upper and lower limits of grid block thread partition of the device.
Preferably, step 2 comprises the following substeps:
step 2.1, simulation parameters are set at the CPU end, a GPU device group used for SAR echo simulation is selected at the CPU end, the simulation parameters are transmitted to a constant storage area of corresponding GPU devices in the GPU device group in a structural body mode, and the simulation parameters are called by kernel functions for multiple times in the running process of the corresponding GPU devices;
substep 2.2, reading the gray value of the reference picture as the amplitude of the ground target, storing the gray value at the CPU end, and setting a random phase at the CPU end; and transmitting the amplitude and the random phase information of the ground target from the CPU end to a global memory of the corresponding GPU equipment in the GPU equipment group.
Further preferably, in sub-step 2.1, the simulation parameters include carrier frequency of the radar, pulse information, motion information, position information, target point number of a ground scene, scene size, distance interval, and position relation information between the radar and the ground.
Preferably, step 4 comprises the following substeps:
substep 4.1, defining variables and variable spaces distributed by each kernel function, and repeatedly calling the kernel functions to traverse all target points by using a loop;
and substep 4.2, determining a dynamic variable capable of dynamically managing the memory, wherein the dynamic variable is the size of a dynamic space, and the expression is as follows:
(the expression is reproduced as an image in the original document)
compared with the prior art, the invention has the following beneficial effects:
(1) Aiming at the problem of memory limitation of a single GPU, the invention provides a method for using multiple GPUs, successfully solves the problem of insufficient memory space caused by large data processing amount, and reduces simulation time consumption through parallel execution among multiple GPU devices.
(2) Through the analysis of the whole process and the memory use condition of each task, the variable capable of adjusting the memory size is determined, and the memory of the GPU is dynamically managed through the CPU end, so that the utilization rate of the memory resource of the GPU is maximized.
(3) GPU equipment is arranged in different workflows, and a waiting and blocking mechanism is added to the workflows, so that the multiple GPUs cooperate to complete the whole process.
(4) The processing flow is divided into tasks, the access delay is hidden through asynchronous processing among the GPUs, different tasks are operated in parallel, and the operation time is reduced.
(5) And performing data division on the processed data, circularly calling a kernel function to traverse all data domains, and transmitting the data processed in one cycle to another GPU for reprocessing. The reprocessing time is hidden in the next cycle processing time of the original GPU, and then the other GPU waits for the original GPU to finish all cycles and then carries out the next processing, so that the simulation time consumption is effectively reduced.
Drawings
The invention is described in further detail below with reference to the figures and specific embodiments.
Fig. 1 is a schematic flow chart of obtaining SAR echo data by serial processing on a CPU according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a CPU and multi-GPU based heterogeneous platform SAR echo simulation parallel design provided in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a geometric model of data acquired by a striped SAR radar provided by an embodiment of the present invention;
FIG. 4 is a block diagram illustrating the orientation of a target area provided by an embodiment of the present invention;
FIG. 5 is a diagram illustrating a relationship between a GPU memory and a thread according to an embodiment of the present invention;
FIG. 6 (a) is a schematic diagram of thread allocation for kernel1 according to an embodiment of the present invention;
FIG. 6 (b) shows two data structure organization modes of AoS and SoA;
FIG. 7 is a schematic diagram of thread allocation for kernel2 according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of thread allocation for kernel3 and kernel4 according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a kernel5 unfolding technique according to an embodiment of the present invention;
FIG. 10 (a) is an echo signal generated with a reference map as a target region in the present invention;
fig. 10 (b) is an image obtained by processing an echo signal in the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to examples, but it will be understood by those skilled in the art that the following examples are only illustrative of the present invention and should not be construed as limiting the scope of the present invention.
Fig. 1 is a schematic flow chart of obtaining SAR echo data by serial processing on a CPU. As shown in fig. 1, when serial processing is performed on a CPU, firstly, simulation parameters required for SAR echo formation are set; reading image data from an external file, selecting a required picture from the image data as a reference picture, and determining the number of target points according to the reference picture. And then calculating the real-time distance of each target point in the radar running time according to the simulation parameters, judging whether the target point is exactly in the irradiation range of the beam emitted by the radar antenna according to the real-time distance, and calculating the echo of the target point in the beam irradiation range. And finally accumulating the echo data of all target points.
Compared with CPU processing, when the GPU performs parallel processing the GPU and the CPU do not share memory, and the GPU is not suited to directly reading a large amount of CPU memory data, so the CPU data must be transmitted to the GPU memory; in the case of multiple GPUs, the simulation parameters must be transmitted to each GPU as required. In the process of calculating the echo data, a large number of threads are allocated according to the characteristics of the GPU, each thread corresponds to one target point at a certain moment, and the threads compute simultaneously to complete the parallel processing of a large amount of data. Finally, in the process of accumulating the SAR echo data at the GPU end, an additional kernel function is configured to call multiple threads to complete the summation calculation, in order to prevent memory-access conflicts at the GPU end.
As shown in fig. 2, an embodiment of the present invention provides a parallel design method for SAR echo simulation of a heterogeneous platform based on a CPU and multiple GPUs, including the following steps:
step 1, acquiring information of all GPU equipment at a CPU (Central processing Unit) end, determining GPU equipment groups which are positioned on the same node and can perform point-to-point transmission, and selecting the GPU equipment groups used for SAR echo simulation; and parallel task division is carried out on the SAR echo simulation overall flow, the parallel tasks are distributed to the selected GPU equipment groups, GPU equipment in the GPU equipment groups are respectively arranged on the workflow corresponding to the parallel tasks through the CPU end, and task-level parallelism is realized.
Specifically, step 1 comprises the following substeps:
Substep 1.1, at the CPU end, the built-in CUDA function cudaGetDeviceCount(&ngpus) is used to read the number of existing GPU devices and store it in ngpus; the device number dev is set, and the device to be used is selected with cudaSetDevice(dev).
The information of each GPU device (including the graphics card model, the device computing capability, the total amount of global memory, the upper and lower limits of the grid/block/thread partition of the device, and so on) is obtained at the CPU end; a device information structure variable cudaDeviceProp devicep is established, and all the information of a GPU device is stored in it, i.e. when the information of device dev is read into the structure devicep, devicep contains all the information of device dev.
The kernel functions are configured on a per-device basis: in the subsequent steps, the partition range of grids, blocks and threads can be determined according to the acquired GPU device information, and the dynamic memory management of the CUDA program running on the device also operates according to this information.
And 1.2, dividing the communication between the GPU and the GPU into communication on the same node and communication on different nodes according to whether the communication is on the same node or not.
The GPUs located on different nodes are located on different PCI-e transmission lines, and when information is transmitted, the information needs to be forwarded through a CPU host, which is high in delay and high in communication overhead. Especially, when data is copied between devices located on different nodes in the case of data transmission with a large amount of data, the transmission takes a long time.
And GPU equipment located on the same node can directly carry out communication and data transmission. Therefore, the data co-processing among the multiple GPUs is carried out on the same node. A loop traversing all GPU equipment is set at the CPU host end, the CUDA function cudaDeviceCanAccessPeer(&peer_access_available, i, j) is used to judge whether point-to-point communication is possible between GPU devices i and j, the result is stored in peer_access_available, and the GPU device pairs capable of point-to-point communication are represented in coordinate form (i, j). After the GPU equipment group that is located on the same node and can perform point-to-point communication is obtained, the currently used GPU device is set explicitly with cudaSetDevice(int dev); point-to-point access to device j is enabled with cudaDeviceEnablePeerAccess(j, 0); the enabled state is maintained until it is explicitly disabled with cudaDeviceDisablePeerAccess(j).
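A minimal sketch of these two substeps follows, using only standard CUDA runtime calls (cudaGetDeviceCount, cudaGetDeviceProperties, cudaDeviceCanAccessPeer, cudaDeviceEnablePeerAccess); the printed fields and the error-check macro are illustrative choices rather than the patent's code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess)                                       \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
    } while (0)

int main() {
    int ngpus = 0;
    CUDA_CHECK(cudaGetDeviceCount(&ngpus));            // number of visible GPU devices

    for (int dev = 0; dev < ngpus; ++dev) {
        cudaDeviceProp deviceProp;
        CUDA_CHECK(cudaGetDeviceProperties(&deviceProp, dev));
        printf("GPU %d: %s, compute %d.%d, global mem %zu MB, max threads/block %d\n",
               dev, deviceProp.name, deviceProp.major, deviceProp.minor,
               deviceProp.totalGlobalMem >> 20, deviceProp.maxThreadsPerBlock);
    }

    // Record every (i, j) pair that supports direct peer-to-peer transfer and enable it.
    for (int i = 0; i < ngpus; ++i) {
        for (int j = 0; j < ngpus; ++j) {
            if (i == j) continue;
            int peer_access_available = 0;
            CUDA_CHECK(cudaDeviceCanAccessPeer(&peer_access_available, i, j));
            if (peer_access_available) {
                CUDA_CHECK(cudaSetDevice(i));                  // make device i current
                CUDA_CHECK(cudaDeviceEnablePeerAccess(j, 0));  // stays enabled until disabled
                printf("P2P enabled: (%d, %d)\n", i, j);
            }
        }
    }
    return 0;
}
```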
Substep 1.3, multiple GPU devices are placed in different workflows, and synchronization and asynchrony between the multiple GPU devices can be achieved by operating and blocking the workflows at different times. Synchronization and asynchrony among multiple GPU devices hide part of the memory-access delay, realize concurrent execution of tasks, and reduce the data-processing time. As shown in fig. 2, GPU0 is placed in Stream0, GPU1 in Stream1, and GPUN in StreamN; this means that for the task assigned to GPU0, when a kernel function in that task is called, the fourth launch-configuration parameter of the kernel function is Stream0, and GPU1 and the other GPUs are handled in the same way as GPU0.
Due to the limitation of the available GPU hardware, only two GPU devices are used for the data processing here; the tasks located on GPU1 can be extended to more GPU devices under the same or similar logic.
The SAR echo simulation flow is analyzed, the relations among all tasks are determined, and the tasks are divided according to their dependency and independence. Dependency means that a causal relationship exists between two tasks, with some or all of the data shared between them. Independence means that the two tasks can exchange their processing order without any impact on the result. As shown in fig. 2, the tasks are divided so that kernel1 and kernel2 are independent of each other and can be distributed to the GPU equipment group located on the same node to run, realizing task-level parallel processing; kernel4, which depends on kernel3 (both are related to the preceding data), is placed on device GPU1. Although the data of kernel5 are related to the data on GPU1, kernel5 must wait for kernel2, kernel3 and kernel4 to complete before it can run; taking into account the memory consumption and the computation that can be overlapped between GPU0 and GPU1, kernel5 is placed on device GPU0. Through asynchronous transmission between GPU0 and GPU1, kernel5 runs at the same time as kernel2, kernel3 and kernel4, hiding part of the data-processing time. A blocking function is set after kernel5, and data processing continues only after the kernel functions on GPU1 have finished running, ensuring that no data are lost. Kernel6 and kernel7 are placed on device GPU0 in view of the data dependency and load balancing.
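A minimal sketch of this stream layout follows, assuming at least two GPUs on one node; the dummy kernels, buffer sizes and launch configurations are placeholders rather than the patent's kernels, and only the ordering of the independent launches, the event record and the cross-device wait is illustrated.

```cuda
#include <cuda_runtime.h>

// Dummy kernels standing in for the independent tasks of Fig. 2 (e.g. kernel1 on GPU0,
// kernel2 on GPU1); their bodies are placeholders.
__global__ void taskA(double *out) { out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0; }
__global__ void taskB(double *out) { out[blockIdx.x * blockDim.x + threadIdx.x] = 2.0; }

int main() {
    const int N = 1 << 20;                       // assumes at least two GPUs on the node
    double *d0 = nullptr, *d1 = nullptr;
    cudaStream_t stream0, stream1;
    cudaEvent_t gpu1_done;

    cudaSetDevice(0);
    cudaStreamCreate(&stream0);
    cudaMalloc(&d0, N * sizeof(double));

    cudaSetDevice(1);
    cudaStreamCreate(&stream1);
    cudaMalloc(&d1, N * sizeof(double));
    cudaEventCreateWithFlags(&gpu1_done, cudaEventDisableTiming);

    // Independent tasks run concurrently, each in its own device's stream.
    cudaSetDevice(0);
    taskA<<<N / 256, 256, 0, stream0>>>(d0);
    cudaSetDevice(1);
    taskB<<<N / 256, 256, 0, stream1>>>(d1);
    cudaEventRecord(gpu1_done, stream1);         // marks the end of GPU1's task chain

    // GPU0's workflow blocks only where the dependency requires it.
    cudaSetDevice(0);
    cudaStreamWaitEvent(stream0, gpu1_done, 0);  // cross-device wait on GPU1's event
    // ... dependent kernels (e.g. kernel5 of Fig. 2) would be launched into stream0 here ...

    cudaStreamSynchronize(stream0);
    cudaSetDevice(1);
    cudaStreamSynchronize(stream1);
    return 0;
}
```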
Step 2, setting simulation parameters at the CPU end, acquiring an original image at the CPU end, and selecting a reference picture from it; according to the task division in step 1 and the usage characteristics of the data in the simulation process, the simulation parameters are transmitted from the CPU end to a constant storage area of the GPU equipment group, and the information of the reference picture is transmitted from the CPU end to a global memory of the GPU equipment group;
specifically, the step 2 specifically includes the following substeps:
and substep 2.1, setting simulation parameters at the CPU end, wherein the simulation parameters comprise carrier frequency, pulse information, motion information, position information of the radar, target point number of a ground scene, scene size, distance interval, mutual position relation between the radar and the ground and other information. And sequentially selecting GPU equipment to be used on the CPU host side, and respectively transmitting the simulation parameters to the constant storage areas of the GPU0 and the GPU1 in the form of structural bodies. The simulation parameters are called by the kernel function for many times in the operation process of the GPU, if the simulation parameters are stored in the global memory, repeated reading and writing are needed in the process of calling the kernel function, the memory access times are increased, the performance of the kernel function is influenced, the values of the parameters are easily and carelessly changed in the program implementation, and unknown errors are not easy to be perceived. The value stored in the constant storage area can ensure that the simulation parameters are stable and unchanged, and information can be shared among different kernel functions, and the transmission rate is higher than that of the global memory. When data are transmitted to the constant storage area in a structural body form, the parameters can be completely transferred only by calling data transmission of a CPU host side and a GPU equipment side once, so that repeated calling is avoided; and when the arguments are written into the kernel function, only one time of writing into the structure body is needed, so that the use of a plurality of arguments and arguments is avoided, and the programming difficulty is reduced.
Substep 2.2, the original image of 2226 × 4007 is read and written into a data file with a MATLAB tool at the CPU end, and the data type is set to int. A picture with a starting point of (1000, 1000) and a size of 512 × 512 is selected from the original image as the reference picture. The gray values of the reference picture are read from the data file as the amplitudes of the ground targets and stored at the CPU end. A random phase in [-2π, 2π] is set at the CPU end using a random function. The amplitude and random-phase information of the ground targets is transmitted from the CPU end to the Global Memory of GPU1 for the subsequent echo-coefficient formation.
Step 3, taking the reference picture obtained in step 2 as the target area, as shown in fig. 4, performing azimuth partitioning on the target area and implementing data partitioning after the azimuth partitioning, where the black solid dots on the X axis in fig. 4 are defined as the partition boundaries; the azimuth blocks are placed in the workflows corresponding to the parallel tasks to realize data-level parallelism.
In the process of beam scanning of the SAR radar, the beam coverage is influenced by the radar height H and the 3 dB beam width BeamWide of the radar. Within the whole azimuth operation time of the radar, the area block P–Q is irradiated only during the azimuth time A–C shown in fig. 4 and cannot be irradiated during the remaining time. Therefore, the corresponding radar running time is obtained through the region-division calculation, the calculation for the non-irradiated time can be avoided, and the redundant computation is removed.
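As a hedged illustration of the geometry behind this pruning (assuming a broadside stripmap geometry with platform velocity v and slant range R to the block, symbols introduced here rather than taken from the patent), the illuminated azimuth window of a block can be written as:

```latex
% Approximate azimuth half-width of the 3 dB beam footprint at slant range R
\[
  L_a \;\approx\; R\,\tan\!\left(\frac{\mathrm{BeamWide}}{2}\right)
\]
% A block spanning azimuth positions [x_P, x_Q] is illuminated only while the
% platform azimuth position x(t) = v\,t satisfies
\[
  x_P - L_a \;\le\; v\,t \;\le\; x_Q + L_a ,
\]
% i.e. during the azimuth interval A--C of Fig. 4; pulses outside this window
% need not be simulated for the block.
```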
Partitioning in azimuth enables data parallelism and reduces redundant calculation; it also facilitates dynamic management of the space allocated to variables in the subsequent steps and, further, dynamic management of the memory.
Step 4, determining the function of each kernel function, the kernel-function variables and the kernel-function allocation space on each GPU device, and determining the variables capable of dynamically managing the memory, so as to maximize the utilization of GPU device resources.
The step 4 specifically comprises the following substeps:
Substep 4.1, the variables and variable spaces allocated to each kernel function are specified according to the flow chart in fig. 2. Because the information of each target point in the scene area is independent of the information of the other points and there is no dependency between them, the number of points size_point calculated by each call of the kernel function is controllable, so the kernel function can be called repeatedly in a loop to traverse all the target points. Moreover, because this controllable variable also affects the space occupied by several variables in several kernel functions, it is very suitable as the variable for dynamically managing the memory.
Substep 4.2, the global memory information of the GPU device obtained in step 1 is stored in the structure devicep, and the global memory size is read from devicep. The relationship between the memory distribution and the grids, blocks and threads of the GPU device is shown in fig. 5. The kernel functions, kernel-function variables and kernel-function allocation spaces used on GPU1 are as follows.
The Kernel2 kernel function is named point_points; its functions are to determine the three-dimensional coordinates of the positions of the target points in an azimuth block of the scene, to determine the coordinates of the radar motion starting point corresponding to the azimuth block, and to determine the echo coefficients of the target points in the azimuth block. The array variables allocated to this function are d_point_x, d_point_y, d_point_z, point_start and d_ampc; the allocated spaces are sizeof(double) × num_cut for each of the three coordinate arrays, sizeof(int) for point_start, and sizeof(double2) × num_cut for d_ampc, where num_cut = cut_imag_num × data_nrn represents the number of target points within the azimuth block.
The Kernel3 kernel function is named Para_PerPul; its functions are to calculate, for each target point and each azimuth moment of the radar, the three-dimensional spatial offset, the instantaneous slant range and the instantaneous offset angle from the beam centre, to determine whether the target point lies within the beam, and to determine which concentric ring the target point lies in at a given azimuth moment. The register variables deltaX, deltaY and deltaZ are established inside the kernel function and represent the offsets in the three spatial directions (X, Y and Z). The array variables allocated to this function are the one-dimensional array Rt of instantaneous slant ranges, the one-dimensional array squint_scene of instantaneous angles, the one-dimensional array judge of the angle judgment, and the one-dimensional array in_circle of the concentric-ring positions; the allocated global memory sizes are size_point × L_number_cut × sizeof(double) for Rt and squint_scene, and size_point × L_number_cut × sizeof(int) for judge and in_circle. The size of the space allocated by the Kernel3 kernel function is related to size_point, and size_point controls the use of the global memory, thereby achieving dynamically allocated space.
The Kernel4 kernel function is named echo_format; its function is to calculate the amplitude of the formed echo signal and to construct the function

temp(mT, R_t) = scope · exp(−j · 4π · R(mT, R_t) / λ)

where scope is the amplitude of the signal, R_t is the slant range, mT is the azimuth operating time of the radar, temp(·) is the echo received by the radar controlled by the variables mT and R_t, j is the imaginary unit, λ is the wavelength of the transmitted signal, and R(·) is the slant range between the radar and the target point controlled by the variables mT and R_t. Finally, the function value is divided into an integer part and a decimal part, and the superposition is performed with the atomic addition (atomicAdd) operation. The array variables allocated to this function are d_temp1 and d_temp2, and the allocated space size is Nrn × Nan × sizeof(int2).
The kernel functions, the functions of the kernel functions, the kernel function variables, and the kernel function allocation space used on the GPU0 are as follows.
The Kernel1 kernel function is named radar_pos; its function is to determine the three-dimensional coordinates of the radar positions. The three array variables allocated to this function are d_pos_x, d_pos_y and d_pos_z, and the allocated space is the product of the number of radar azimuth sampling points and the data type size, i.e. sizeof(double) × Nan.
The Kernel5 kernel function is named Int2toDouble2_kernel; its function is to convert the data obtained in Kernel4 into floating-point type. The array variable allocated to this function is d_xa, and the allocated space size is Nan × Nrn × sizeof(double2).
The Kernel6 kernel function is named Sref_format; its function is to multiply, in the frequency domain, by the frequency-modulation term of the range direction. Outside the kernel, cuFFT in CUDA is called to perform the inverse Fourier transform and obtain the complete time-domain echo signal. The array variable allocated to this function is d_sref, and the allocated space size is Nrn × sizeof(double2).
The Kernel7 kernel function is named complete_escape; its function is to separate the real and imaginary parts of the formed floating-point echo signal so that the data can be read conveniently in MATLAB software. The array variable allocated to this function is d_scalar, and the allocated space size is Nrn × Nan × sizeof(double) × 2.
Substep 4.3, determining the variable that can be dynamically adjusted according to the information obtained above, calculating it, determining its size and processing it accordingly. The dynamic memory variable (dynamic space size) is denoted dynamic_variable, and the dynamic memory size is calculated by the following method (the calculation method and the specific expression are reproduced as images in the original document).
In the expression, 80% of the global memory is used for the calculation because there is extra overhead and implicit use of the global memory, so a small portion of the global memory is left for that purpose. The obtained dynamic memory variable dynamic_variable is then processed mathematically to obtain the largest power of two smaller than this value; its upper limit is controlled so that it does not exceed the maximum number of target points in a block, and the final result is stored in size_point, which is then used for space allocation and loop-count control. The purpose is that when all points are traversed in the loop, the obtained loop count num_size is a positive integer rather than a decimal, so that no extra point calculation is needed after the loop completes; this avoids wasting computing resources and maximizes the utilization of GPU hardware resources.
Step 5, configuring threads for each kernel function, realizing thread-level parallelism, and ensuring that the kernel functions have optimal performance on the basis of following a certain principle; and defining the index content and the data organization mode in the kernel, and optimizing the internal calculation of the kernel function to optimize the performance of the kernel function.
The step 5 specifically comprises the following substeps:
and substep 5.1, configuring a grid block thread for each kernel function, wherein the configuration follows the following principle, and the kernel function is ensured to have better performance.
(1) The length of the grid and the block configured for the kernel function in each dimension should not exceed the limit of the GPU device, but at the same time, the closer the configuration is to the limit of the GPU device, the better the performance is.
(2) The number of threads configured should be an integral multiple of 32 and preferably greater than 4 x 32 threads. Because the GPU processes threads on a hardware Streaming Multiprocessor (SM) in units of one thread bundle (warp). One thread bundle is composed of 32 threads, and four thread bundles are arranged on the SM to be processed in parallel, so that memory access delay can be effectively hidden, and the core performance is improved. The more bundles placed on the SM, the better the effect of hiding the memory access delay.
(3) The performance of each kernel is affected by the configuration, and multiple tests are required to obtain the optimal configuration. And running the kernel function by using different configurations, and testing the running time of the kernel function for multiple times by using a Visual Profiler tool to obtain the configuration with the shortest time consumption.
Substep 5.2, the Kernel1 kernel function is configured with the threads in the block as threads_pos(512, 1) and the blocks in the grid as blocks_pos((Nan + threads_pos.x - 1)/threads_pos.x, threads_pos.y); the block configuration adjusts automatically according to the thread configuration, as shown schematically in fig. 6(a). The relationship of the three-dimensional radar position in the Cartesian coordinate system is shown in fig. 3, and the three-dimensional position information is calculated inside the kernel. A thread index, denoted tid, is established inside the kernel; according to the kernel configuration, the index is determined by the thread offset in the X direction, the block offset in the X direction and the block size, calculated as tid = blockIdx.x * blockDim.x + threadIdx.x, with index range [0, Nan-1]. The index is used to traverse the variable address space, assign values to the array elements and perform the data calculations.
When establishing the three-dimensional coordinates of the radar positions, there are two data-organization modes, and different modes have different influences on kernel performance. The first is the array-of-structures (AoS) layout and the second is the structure-of-arrays (SoA) layout, as shown in fig. 6(b). The AoS organization causes non-aligned, non-contiguous accesses when the x, y and z arrays of the radar coordinate positions are accessed, which seriously affects the memory read/write speed and reduces kernel performance. The SoA organization effectively avoids the drawbacks of AoS: while storing the radar position coordinates, aligned and contiguous memory accesses can be performed, making the kernel more efficient. This kernel function not only optimizes memory access but also completes the calculation of the radar position information once, avoiding repeated calculation in subsequent steps; when used later, it can be called block by block according to the indexes.
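A hedged sketch of such a Kernel1 in the SoA layout follows; the straight-line trajectory model and the argument list are illustrative assumptions, not the patent's code.

```cuda
// Kernel1 (radar_pos) in the SoA layout of Fig. 6(b): three separate coordinate arrays
// give aligned, coalesced stores, one thread per azimuth sample.
__global__ void radar_pos(double *d_pos_x, double *d_pos_y, double *d_pos_z,
                          double x0, double y0, double z0,
                          double vx, double prt, int Nan) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // azimuth sample index, range [0, Nan-1]
    if (tid < Nan) {
        double t = tid * prt;          // slow time of this azimuth sample
        d_pos_x[tid] = x0 + vx * t;    // assumed straight-line motion along the azimuth axis
        d_pos_y[tid] = y0;
        d_pos_z[tid] = z0;             // constant flight height
    }
}

// Launch configuration matching substep 5.2 (512 threads per block, grid covering Nan):
//   dim3 threads_pos(512, 1);
//   dim3 blocks_pos((Nan + threads_pos.x - 1) / threads_pos.x, threads_pos.y);
//   radar_pos<<<blocks_pos, threads_pos>>>(d_pos_x, d_pos_y, d_pos_z, x0, y0, z0, vx, prt, Nan);
```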
Substep 5.3, configuring the Kernel2 kernel function: the threads in the block are configured as threads_point(32, 16), and the blocks in the grid are configured accordingly (the block-configuration expression is reproduced as an image in the original document), as shown in fig. 7. There are three indexes in the kernel. The first index is distributed transversely across the threads and represents a target point of the azimuth block; it is denoted tidx and calculated as tidx = blockIdx.x * blockDim.x + threadIdx.x. The second index is distributed longitudinally across the threads and represents a target point in the range direction of the target scene; it is denoted tidy and calculated as tidy = blockIdx.y * blockDim.y + threadIdx.y. The third index is oriented to the start-point coordinates of each block after azimuth partitioning and determines the offset of the blocked target point within the whole scene; it is denoted tidx_cut and calculated as tidx_cut = tidx + count_cut × cut_imag_num, where count_cut indicates which block of the target area is being processed. The three-dimensional coordinates of the target points are determined by the thread indexes, and the calculation results are stored in arrays by rows; when called, they are read by rows to form aligned, contiguous memory accesses, which reduces strided accesses and gives the kernel function good performance. The start-point coordinate pos_start[0] of each azimuth block is determined according to the thread index. Finally, the three indexes are used to select the gray values of the reference picture previously transmitted into GPU1 to form the echo coefficients of the target points in the area.
Substep 5.4, configuring the Kernel3 kernel function: the threads configured in the block are threads_Rt(256, 1), and the blocks configured in the grid are blocks_Rt((size_point + threads_Rt.x - 1)/threads_Rt.x, L_number_cut); the division is shown in fig. 8. The threads in the transverse direction represent the number of processed target points, dynamically adjusted by size_point, and the longitudinal direction represents the number of azimuth sampling points of the radar within the block. There are four indexes in the function: the transverse and longitudinal thread indexes tidx and tidy, the index tidy_cut of the radar azimuth coordinate corresponding to the block, and the index tid over all target points in the block. The index tidy_cut of the specific radar position is obtained jointly from the radar start-point coordinate pos_start[0] calculated by kernel2 and the longitudinal index blockIdx.y, with tidy_cut = pos_start[0] + blockIdx.y; this index is used to read the radar positions calculated in kernel1. The index tidx reads the position information of the target points in the scene block, which is combined with the radar position information to obtain the three-dimensional offsets deltax, deltay and deltaz between the radar and the target points in the area at different azimuth moments. From the three-dimensional offsets, the instantaneous slant range Rt and the instantaneous angle squint_scene are calculated according to the geometric relationship, and it is then judged whether the target point lies within the 3 dB beam-width angle and in which concentric circle it lies. The information calculated in the kernel function is written by the index tid into the specified arrays Rt, squint_scene, judge and in_circle.
Substep 5.5, configuring the Kernel4 kernel function: the threads in the block are arranged as threads_temp(256, 1), and the blocks in the grid as blocks_temp((size_point + threads_temp.x - 1)/threads_temp.x, L_number_cut). The transverse threads represent the target points in the block, with the point count dynamically adjusted by size_point, and the longitudinal threads represent the number of azimuth sampling points of the corresponding radar block; the thread division is the same as for Kernel3, as shown in fig. 8. The in-kernel indexes are similar to those of Kernel3 and are not described again. To ensure the accuracy of the calculation result and the integrity of each piece of information during superposition, the echo information is superposed using the atomic addition operation. However, the atomic addition operation used here can only accumulate integers, so to avoid loss of precision the echo information is divided into an integer part and a decimal part that are accumulated separately, with the accumulated results placed in d_temp1 and d_temp2 respectively. During superposition, points located in the same concentric ring are placed in the same storage area, and the index used in the calculation is composed of the array in_circle calculated in kernel3 and the in-kernel thread index tidy.
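A hedged sketch of this split-integer accumulation follows; the scale factor applied to the fractional part is an assumption, since the patent does not state how the decimal part is represented for integer atomicAdd.

```cuda
// The complex echo sample is separated into an integer part and a scaled fractional part
// so that integer atomicAdd can be used. FRAC_SCALE is an assumption, and the fractional
// accumulator is assumed not to overflow for the number of additions per cell.
#define FRAC_SCALE 1000000.0

__device__ void atomicAddSplit(int2 *d_temp1, int2 *d_temp2, int cell,
                               double re, double im) {
    int re_i = (int)trunc(re);                 // integer parts of real / imaginary components
    int im_i = (int)trunc(im);
    atomicAdd(&d_temp1[cell].x, re_i);
    atomicAdd(&d_temp1[cell].y, im_i);

    // Fractional parts, scaled to integers before accumulation.
    atomicAdd(&d_temp2[cell].x, (int)llrint((re - re_i) * FRAC_SCALE));
    atomicAdd(&d_temp2[cell].y, (int)llrint((im - im_i) * FRAC_SCALE));
}
```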
Substep 5.6, configuring the Kernel5 kernel function: the threads in the block are arranged as threads_temp1(32, 16), and the blocks in the grid are configured according to the configured threads (the block-configuration expression is reproduced as an image in the original document). The transverse thread index inside the kernel represents the range direction [0, Nrn-1], and the longitudinal index represents the azimuth direction [0, Nan-1]. The thread division is the same as for Kernel2, except that the numbers of blocks in the transverse and longitudinal directions differ. The kernel integrates the integer part and the fractional part superposed in Kernel4 into a floating-point type. To increase the data throughput during the integration, an unrolling technique is used so that one thread operates on 16 memory locations at the same time, which greatly improves the kernel performance, as shown in fig. 9. Fig. 9 shows a two-dimensional 16-fold unrolling in which one thread can access 16 different addresses. For the unrolling, the two-dimensional joint index range of the indexes tidx and tidy changes from one block to sixteen blocks, and the transverse and longitudinal indexes each become four times the original. The index tidx changes from the original tidx = blockIdx.x * blockDim.x + threadIdx.x to tidx = blockIdx.x * blockDim.x * 4 + threadIdx.x, and the index tidy is modified similarly. After the indexes are modified, in order for one thread to index 16 different addresses, the upper limits of tidx and tidy must also be limited to prevent out-of-bounds access. In the subsequent calculations, corresponding offsets are needed to index the 16 different addresses: 4 × 4 = 16 offsets are performed, each offset increasing sequentially by blockDim.x or blockDim.y.
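A hedged sketch of such a 16-fold unrolled conversion kernel follows; the reassembly of integer and scaled fractional parts mirrors the assumption made in the accumulation sketch above and is not taken from the patent.

```cuda
#define FRAC_SCALE 1000000.0   // same assumed scale factor as in the accumulation sketch

// 16-fold unrolling (Fig. 9): with threads_temp1(32, 16), each thread handles a 4x4
// pattern of cells offset by blockDim.x / blockDim.y, converting the split-integer
// representation back to double-precision complex samples.
__global__ void Int2toDouble2_kernel(const int2 *d_temp1, const int2 *d_temp2,
                                     double2 *d_xa, int Nrn, int Nan) {
    int tidx = blockIdx.x * blockDim.x * 4 + threadIdx.x;   // range index, unrolled x4
    int tidy = blockIdx.y * blockDim.y * 4 + threadIdx.y;   // azimuth index, unrolled x4

    for (int ox = 0; ox < 4; ++ox) {
        for (int oy = 0; oy < 4; ++oy) {
            int x = tidx + ox * blockDim.x;                  // sequential offsets by blockDim.x
            int y = tidy + oy * blockDim.y;                  // sequential offsets by blockDim.y
            if (x < Nrn && y < Nan) {                        // bound the indexes, no out-of-range access
                int idx = y * Nrn + x;
                d_xa[idx].x = d_temp1[idx].x + d_temp2[idx].x / FRAC_SCALE;
                d_xa[idx].y = d_temp1[idx].y + d_temp2[idx].y / FRAC_SCALE;
            }
        }
    }
}

// Grid for the unrolled kernel: each block now covers a 4x-wider tile in both directions.
//   dim3 threads_temp1(32, 16);
//   dim3 blocks_temp1((Nrn + 32 * 4 - 1) / (32 * 4), (Nan + 16 * 4 - 1) / (16 * 4));
```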
Step 6, respectively calling kernel functions on GPU equipment according to task division, and finishing accumulation of echo information through data transmission and communication of cross equipment; and constructing a frequency domain matching function at the CPU end, transmitting the frequency domain matching function to GPU equipment, finishing final calculation, and finally writing the echo data into a file through the CPU end.
The step 6 specifically comprises the following substeps:
and substep 6.1, setting the current equipment as GPU0 at the CPU end, calling a Kernel function Kernel1, and setting an asynchronous transmission function thereafter to ensure that the data transmission is completed before the operation of the Kernel2 is completed. Changing the current device to be GPU1 at a CPU end, calling a Kernel function Kernel2 in a first-layer loop, and circularly traversing all blocks, wherein the number of times of the loop nan _ cut _ num is determined by the azimuth length of a read-in image and the length of block division, and nan _ cut _ num = data _ nan/cut _ imag _ num. And setting an internal loop in the loop, calling the kernel functions kernel3 and kernel4 in sequence, cycling through points in the blocks, wherein the cycle times num _ size can be dynamically adjusted by size _ point, and num _ size = num _ cut/size _ point. And after calling the kernel4, changing the current equipment, selecting a GPU0 to be used, transmitting the result in the kernel4 to the kernel5 through asynchronous transmission, changing the current equipment into a GPU1, and ensuring that the innermost layer circularly traverses the points in the blocks. And setting a blocking function after kernel5, blocking GPU0, and waiting for GPU1 to finish all operations.
Substep 6.2, the frequency-modulation term of the range direction is constructed at the CPU end (the expression is reproduced as an image in the original document), where j is the imaginary unit, γ is the chirp rate of the linear frequency-modulated signal, and f_γ is the chirp signal frequency. The current device is changed to GPU0, and the data are transmitted to the global memory of GPU0 over the PCIe line using the CUDA function cudaMemcpy. Kernel6 is then called to complete all the calculations and realize the frequency-domain multiplication.
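A hedged sketch of the frequency-domain multiply and the cuFFT inverse transform follows; the kernel signature and launch bookkeeping are assumptions, while cufftPlan1d, cufftSetStream and cufftExecZ2Z are standard cuFFT API (the program must be linked with -lcufft).

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

// Frequency-domain multiply in the spirit of kernel6 / Sref_format, followed by cuFFT's
// inverse transform along the range direction to recover the time-domain echo.
__global__ void Sref_format(double2 *d_xa, const double2 *d_sref, int Nrn, int Nan) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;   // range frequency bin
    int a = blockIdx.y * blockDim.y + threadIdx.y;   // azimuth line
    if (r < Nrn && a < Nan) {
        double2 x = d_xa[a * Nrn + r], h = d_sref[r];
        d_xa[a * Nrn + r] = make_double2(x.x * h.x - x.y * h.y,   // complex multiplication
                                         x.x * h.y + x.y * h.x);
    }
}

void range_ifft(double2 *d_xa, int Nrn, int Nan, cudaStream_t stream) {
    cufftHandle plan;
    cufftPlan1d(&plan, Nrn, CUFFT_Z2Z, Nan);          // batched 1-D transforms along range
    cufftSetStream(plan, stream);
    cufftExecZ2Z(plan, reinterpret_cast<cufftDoubleComplex *>(d_xa),
                 reinterpret_cast<cufftDoubleComplex *>(d_xa), CUFFT_INVERSE);
    cufftDestroy(plan);
}
```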
Substep 6.3, the data obtained in kernel6 are transmitted to kernel7 to realize the separation of the real and imaginary parts, stored in the GPU global memory, and transmitted to the CPU end through the PCIe line. The obtained data are written into a data file at the CPU end and read and displayed in MATLAB software, as shown in FIG. 10(a); the image obtained by processing the echo signal is shown in FIG. 10(b).
Although the present invention has been described in detail in this specification with reference to specific embodiments and illustrative embodiments, it will be apparent to those skilled in the art that modifications and improvements can be made thereto based on the present invention. Accordingly, it is intended that all such modifications and alterations be included within the scope of this invention as defined in the appended claims.

Claims (6)

1. The heterogeneous platform SAR echo simulation parallel method based on the CPU and the multiple GPUs is characterized by comprising the following steps of:
step 1, acquiring information of all GPU equipment at a CPU (Central processing Unit) end, determining GPU equipment groups which are positioned on the same node and can perform point-to-point transmission, and selecting the GPU equipment groups used for SAR echo simulation; parallel task division is carried out on the SAR echo simulation overall flow, parallel tasks are distributed to a selected GPU equipment group, GPU equipment in the GPU equipment group is respectively arranged on work flows of corresponding parallel tasks through a CPU end, and task-level parallelism is achieved;
step 2, setting simulation parameters at the CPU end, and transmitting the simulation parameters from the CPU end to a constant storage area of the GPU equipment group; acquiring an original image at the CPU end, selecting a reference picture from the original image, and transmitting the information of the reference picture from the CPU end to a global memory of the GPU equipment group;
step 3, taking the reference picture as a target area, performing azimuth partitioning on the target area, and placing the azimuth partitioning in a workflow corresponding to a parallel task to realize data level parallelism;
step 4, defining the kernel function, the kernel function variable and the kernel function distribution space on each GPU device, and determining the dynamic variable capable of dynamically managing the memory;
step 5, configuring a thread for each kernel function to realize thread-level parallelism; defining the content of the index in the kernel and a data organization mode, and optimizing the internal calculation of the kernel function;
step 6, respectively calling kernel functions on corresponding GPU equipment according to the divided parallel tasks, and finishing the accumulation of SAR echo data through data transmission and communication in a GPU equipment group; and constructing a frequency domain matching function at the CPU end, transmitting the frequency domain matching function to the GPU equipment group, finishing final calculation, and finally writing the SAR echo data into a file through the CPU end.
2. The CPU and multi-GPU based heterogeneous platform SAR echo simulation parallel method according to claim 1, characterized in that step 1 comprises the following substeps:
substep 1.1, acquiring the number of the existing GPU equipment and the information of each GPU equipment at a CPU end;
step 1.2, GPU equipment on the same node forms a GPU equipment group, and GPU equipment in the GPU equipment group directly carries out communication and data transmission; setting a loop traversing all GPU equipment at a CPU end, judging whether the GPU equipment i and j can carry out point-to-point communication by using a CUDA function, representing the GPU equipment which can carry out point-to-point communication in the form of coordinates (i, j), and selecting a GPU equipment group used for SAR echo simulation;
substep 1.3, respectively placing the GPU equipment in the GPU equipment group on the workflow corresponding to the parallel task, and realizing synchronization and asynchronization between different GPU equipment by operating and blocking the workflow at different time; and placing the tasks which are independent and independent of each other on the corresponding GPU equipment in the GPU equipment group.
3. The CPU and multi-GPU based heterogeneous platform SAR echo simulation parallel method according to claim 2, characterized in that in substep 1.1, the GPU device information contains the type of a display card, the device computing capacity, the total amount of global memory, and the upper and lower limits of grid block thread division of the device.
4. The CPU and multi-GPU based heterogeneous platform SAR echo simulation parallel method according to claim 1, characterized in that step 2 comprises the following substeps:
step 2.1, simulation parameters are set at the CPU end, a GPU device group used for SAR echo simulation is selected at the CPU end, the simulation parameters are transmitted to a constant storage area of a corresponding GPU device in the GPU device group in a structural body mode, and the simulation parameters are called by kernel functions for multiple times in the running process of the corresponding GPU device;
step 2.2, reading the gray value of the reference picture as the amplitude of the ground target, storing the amplitude at the CPU end, and setting a random phase at the CPU end; and transmitting the amplitude and the random phase information of the ground target from the CPU end to a global memory of corresponding GPU equipment in the GPU equipment group.
5. The CPU and multi-GPU based heterogeneous platform SAR echo simulation parallel method according to claim 4, characterized in that in substep 2.1, the simulation parameters comprise the carrier frequency, pulse information, motion information and position information of the radar, the number of target points of the ground scene, the scene size, the distance interval, and the mutual position relation information between the radar and the ground.
6. The CPU and multi-GPU based parallel method for heterogeneous platform SAR echo simulation according to claim 1, characterized in that step 4 comprises the following substeps:
substep 4.1, defining variables and variable spaces distributed by each kernel function, and repeatedly calling the kernel functions to traverse all target points by using a loop;
and substep 4.2, determining a dynamic variable capable of dynamically managing the memory, wherein the dynamic variable is the size of a dynamic space, and the expression is as follows:
(the expression is reproduced as an image in the original document)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910794748.6A CN110515053B (en) 2019-08-27 2019-08-27 CPU and multi-GPU based heterogeneous platform SAR echo simulation parallel method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910794748.6A CN110515053B (en) 2019-08-27 2019-08-27 CPU and multi-GPU based heterogeneous platform SAR echo simulation parallel method

Publications (2)

Publication Number Publication Date
CN110515053A CN110515053A (en) 2019-11-29
CN110515053B true CN110515053B (en) 2023-02-17

Family

ID=68627999

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910794748.6A Active CN110515053B (en) 2019-08-27 2019-08-27 CPU and multi-GPU based heterogeneous platform SAR echo simulation parallel method

Country Status (1)

Country Link
CN (1) CN110515053B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111289975B (en) * 2020-01-21 2022-04-22 博微太赫兹信息科技有限公司 Rapid imaging processing system for multi-GPU parallel computing
CN111429332A (en) * 2020-03-23 2020-07-17 成都纵横融合科技有限公司 GPU-based rapid laser point cloud three-dimensional calculation method
CN112698290A (en) * 2020-12-10 2021-04-23 南京长峰航天电子科技有限公司 Large-scene SAR echo simulation parallel processing method and system
CN113075703B (en) * 2021-04-01 2022-11-01 西安电子科技大学 Multi-channel satellite signal tracking method
CN117453421B (en) * 2023-12-18 2024-03-19 北京麟卓信息科技有限公司 GPU full-chip storage bandwidth measurement method based on data segmentation


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103176170A (en) * 2013-02-06 2013-06-26 中国科学院电子学研究所 SAR (synthetic aperture radar) echo simulating method based on GPU (graphics processing unit) parallel computing
CN107229051A (en) * 2017-05-26 2017-10-03 西安电子科技大学 Video SAR echo simulation Parallel Implementation methods based on GPU

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GPU-accelerated fast simulation method for bistatic SAR echoes (基于GPU加速的双基地SAR回波快速仿真方法); 陈麒 (Chen Qi) et al.; Computer Simulation (《计算机仿真》); 2017-04-30 (No. 04); pp. 1-4, 35 *

Also Published As

Publication number Publication date
CN110515053A (en) 2019-11-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant