CN113885028A - Multi-server multi-GPU-based satellite-borne SAR imaging real-time processing method - Google Patents


Info

Publication number
CN113885028A
CN113885028A
Authority
CN
China
Prior art keywords
gpu, data, radar, imaging, cpu host
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111096539.8A
Other languages
Chinese (zh)
Inventor
梁毅
付少雄
王文杰
刘恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202111096539.8A priority Critical patent/CN113885028A/en
Publication of CN113885028A publication Critical patent/CN113885028A/en

Classifications

    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01S: RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S 13/00: Systems using the reflection or reradiation of radio waves, e.g. radar systems; Analogous systems using reflection or reradiation of waves whose nature or wavelength is irrelevant or unspecified
    • G01S 13/88: Radar or analogous systems specially adapted for specific applications
    • G01S 13/89: Radar or analogous systems specially adapted for specific applications for mapping or imaging
    • G01S 13/90: Radar or analogous systems specially adapted for specific applications for mapping or imaging using synthetic aperture techniques, e.g. synthetic aperture radar [SAR] techniques
    • G01S 13/9021: SAR image post-processing techniques

Abstract

The invention discloses a multi-server multi-GPU-based satellite-borne SAR imaging real-time processing method, which comprises the following steps: first, the satellite-borne SAR (synthetic aperture radar) echo data are read on any one CPU (central processing unit) host, and the radar echo signal data are divided among a plurality of CPU hosts using password-free communication between them; each CPU host then reads the corresponding radar echo signal data parameters and the information of its several GPU device ends, and sets the number of host threads; next, each GPU device end reads from its CPU host the satellite-borne SAR echo data partitioned along the azimuth direction, using the memory allocated to it, and receives the complex echo of the radar echo signal; finally, range pulse compression of the complex echoes and BP imaging are completed on the GPU device ends, the BP images from all GPU device ends are written into a radar echo data processing file, and the imaging result is analyzed.

Description

Multi-server multi-GPU-based satellite-borne SAR imaging real-time processing method
Technical Field
The invention belongs to the field of radar signal processing, and particularly relates to a multi-server multi-GPU-based high-resolution real-time imaging processing method for satellite-borne synthetic aperture radar (SAR).
Background
Satellite-borne synthetic aperture radar (SAR) is characterized by a high operating orbit, wide area coverage and similar advantages, and has broad application prospects in natural disaster monitoring and forecasting. Conventional SAR imaging must pass through several links, namely on-board storage, satellite-to-ground data transmission, and ground reception and processing, and therefore suffers from long delays and slow response, making it difficult to meet the rapid-response requirements placed on a satellite system when major events are detected. Shortening the imaging time of satellite-borne SAR therefore improves the timeliness of its applications.
The radar back-projection (BP) imaging algorithm performs inverse interpolation in the pulse-compressed echoes according to the time delay of the radar echo signal, and can reconstruct targets from radar echo data under arbitrary imaging geometries. The echo received by the radar is a group of pulses reflected from the imaging area, containing scene information such as target positions and reflection characteristics. The BP algorithm establishes, through the slant range, a mapping between the radar echo signal and the imaging scene: an imaging grid is laid directly over the target area, and the contribution of the current echo pulse to each grid point is looked up in reverse, using the distance between the grid point and the radar as the link. One back-projection operation yields one description, from a single echo pulse, of the scene information at the grid points. By traversing all azimuth pulses and coherently accumulating the projection results of the different echo pulses, the target scene is reconstructed and image focusing is completed. Because the BP algorithm back-projects pulse by pulse and grid point by grid point via the slant range, distortion-free focusing of the radar echo data can be achieved as long as the instantaneous slant range from the radar antenna phase center to a target grid point P and the coordinates of the imaging grid points are obtained accurately. The BP algorithm is therefore suitable not only for arbitrary flight paths, arbitrary imaging configurations and other imaging scenarios, but can also achieve three-dimensional imaging when combined with scene elevation information.
The serial imaging flow of the BP algorithm is shown in FIG. 1, a flow diagram of SAR radar echo data processing completed serially on the CPU host. The basic serial BP imaging flow consists of reading the radar echo data, range pulse compression, calculating the parameters required for BP imaging, upsampling the echo pulses by interpolation, performing BP imaging, and outputting the processed imaging result. However, the computational load of the BP algorithm grows with the imaging scale: when the synthetic aperture contains N pulses and the imaging grid has N×N points, the computational complexity of the back-projection algorithm approaches O(N³). A serial implementation of the BP algorithm is therefore ill-suited to scenarios with large data volumes and strict real-time requirements.
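As a concrete illustration (not the patent's code), the serial pulse-by-pulse, grid-point-by-grid-point accumulation described above can be sketched on the CPU as follows; the flat geometry, grid parameters and nearest-bin lookup (in place of interpolation) are simplifications chosen for brevity:

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Serial back-projection sketch. For every azimuth pulse, every grid point
// picks the pulse-compressed sample at its slant-range delay, compensates the
// two-way propagation phase, and accumulates coherently. The triple loop over
// pulses and the N x N grid is what gives the O(N^3) complexity noted above.
std::vector<std::complex<double>> backproject_serial(
    const std::vector<std::vector<std::complex<double>>>& pulses, // [Na][Nr] pulse-compressed echoes
    const std::vector<double>& radar_x, // radar azimuth position per pulse
    double grid_x0, double grid_y0,     // grid origin
    double dx, int nx, int ny,          // grid spacing and size
    double range_bin, double lambda)    // range bin spacing, wavelength
{
    const double pi = std::acos(-1.0);
    std::vector<std::complex<double>> img((std::size_t)nx * ny, {0.0, 0.0});
    for (std::size_t p = 0; p < pulses.size(); ++p) {       // traverse all azimuth pulses
        for (int iy = 0; iy < ny; ++iy) {
            for (int ix = 0; ix < nx; ++ix) {
                double gx = grid_x0 + ix * dx, gy = grid_y0 + iy * dx;
                double R = std::hypot(gx - radar_x[p], gy); // instantaneous slant range
                int bin = (int)std::lround(R / range_bin);
                if (bin < 0 || bin >= (int)pulses[p].size()) continue;
                // compensate the two-way phase 4*pi*R/lambda, then accumulate
                img[(std::size_t)iy * nx + ix] +=
                    pulses[p][bin] * std::polar(1.0, 4.0 * pi * R / lambda);
            }
        }
    }
    return img;
}
```

With synthetic pulse-compressed echoes whose phase matches exp(-j4πR/λ), the coherent sum at the true target cell grows to the number of pulses, which is the focusing property the text describes.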
Meanwhile, conventional satellite-borne SAR processing is often realized in hardware designs based on multiple FPGAs (Field Programmable Gate Arrays) or DSPs (Digital Signal Processors). Owing to the limited number of computing units in FPGAs and DSPs, satellite-borne SAR systems built on these architectures also suffer from long radar echo signal processing times and have difficulty meeting engineering requirements.
Therefore, reducing the computational load of the back-projection imaging BP algorithm and increasing the imaging speed has become an important research direction for time-domain algorithms.
Compared with FPGAs and DSPs, the GPU (Graphics Processing Unit) has far more computing units and is better suited to very large-scale parallel computing. For the parts of radar echo signal data processing that can be parallelized, using multiple GPUs can shorten the data processing time by a factor of a thousand or more, solving the problem that radar echo signal processing is too slow to meet engineering requirements.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a multi-server multi-GPU-based satellite-borne SAR imaging real-time processing method. The method achieves highly parallel BP processing of satellite-borne SAR echo data through password-free transmission among multiple CPU hosts, control of multiple GPU device ends by each CPU host, and parallel-structure design of the data computation within the GPU device ends. Moreover, each CPU host can exploit the storage space of its GPU device ends to the greatest extent according to their memory sizes, so that the radar echo signal data are processed at the best achievable speed.
Herein, a server is referred to as a CPU host (or server CPU host), and a GPU as a GPU device end.
In order to achieve the purpose, the invention is realized by adopting the following technical scheme.
A multi-server multi-GPU based satellite-borne SAR imaging real-time processing method comprises the following steps:
Step 1, reading the satellite-borne SAR (synthetic aperture radar) echo data on any one server CPU (central processing unit) host, and dividing the radar echo signal data among a plurality of server CPU hosts using password-free communication between them;
Step 2, reading, on each server CPU host, the corresponding radar echo signal data parameters and the information of the corresponding GPU device ends, and setting the number of server CPU host threads;
Step 3, each GPU device end reading, from its corresponding CPU host, the satellite-borne SAR echo data partitioned for it along the azimuth direction, into the memory allocated on the GPU device end;
Step 4, each GPU device end receiving the complex echo of the radar echo signal transmitted by its corresponding server CPU host;
Step 5, each GPU device end performing range pulse compression of the complex echoes;
Step 6, performing BP imaging on the GPU device ends controlled by each server CPU host;
Step 7, collecting the BP images on all GPU device ends corresponding to all server CPU hosts and writing them into a radar echo data processing file, reading and plotting the processed radar echo data with MATLAB, and analyzing the imaging result.
Compared with the prior art, the invention has the following advantages:
First, aiming at the memory limitation encountered when a single CPU host and a single GPU device end process satellite-borne SAR echo data, the invention uses multiple CPU hosts and multiple GPU device ends, successfully solving the shortage of memory space caused by the large volume of the processed SAR echo data.
Second, operations with no data dependency and large loop bodies are selected for parallel BP computation on the GPU device ends, reducing the simulation time; operations whose data dependencies make GPU acceleration unsuitable are instead accelerated with multithreading on the multiple CPU hosts, further reducing the simulation time.
Third, the invention analyzes the whole BP imaging flow and the GPU device end memory usage of each data processing task, determines the variables whose GPU-side memory size is adjustable, and dynamically manages the GPU device end memory through the CPU host, so that the utilization of GPU memory resources is maximized.
Fourth, the invention partitions the GPU-processed computation flow of the back-projection imaging algorithm into tasks, hides the memory-access latency of each GPU device end through asynchronous processing between device ends, and runs tasks of the computation flow without data dependencies in parallel on the GPU device ends, thereby reducing the running time.
Fifth, the invention divides the processed radar echo signal data along the azimuth direction and cyclically calls GPU kernel functions to traverse all data fields of the radar echo signal, namely: 1) the processed radar echo signal data obtained by GPU device end G1 in one cycle are transmitted to another GPU device end G2 for reprocessing, and the result is used to compute the parameters required for BP imaging; 2) the reprocessing time of G2 is hidden inside the time G1 spends processing the radar echo signal in its next cycle; 3) a third GPU device end G3 waits until G1 and G2 have completed all cycles and the radar echo data have been processed before carrying out the next imaging operation, which effectively reduces the simulation time.
Drawings
The invention is described in further detail below with reference to the figures and specific embodiments.
FIG. 1 is a schematic flow chart of SAR radar echo data processing performed in series by a CPU host;
FIG. 2 is an overall connection diagram of a multi-server CPU multi-GPU based satellite-borne SAR imaging real-time processing method;
FIG. 3 is a schematic flowchart of a heterogeneous platform BP imaging based on multiple server CPUs and multiple GPUs;
FIG. 4 is a schematic diagram of an OpenMP working model;
FIG. 5 is a schematic diagram of azimuth block division of radar echo signal data according to data size;
FIG. 6 is a flow chart showing the detailed operation of BP imaging;
FIG. 7 is a schematic diagram of a geometric model of data acquired by a strip SAR;
FIG. 8 is a diagram illustrating a relationship between a GPU memory and a thread;
FIG. 9(a) is a schematic diagram of the AoS and SoA data structure organization modes, corresponding to range-priority and azimuth-priority storage;
FIG. 9(b) is a schematic diagram of the thread assignment for Kernel functions Kernel1-Kernel 8;
FIG. 9(c) is a schematic diagram of the thread assignment of Kernel function Kernel 9;
FIG. 10 is a diagram of measured-data results obtained without adopting the multi-server multi-GPU based spaceborne SAR imaging real-time processing method of the present invention;
FIG. 11(a) shows the radar echo signal obtained by the spaceborne SAR of the present invention;
FIG. 11(b) is the image obtained by processing the radar echo signal in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 2, the overall connection diagram of the multi-server multi-GPU based spaceborne SAR imaging real-time processing method, the method is divided into three modules: a password-free inter-host communication module, a satellite-borne SAR data preprocessing module, and a GPU device end BP imaging module, with the following functions:
1) Password-free communication module between multiple CPU hosts: this module realizes password-free communication between the CPU hosts through MPI.
2) Satellite-borne SAR data preprocessing module: this module mainly comprises the CPU host reading the radar echo signal data parameters and the GPU device end information, the GPU device ends reading the radar echo signal data from the CPU host, the GPU device ends generating the complex echoes, and the range pulse compression function on the GPU device ends.
3) GPU device end BP imaging module: the BP imaging module runs entirely on the GPU device ends and mainly comprises: (1) constructing a coordinate system for the radar and the target, i.e. computing the radar and target positions so that the projection between target and radar can be realized; (2) selecting the effective imaging area of BP imaging so as to subsequently determine the echo coherent accumulation area; (3) dividing the ground grid, performing the same projection and accumulation operations on all grid points to achieve image focusing; (4) dividing the effective BP imaging area, including the selection of the initial effective area, the radar-position effective area and the effective aperture; (5) performing sinc interpolation to determine the contribution of each echo pulse to every grid point; (6) performing range phase compensation to realize BP imaging.
In the invention, one CPU host can control several GPU device ends and can read the information of all of them; the number of GPU device ends a given CPU host controls depends on the number of interfaces on that host.
According to the memory allocated to it, each GPU device end reads, in azimuth blocks, the radar echo signal data (stored range-first) of its corresponding data block size and the parameter information of the BP imaging algorithm from its CPU host; it reads the simulation parameters into its constant memory area and the adaptively partitioned radar echo signal data into its global memory. Through the multiple GPU device ends controlled by the multiple CPU hosts, range pulse compression, the calculation of the parameters required for BP imaging, and the upsampling of the echo pulses are completed on every GPU device end; the CPU hosts distribute the parameter calculation results of the BP imaging algorithm and the upsampled echo pulses to the different GPU device ends, which finally carry out the BP imaging.
To meet the demand for faster imaging while accounting for the large data volume of satellite-borne SAR processing, the multi-server multi-GPU based satellite-borne SAR imaging real-time processing method proceeds as follows (see FIG. 3, the flow diagram of heterogeneous-platform BP imaging based on multiple server CPUs and multiple GPUs):
step 1: the method comprises the steps of reading satellite-borne SAR (synthetic aperture radar) echo data on any one server CPU (central processing unit) host end, and dividing the radar echo signal data to a plurality of server CPU host ends by using secret-free communication among the plurality of server CPU host ends.
The specific steps are as follows: password-free communication between any one server CPU host and the other server CPU hosts is realized through MPI (Message Passing Interface). (1) Log in to each CPU host via the SSH (Secure Shell) network security protocol; (2) create the same radar echo signal data storage path on each server CPU host; (3) place the same execution configuration file in the same directory under that path on each server CPU host; (4) when the different server CPU hosts execute the same configuration file from step (3), start all the server CPU hosts simultaneously to obtain more server CPU host threads; (5) distribute the tasks of the BP imaging algorithm to the different server CPU host threads with OpenMP, achieving task-level parallelism.
OpenMP provides a high-level abstract description of parallel algorithms and is particularly suited to parallel program design on multi-core CPU machines. The compiler parallelizes the program automatically according to the pragma directives added to it, so using OpenMP lowers the difficulty and complexity of parallel programming. When the compiler does not support OpenMP, the program degrades into an ordinary serial program; OpenMP directives present in the program do not affect its normal compilation and execution.
FIG. 4 is a schematic diagram of the OpenMP working model, a multithreaded programming scheme for shared-memory parallel systems; the supported programming languages include C, C++ and Fortran. At the start there is only one master thread; when parallel computation is needed, several branch threads are forked to execute the parallel tasks. When the parallel code finishes executing, the branch threads join and control flow returns to the single master thread. OpenMP thus adopts a fork-join execution model.
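A minimal sketch of the fork-join model just described (a hypothetical example, not taken from the patent): the master thread forks branch threads for the loop and joins afterwards, and if the compiler ignores the pragma the code simply runs serially with the same result, exactly as noted above:

```cpp
#include <cstddef>
#include <vector>

// Fork-join sketch: the master thread forks branch threads for the parallel
// loop and joins afterwards. If the compiler does not support OpenMP, the
// pragma is ignored and the loop runs serially with the same result.
double parallel_sum(const std::vector<double>& v) {
    double total = 0.0;
    #pragma omp parallel for reduction(+ : total)
    for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t)v.size(); ++i)
        total += v[i];  // each branch thread accumulates a slice, combined by the reduction
    return total;
}
```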
Password-free communication between the server CPU hosts is configured through MPI, so that each server CPU host can invoke more threads to control the GPU device ends, and the data transmission efficiency between the server CPU hosts and the GPU device ends is improved.
Step 2: read, on each server CPU host, the corresponding radar echo signal data parameters and the information of the corresponding GPU device ends, and set the number of server CPU host threads.
The information each server CPU host reads for each of its GPU device ends includes: the graphics card model, the device compute capability, the total global memory, the upper and lower limits of the device's grid/block thread division, and the number of range samples of the satellite-borne SAR echo data to be processed (the echo data size is taken as Nr × Na, where Nr and Na denote the range and azimuth dimensions respectively and × denotes multiplication).
According to the GPU device end information it has read, each server CPU host creates as many host threads as there are GPU device ends, i.e. one server CPU host thread controls one GPU device end; OpenMP assigns the radar echo signal data processing tasks and the parameter calculation tasks of the BP imaging algorithm to different CPU host threads, achieving task-level parallelism.
Specifically, according to the performance requirement of the radar back projection imaging algorithm, related parameters of the radar back projection imaging BP algorithm are packaged into classes, and the imaging related parameters are changed according to the specific imaging requirement.
Step 3: each GPU device end reads from its corresponding server CPU host the radar echo signal data partitioned for it along the azimuth direction, into the memory allocated on the GPU device end.
Each GPU device end reads the radar echo signal data of its corresponding azimuth blocks from the server CPU host, together with the parameter information required by the back-projection imaging algorithm.
Following step 2, the server CPU host performs the azimuth-direction division of the satellite-borne SAR echo data and the device-side memory allocation on the basis of the acquired compute capability and total global memory of each GPU and the number of range samples of the echo data to be processed (again taken as Nr × Na, with Nr the range dimension and Na the azimuth dimension).
When the satellite-borne SAR echo data are divided, note that the subsequent range pulse compression requires a range-direction Fourier transform of the radar echo signal data, and this transform needs additionally allocated GPU device end memory: the data space actually processed on the GPU device end is temporarily expanded to three times the space originally allocated for the radar echo signal data. A GPU memory margin of corresponding size must therefore be reserved on the GPU device end when the CPU host performs the azimuth-direction partitioning of the radar echo signal data.
The specific operation is as follows: the server CPU host reads the information of each GPU device and, from the reported total global memory, sets the truly usable GPU memory to 0.3 times that total. The specific azimuth blocks are shown in FIG. 5, the azimuth block diagram of radar echo signal data according to data size. The echo data are two-dimensional, divided into a range direction (top to bottom in the figure) and an azimuth direction (left to right in the figure); in the invention, the incoming radar echo data are stored range-first. To avoid the time cost of strided memory reads, the radar echo data are partitioned along the azimuth direction into n blocks, denoted blk_num, where the exact number of blocks depends on the size of the radar echo data, the operations performed in the subsequent steps, and the size of the GPU memory.
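The choice of blk_num just described can be sketched as follows, assuming complex samples of a fixed byte size, the 0.3 usable-memory factor, and the threefold FFT workspace expansion from the text; all names and the ceiling rounding policy are illustrative, not the patent's symbols:

```cpp
#include <cstddef>

// Sketch of choosing the azimuth block count blk_num: usable GPU memory is
// 0.3 x the device's total global memory, and each range line is budgeted at
// 3x its size to cover the range-FFT workspace described in the text.
int choose_blk_num(std::size_t total_global_mem, int Nr, int Na,
                   std::size_t bytes_per_sample /* e.g. 8 for complex<float> */) {
    std::size_t usable = (std::size_t)(0.3 * (double)total_global_mem);
    std::size_t per_line = (std::size_t)Nr * bytes_per_sample * 3; // one azimuth line incl. FFT expansion
    std::size_t lines_per_block = usable / per_line;
    if (lines_per_block < 1) lines_per_block = 1;
    // ceil(Na / lines_per_block): enough blocks that each fits in usable memory
    return (int)(((std::size_t)Na + lines_per_block - 1) / lines_per_block);
}
```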
The CPU host starts cpu_trd_num host threads according to the number of GPU devices gpu_num; as shown in FIG. 5, with four GPU devices each host thread is responsible for n/4 blocks of radar echo data. Using the GPU device end information read in step 2, the CPU host adaptively completes the azimuth-direction blocking of the radar echo signal data according to the number of GPU device ends, assigns different blocks to different CPU host threads, and each CPU host thread controls the transfer of its blocks into a different GPU device end.
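The mapping of azimuth blocks to host threads (one per GPU device end) can be sketched like this; the contiguous assignment follows the n/4-blocks-per-thread example of FIG. 5, and the even-divisibility assumption is ours, for brevity:

```cpp
#include <vector>

// Contiguous mapping of azimuth blocks to host threads, one thread per GPU
// device end: with blk_num blocks and gpu_num devices, each thread handles
// blk_num/gpu_num consecutive blocks (cf. the gpu_num = 4 case of FIG. 5).
std::vector<int> block_to_device(int blk_num, int gpu_num) {
    std::vector<int> dev(blk_num);
    int per = blk_num / gpu_num;   // blocks per device-controlling host thread
    for (int b = 0; b < blk_num; ++b)
        dev[b] = b / per;          // block b is transferred by host thread dev[b]
    return dev;
}
```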
Step 4: each GPU device end receives the complex echo of the radar echo signal transmitted by its corresponding server CPU host.
Each server CPU host processes the received radar echo signal data according to its type; the radar echo signal data arrive at the server CPU host in one of two forms: with the real and imaginary parts transmitted separately, or as a complete complex signal.
(1) If the real and imaginary parts of the radar echo signal are transmitted separately, the server CPU host adaptively allocates memory on the GPU device end according to the size of the transmitted satellite-borne SAR echo signal data, transfers the real part and the imaginary part to the GPU device end respectively, and then splices them into the complex echo on the GPU device end using the kernel function GnrEcho; (2) if the radar echo signal data are already a complete complex echo signal, this step is skipped.
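A CPU-side sketch of what a splicing kernel such as GnrEcho would do (the function name and layout are ours; on the GPU device end, one thread would handle one sample index):

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Combine the separately transferred real and imaginary arrays into one
// complex echo buffer, mirroring the real/imaginary splicing described for
// the GnrEcho kernel.
std::vector<std::complex<float>> splice_echo(const std::vector<float>& re,
                                             const std::vector<float>& im) {
    std::vector<std::complex<float>> echo(re.size());
    for (std::size_t i = 0; i < re.size(); ++i)
        echo[i] = std::complex<float>(re[i], im[i]);
    return echo;
}
```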
Step 5: each GPU device end performs range pulse compression of the complex echoes.
The GPU device end calls the fast Fourier transform (FFT) functions of the NVIDIA CUDA (Compute Unified Device Architecture) library to perform a range-direction FFT on the complex echo data it has read, transforming the radar echo complex data into the range-frequency, azimuth-time domain; the radar echo signal data in the range frequency domain are then multiplied by a matched filter, realizing pulse compression of the radar echo signal.
The specific steps are as follows. Based on the radar echo signal data division of step 3, OpenMP is used on the server CPU host to realize multithreaded CPU control, and the several GPU device ends perform the related parallel computation under the server CPU host threads: (1) the CUDA library function cuFFT is called on the GPU device end to perform a range-direction FFT on the azimuth-blocked radar echo signal data read from the server CPU host, converting the echo signal into the range frequency domain in preparation for the subsequent pulse compression; (2) according to the radar simulation parameters, matched filtering is applied to the echo data in the azimuth-time, range-frequency domain, i.e. a matched filter sample is generated for each range index nrn (the index along the range dimension Nr) by the formula exp(jπf²/(Bp/Tp)) with f = nrn/Nr · fr, where f is the range frequency of bin nrn, Bp is the signal bandwidth, Tp is the pulse duration, and fr is the range sampling frequency; (3) finally, the radar echo data from operation (1) are complex-multiplied by the matched filter to complete the range pulse compression. The range pulse compression of the radar echo signal data is realized on the GPU device end by the kernel function Pul_press.
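The matched-filter generation and frequency-domain multiplication of steps (2) and (3) can be sketched on the CPU as follows. This is our reading of the formula in step (2), with Kr = Bp/Tp as the implied chirp rate; the baseband frequency mapping f = nrn/Nr · fr is a simplification (a real implementation would wrap or fftshift the frequency axis):

```cpp
#include <cmath>
#include <complex>
#include <vector>

// CPU-side sketch of the range matched filtering: build the filter sample
// exp(j*pi*f^2/Kr) for each range bin and multiply the range spectrum by it,
// mirroring what the Pul_press kernel would do per thread on the device.
std::vector<std::complex<double>> range_pulse_compress(
    const std::vector<std::complex<double>>& spectrum, // range FFT of one echo line
    double Bp, double Tp, double fr)
{
    const double pi = std::acos(-1.0);
    const double Kr = Bp / Tp;                  // chirp rate
    const int Nr = (int)spectrum.size();
    std::vector<std::complex<double>> out(Nr);
    for (int nrn = 0; nrn < Nr; ++nrn) {
        double f = (double)nrn / Nr * fr;       // frequency of range bin nrn
        out[nrn] = spectrum[nrn] * std::polar(1.0, pi * f * f / Kr); // complex multiply
    }
    return out;
}
```

Because the filter is a pure phase term, the spectrum's magnitude is unchanged; only the phase is matched to the transmitted chirp.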
Step 6: and BP imaging is realized on a GPU device side controlled by each server CPU host side.
The calculation of the parameters required for BP imaging is performed on the GPU device ends controlled by the CPU host end of each server; fig. 6 is a schematic diagram of the specific operation flow of BP imaging.
The BP imaging process is completed at the GPU device end: (1) first a coordinate system for the radar and the target is constructed, i.e. the radar and target positions are calculated, so that the projection between target and radar can be realized; (2) the effective imaging area of BP imaging is selected so that the echo coherent-accumulation area can be determined subsequently; (3) the ground grid is divided, and the same projection and accumulation operations are executed on all grid points to realize image focusing; (4) the effective area of the radar position is selected; (5) the effective aperture is selected; (6) the contribution of the echo pulses to the grid points is calculated by sinc interpolation, the interpolation realizing the upsampling of the echo pulses before BP imaging; (7) the parameter calculation results of the server CPU host end and the upsampled echo pulses are distributed to the different GPU device ends, and BP imaging is realized by the GPU device ends repeatedly executing projection and accumulation.
Specifically, step 6.1, BP imaging parameters and settings.
The three-dimensional position of the radar in the Cartesian coordinate system is shown in fig. 7, a schematic diagram of the geometric model with which the strip-map SAR radar acquires data; the Kernel3 kernel that calculates the three-dimensional radar position information appears in fig. 6, the schematic diagram of the specific operation flow of BP imaging.
When three-dimensional coordinates are established for the radar position, there are two data organization modes, and the choice affects kernel performance differently; fig. 9(a) shows the two organizations, AoS and SoA, and the corresponding range-priority and azimuth-priority storage modes. The first is the array of structures (AoS), stored azimuth-first over the radar echo data; the second is the structure of arrays (SoA), stored range-first over the radar echo data. The AoS organization causes non-aligned, non-contiguous memory accesses when the x, y and z arrays of the radar coordinate positions are traversed, which seriously degrades memory read/write speed and reduces kernel performance. The SoA organization effectively avoids this drawback of AoS: the radar position coordinates can be stored while performing aligned, contiguous memory accesses, making the kernel more efficient.
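To make the AoS/SoA distinction concrete, here is a small Python illustration (hypothetical coordinate values; Python lists model the two memory layouts) showing why reading only the x coordinates is a strided gather under AoS but a unit-stride access under SoA:

```python
# Three radar positions (x, y, z), hypothetical values.
positions = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 9.0)]

# AoS: one interleaved buffer x0 y0 z0 x1 y1 z1 ...; reading only the
# x coordinates requires stride-3 gathers (uncoalesced on a GPU).
aos = [c for p in positions for c in p]
aos_x = aos[0::3]

# SoA: one contiguous buffer per coordinate; reading all x values is a
# unit-stride, aligned (coalescable) access.
soa_x = [p[0] for p in positions]
soa_y = [p[1] for p in positions]
soa_z = [p[2] for p in positions]
```

Both layouts hold the same data; only the access pattern to each coordinate array differs.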
Fig. 9(b) is a schematic diagram of the thread allocation for the kernel functions Kernel1-Kernel8. The first index is the horizontal thread distribution, representing the target point along the range direction of the target scene; it is denoted idx and computed as idx = blockIdx.x * blockDim.x + threadIdx.x, where blockIdx.x is the index of the thread block in fig. 9(b), blockDim.x is the corresponding dimension of each thread block, threadIdx.x is the thread index within each block, and the number of threads is a multiple of 16. The second is the vertical thread distribution, representing the target point of the azimuth block of the target scene; it is denoted idy and computed as idy = blockIdx.y * blockDim.y + threadIdx.y, with blockIdx.y, blockDim.y and threadIdx.y defined analogously (fig. 9(c) is the corresponding thread-allocation schematic for the kernel function Kernel9), the number of threads again being a multiple of 16. The third determines, from the azimuth start coordinate of each block after blocking, the offset of each target within the whole radar echo data; it is denoted Echo_offset and computed as Echo_offset = Na_offset * len[i], where len[i] is the length of the i-th block of the radar signal echo target area and Na_offset is the azimuth offset of that block, the offsets of the different blocks being obtained by accumulating len[i].
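To make the index arithmetic concrete, the following Python emulation (sizes hypothetical) enumerates the (idx, idy) pairs a grid of 512 × 1-thread blocks would produce, including the in-kernel bounds guard that discards the overhang when Nr is not a multiple of the block size:

```python
import math

def launch_indices(Nr, Na, block_dim_x=512):
    # Emulates a ceil(Nr/block_dim_x) x Na grid of 512x1 blocks:
    # idx = blockIdx.x*blockDim.x + threadIdx.x covers the range
    # dimension, idy = blockIdx.y the azimuth dimension.
    grid_x = math.ceil(Nr / block_dim_x)
    for block_idx_y in range(Na):
        for block_idx_x in range(grid_x):
            for thread_idx_x in range(block_dim_x):
                idx = block_idx_x * block_dim_x + thread_idx_x
                idy = block_idx_y
                if idx < Nr:          # bounds guard for the last block
                    yield idx, idy
```

Enumerating the generator shows every (range, azimuth) sample is visited exactly once, which is the property the thread mapping has to guarantee.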
The three-dimensional coordinates of the radar target points are determined by the thread indices, and the calculation results are stored row by row in an array already allocated in GPU video memory. When the kernel functions are called, the row-wise access forms aligned, contiguous memory accesses and reduces strided accesses, so kernels that process range-first-stored radar echo data have good kernel performance. The start coordinate of each azimuth block of radar echo data is likewise determined by the thread index.
And packaging the related parameters of the specific BP imaging into classes in a program, and changing the related parameters of the imaging according to the specific imaging requirements.
And 6.2, finishing the BP imaging.
The parameters required for range pulse compression and BP imaging of the spaceborne SAR radar echo data are analyzed, upsampling of the echo pulses is realized by interpolation, and the data dependencies before and after each BP imaging processing step are determined. Operation flows that have data dependencies and large for-loop bodies are accelerated with OpenMP multithreading at the CPU host end; conversely, operation flows without data dependencies and with larger loop sizes are accelerated with CUDA at the GPU device end.
The common interpolation modes include linear interpolation after upsampling, spline interpolation, sinc interpolation and the like, and the appropriate interpolation mode can be selected by comprehensively considering the imaging quality and the real-time requirement. In the present invention, sinc interpolation is selected.
And (3) up-sampling of echo pulses is realized by utilizing sinc interpolation before BP imaging. And the CPU host end distributes the calculation results of each parameter and the echo pulse after the up-sampling to different GPU equipment ends, and the GPU equipment ends repeatedly execute projection and accumulation to realize BP imaging.
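A minimal Python sketch of truncated sinc interpolation and the upsampling built on it (tap count and test signal are hypothetical; the GPU version in the text performs the same arithmetic per thread):

```python
import math

def sinc(x):
    return 1.0 if x == 0.0 else math.sin(math.pi * x) / (math.pi * x)

def sinc_interp(samples, t, taps=16):
    # Truncated sinc kernel over the `taps` samples nearest to the
    # fractional position t; at integer t it reduces to samples[t].
    k0 = int(math.floor(t)) - taps // 2 + 1
    acc = 0.0
    for k in range(k0, k0 + taps):
        if 0 <= k < len(samples):
            acc += samples[k] * sinc(t - k)
    return acc

def upsample(samples, factor, taps=16):
    # Evaluate the interpolant on a grid `factor` times denser.
    n_out = (len(samples) - 1) * factor + 1
    return [sinc_interp(samples, i / factor, taps) for i in range(n_out)]
```

More taps trade computation for accuracy, which is exactly the imaging-quality versus real-time trade-off mentioned above.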
In addition, aiming at the BP imaging and the operation without data dependency in the parameter calculation process before the BP imaging, the operation is realized by the following kernel functions at the GPU equipment end;
kernel 1: GnrEcho: if the satellite-borne SAR radar echo data is in a form of separating a real part and an imaginary part, the real part and the imaginary part of the satellite-borne SAR radar echo data are combined through the kernel function;
kernel 2: pul _ Press: multiplying the radar echo data after the range-direction Fourier transform by the matched filter to realize range pulse compression of the radar echo data;
kernel 3: InitSensorPos: calculating radar position information, and constructing a coordinate system of the radar and the target;
kernel4, Init _ Sig _ stav: selecting an initial position and a radar position effective area;
kernel5, Kernel 6: init _ Grdx and Init _ Grd _ dlt _ y: dividing a ground grid;
kernel7, Kernel 8: init _ Senpos and Init _ AzSE: selecting an effective aperture;
kernel 9: ProcessPoint: completing sinc interpolation of echo data to realize up-sampling; completing the calculation of the center phase of the radar antenna and the instantaneous target slope distance; meanwhile, distance phase compensation is completed; and finally BP imaging is completed.
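A hedged, CPU-side Python sketch of the projection-and-accumulation loop that Kernel9 repeats per grid point: for every pulse, compute the instantaneous slant range, sample the range-compressed echo at the corresponding bin (nearest-neighbour here, sinc-interpolated in the text), compensate the range phase and accumulate coherently. Geometry, c = 1 units and all numbers are hypothetical.

```python
import cmath, math

def backproject(echo, sensor_pos, grid_pt, fr, fc, c=1.0):
    # echo[na][nr]: range-compressed pulses; sensor_pos[na]: (x, y, z).
    acc = 0j
    for na, pos in enumerate(sensor_pos):
        r = math.dist(pos, grid_pt)             # instantaneous slant range
        nr = round(2 * r / c * fr)              # round-trip delay -> bin
        if 0 <= nr < len(echo[na]):
            # Range phase compensation, then coherent accumulation.
            acc += echo[na][nr] * cmath.exp(4j * math.pi * fc * r / c)
    return acc

# Point-target check in normalized units: echoes carrying the conjugate
# phase accumulate coherently only at the true target position.
fr, fc = 100.0, 10.0
sensor_pos = [(float(x), 0.0, 5.0) for x in range(10)]
target = (5.0, 3.0, 0.0)
echo = []
for pos in sensor_pos:
    r = math.dist(pos, target)
    line = [0j] * 2000
    line[round(2 * r * fr)] = cmath.exp(-4j * math.pi * fc * r)
    echo.append(line)
img = backproject(echo, sensor_pos, target, fr, fc)
```

At the target every pulse contributes exactly 1, so the accumulated magnitude equals the number of pulses; away from it, the contributions miss the stored bins or decorrelate.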
As shown in fig. 3, the flow diagram of heterogeneous-platform BP imaging based on multiple CPUs and multiple GPUs, and fig. 6, the specific operation flow diagram of BP imaging, the BP imaging stage is divided into tasks and the data dependencies are analyzed: since the data processed by Kernel1, Kernel2, Kernel3 and Kernel4 have data dependencies, they are distributed to the GPU device end and the kernel functions are run sequentially, one by one;
the input data of the Kernel5 and the Kernel6 are related to data processed by four Kernel functions of Kernel1, Kernel2, Kernel3 and Kernel4, and after the four Kernel functions of Kernel1, Kernel2, Kernel3 and Kernel4 are operated, the operation result is placed on the GPU device side: kernel5 and Kernel6 need to wait for the Kernel1-Kernel4 operations to complete before performing the operations, and it is considered that blocking is added before the Kernel5 Kernel is executed;
the Kernel7 and the Kernel8 need to wait for the Kernel3 to finish and then perform operation, and the Kernel3, the Kernel7 and the Kernel8 can be placed under the same working stream, so that the time consumption of data transmission of the GPU device end and the CPU host end is reduced as much as possible;
after the Kernel function operation is completed, the BP imaging is completed by using Kernel 9. In consideration of the consumption of the memory and the overlapped calculation among the GPU equipment ends, the data migration which is preferentially stored based on the azimuth direction can be calculated, including the migration of the rows and the columns, the parallel processing of multiple GPUs is realized, and the algorithm efficiency is further improved.
Step 7: the BP images on all GPU device ends corresponding to each server CPU host end are collected and written into a radar echo data processing file; the processed radar echo data are then read with MATLAB, plotted, and the imaging result is analyzed.
The effect of the present invention will be further explained with the simulation experiment.
(1) Measured-data simulation without adopting the multi-server multi-GPU-based spaceborne SAR imaging real-time processing method of the present invention.
Simulation experiment parameters: number of pulse echo points 12288; pulse echo period 2.5e-6 s; pulse echo bandwidth 2.997e+9; pulse echo frequency 3.5975e+5; radar speed 7400 m/s.
The simulation steps are as follows:
1) reading echo data at the host end;
2) range-direction pulse compression at the host end;
3) calculating the BP imaging parameters at the host end: constructing the coordinate system for the radar and the target, selecting the effective imaging area of BP imaging, and dividing the ground grid, where the selection of the effective imaging area of BP imaging includes the selection of the effective aperture and of the effective areas of the initial position and the radar position;
4) BP imaging at the host end.
As shown in fig. 1, the schematic flow diagram of serially processing SAR echo data at the CPU host end, when the radar echo signal data are processed serially at the CPU host end: 1) the simulation parameters are read first; the simulation parameters required to form the SAR echo are encapsulated into classes and called directly from header files included in the program; 2) the CPU host end reads the echo data from the spaceborne SAR echo data file, selects the required imaging area from the echo data as the effective imaging area, and determines the number of target points according to the specific imaging requirements; 3) after the computation of all target points in the azimuth and range directions is finished, range pulse compression is realized; 4) the echo data are brought into the range-frequency, azimuth-time domain together with the simulation parameters; 5) the real-time range of each target point over the radar operating time is then computed from the simulation parameters, whether the target point lies exactly within the irradiation range of the radar antenna transmit beam is judged from that real-time range, the echoes of the target points within the beam irradiation range are computed, range phase compensation is performed, and the phase-compensated echo data of the target points are accumulated in the azimuth direction.
The simulation data result is shown in fig. 10, the simulation diagram of the measured data without adopting the multi-server multi-GPU-based spaceborne SAR imaging real-time processing method of the present invention.
(2) Measured-data simulation with the multi-server multi-GPU-based spaceborne SAR imaging real-time processing method of the present invention.
The parameter settings in this simulation are the same as those in simulation 1. The simulation steps are as follows:
1) reading echo data at the CPU host end;
2) range-direction pulse compression at the GPU device end;
3) BP imaging parameter calculation at the GPU device end: constructing the coordinate system for the radar and the target, selecting the effective imaging area of BP imaging, and dividing the ground grid, where the selection of the effective imaging area of BP imaging includes the selection of the effective aperture and of the effective areas of the initial position and the radar position;
4) BP imaging is completed at the GPU device end. The specific steps are as follows: (1) the device-end information is read at the host end, adaptive blocking of the data is completed at the host end, the kernel function GnrEcho is called, and all blocks are traversed in a loop whose iteration count is determined by the data size; (2) after the data are read, the kernel function pul_press is called to complete the range pulse compression, and an asynchronous barrier is set to guarantee that all echo data have finished range pulse compression before the next processing step; (3) on the basis of the data blocks that have completed range pulse compression, the kernel function InitSensorPos is called in turn to construct the coordinate system of the radar and the target; (4) the kernel function Init_Sig_StaV completes the selection of the initial position and the effective area of the radar position; (5) the division of the ground grid is completed by the kernel functions Init_Grdx and Init_Grd_Slt_y; (6) the construction of the imaging coordinate system and the selection of the effective aperture are completed by the kernel functions Init_Senpos and Init_AzSE; (7) the optimal parallelization mode is selected according to the GPU conditions at the device end, the data size and the parameter calculation, reducing the time consumed by data transfer and achieving the best processing result at the BP imaging device end.
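The adaptive blocking of step (1) above can be sketched as follows — a hypothetical host-side helper (names and the scratch-memory `overhead` factor are assumptions, not the patent's code) that splits the Na azimuth lines into the fewest equal-sized blocks whose footprint fits the free memory reported for a GPU:

```python
import math

def azimuth_blocks(Na, Nr, bytes_per_sample, free_mem, overhead=2.0):
    # Per azimuth line a block needs the echo samples plus FFT /
    # interpolation scratch, modeled here by the `overhead` factor.
    per_line = Nr * bytes_per_sample * overhead
    max_lines = max(1, int(free_mem // per_line))
    n_blocks = math.ceil(Na / max_lines)   # fewest blocks that fit
    lines = math.ceil(Na / n_blocks)       # near-equal block length
    # (start, length) per block; the last block may be shorter.
    return [(start, min(lines, Na - start)) for start in range(0, Na, lines)]

blocks = azimuth_blocks(12288, 4096, 8, 100e6)
```

The loop over all blocks in step (1) would then iterate over this list, one kernel launch per (start, length) slice.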
Specifically, a kernel-function-level introduction is given for the whole process, mainly covering the division into grids and blocks (TrdNum and BlkNum) used by the GPU device end in each kernel function, the meaning of each kernel function and the encapsulation of its interfaces, and the sizing of the GPU device-end memory.
There are 4 GPU device ends and each GPU has 32 GB of video memory, a configuration smaller than the volume of radar echo data to be processed. Moreover, the fast Fourier transform (FFT), phase compensation and interpolation calculations produce temporary auxiliary data that occupy additional GPU memory during computation, so the maximum utilization of the video memory must be considered, and the GPU device-end memory is allocated reasonably by using the GPU device-end information read by the CPU host end.
A thread configuration is chosen for each kernel function to realize thread-level parallelism, and the kernel functions are guaranteed good performance by following certain principles; the content of the in-kernel indices and the data organization mode are defined, and the internal calculation of each kernel function is optimized so that its performance is optimal. A grid/block/thread configuration is chosen for each kernel function following these principles to ensure good kernel performance:
(1) The lengths of the grid and block configured for a kernel function in the x, y and z dimensions must not exceed the limits of the GPU device; performance is generally better the closer the configuration is to those limits.
(2) The number of threads configured for each kernel should be an integral multiple of 32, and preferably greater than 4 × 32 threads, because the GPU processes threads on a hardware streaming multiprocessor (SM) in units of one thread bundle (warp). One warp consists of 32 threads, and placing four warps on the SM for parallel processing can effectively hide memory-access latency and improve kernel performance; the more warps resident on the SM, the better the latency hiding.
(3) The performance of each kernel is affected by its configuration, and multiple tests are required to obtain the optimal one: run the kernel function with different configurations and time it repeatedly with a profiling tool to find the configuration with the shortest run time.
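Principles (1) and (2) above can be sketched as a small helper (hypothetical, not the patent's code; 65535 is an assumed per-dimension grid limit) that rounds the block size up to whole warps with at least 4 warps, then sizes the grid to cover n elements:

```python
import math

def launch_config(n, threads=128, max_grid=65535):
    # Round the block size up to a multiple of the 32-thread warp and
    # to at least 4 warps, per principle (2).
    threads = max(4 * 32, (threads + 31) // 32 * 32)
    blocks = math.ceil(n / threads)
    if blocks > max_grid:                 # principle (1): device limit
        raise ValueError("grid dimension exceeds device limit")
    return blocks, threads
```

Per principle (3), the `threads` argument would then be swept over candidate values and the kernel timed for each.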
Next, the relationship between the memory distribution of the GPU device and grids, blocks and threads is shown in fig. 8, a schematic diagram of the relationship between GPU memory and threads. The numbers of thread grids, thread blocks and threads are set by declaring the kernel in advance. When a grid of thread blocks is launched, the thread blocks in the grid are distributed over the streaming multiprocessors (SMs); once a thread block is dispatched onto an SM, its threads are further divided into thread bundles (warps). Different thread blocks of one kernel may be scheduled onto different SMs, and the thread bundle is the basic execution unit within an SM: a thread bundle consists of a number of consecutive threads that all execute simultaneously. In the GPU memory hierarchy the two most important memories are global memory and shared memory; global memory resembles the system memory of a CPU, while shared memory resembles a CPU cache. The shared memory of the GPU can be controlled directly by a CUDA C kernel, i.e. declared inside the kernel function, whereas for global memory the data at the CPU host end are moved into GPU global memory by CUDA C data operation instructions, after which each thread of the kernel function accesses GPU global memory for parallel computation. The kernel functions, kernel-function variables and kernel-function allocation spaces used on the GPU are as follows. The Kernel1 kernel function is named GenEcho; its function is to combine the real and imaginary parts of the read radar echo signal data into a complete complex signal. The threads configured within a block are GenEchoThread(512, 1), and the intra-grid block configuration GenEchoBlocks [grid-size formula image in the original] creates a thread grid whose unit is the thread block, with 512 × 1 threads configured per block; the block configuration adjusts automatically to the thread configuration, as shown in fig. 9(b), the thread-allocation schematic of Kernel1-Kernel8. Thread indices idx and idy are established inside the kernel, corresponding to the range index and the azimuth index respectively. According to the kernel configuration, idx is determined by the thread x-offset, the block x-offset and the block size, and is computed as idx = blockIdx.x * blockDim.x + threadIdx.x, where blockIdx.x is the index of each thread block within GenEchoBlocks.x, blockDim.x the corresponding dimension of each thread block, and threadIdx.x the thread index within each block, the number of threads being a multiple of 16; the index range is [0, Nr], the range index. idy is determined by the block sizes and computed as idy = blockIdx.y, where blockIdx.y is the index of each thread block within GenEchoBlocks.y, blockDim.y the corresponding dimension, and threadIdx.y the thread index within each block, also a multiple of 16; the index range is [0, Na], the azimuth index. The variable address spaces in GPU video memory are traversed with these indices, and the elements of the arrays allocated in GPU video memory are assigned and computed.
The array variables allocated for the Kernel1 kernel function are d_Recho, d_Iecho, d_Echo and Nr, where Recho is the real data part, Iecho the imaginary data part, Echo the complex data, and Nr the range-direction length of the stored data; the allocated GPU video-memory spaces are sizeof(float) * Na * Nr and sizeof(c_float) * Na * Nr, where c_float is the complex type with float real and imaginary parts defined in cuda_complex.cuh.
The Kernel2 kernel function is named pul_press; the threads configured within a block are pul_pressThread(512, 1), and the intra-grid block configuration pul_pressBlocks [grid-size formula image in the original] creates a thread grid whose unit is the thread block, with 512 × 1 threads configured per block; the block configuration adjusts automatically to the thread configuration, as shown in fig. 9(b), the thread-allocation schematic of Kernel1-Kernel8. Thread indices idx and idy are established inside the kernel, corresponding to the range index and the azimuth index respectively. According to the kernel configuration, idx is determined by the thread x-offset, the block x-offset and the block size, and is computed as idx = blockIdx.x * blockDim.x + threadIdx.x over pul_pressBlocks.x, the number of threads being a multiple of 16 and the index range [0, Nr]; idy is computed as idy = blockIdx.y over pul_pressBlocks.y, with blockDim.y and threadIdx.y defined analogously, the number of threads being a multiple of 16 and the index range [0, Na]. The variable address spaces in GPU video memory are traversed with these indices, and the elements of the arrays allocated in GPU video memory are assigned and computed.
Its function is to complete the range pulse compression of the radar echo data: a temp_MatchFun is generated inside the kernel function Pul_press according to the matched-filter formula, with value exp(j·π·f²·Tp/Bp) where f = nrn/Nr·fr (reconstructed, as above, as the standard linear-FM matched filter); one matched filter is used for each range line, and the matched filter of each range line is complex-multiplied with the frequency-domain signal of that range line to complete the range pulse compression. The array variables allocated for this function are d_echo, Nr, Fr, Br and Tr, where echo is the echo data, Nr the range-direction length, Fr the radar pulse frequency and Br the radar pulse bandwidth; the allocated spaces are sizeof(c_float) * Na * Nr and sizeof(c_float) * Nr, where c_float denotes a complex float signal and Nr and Na denote the range and azimuth directions respectively. The matched filtering is applied along the range direction of the radar echo data using the matched-filter formula, completing the range pulse compression.
The Kernel3 kernel function is named InitSensorPos; the threads configured within a block are InitSensorPosThread(512, 1), and the intra-grid block configuration InitSensorPosBlocks [grid-size formula image in the original] creates a thread grid whose unit is the thread block, with 512 × 1 threads configured per block; the block configuration adjusts automatically to the thread configuration, as shown in fig. 9(b), the thread-allocation schematic of Kernel1-Kernel8. Thread indices idx and idy are established inside the kernel, corresponding to the range index and the azimuth index respectively. According to the kernel configuration, idx is determined by the thread x-offset, the block x-offset and the block size, and is computed as idx = blockIdx.x * blockDim.x + threadIdx.x over InitSensorPosBlocks.x, the number of threads being a multiple of 16 and the index range [0, Nr], the range index; idy is computed as idy = blockIdx.y over InitSensorPosBlocks.y, with blockDim.y and threadIdx.y defined analogously, the number of threads being a multiple of 16 and the index range [0, Na], the azimuth index. The variable address spaces in GPU video memory are traversed with these indices, and the elements of the arrays allocated in GPU video memory are assigned and computed.
Its function is to complete the construction of the coordinate system for BP imaging, as shown in fig. 6, the schematic diagram of the specific operation flow of BP imaging. The three-dimensional position of the radar in the Cartesian coordinate system is shown in fig. 7, the schematic diagram of the geometric model with which the strip-map SAR radar acquires data; the three-dimensional radar position information is computed inside the kernel. The meaning of each variable is: sensor_pos stores the coordinate system, Vr is the azimuth speed of the radar, H is the flight height of the radar, and PRF is the pulse repetition sampling frequency. The space allocated for sensor_pos is sizeof(double) * 3; the generated sensor_pos stores, in storage order, the instantaneous slant range, the imaging size of each azimuth position and the flight height H of the radar, and the construction of the coordinate system based on the target and the radar is completed by this kernel function.
The Kernel4 kernel function is named Init_Sig_StaV; the threads configured within a block are Init_Sig_StaVThread(512, 1), and the intra-grid block configuration Init_Sig_StaVBlocks [grid-size formula image in the original] creates a thread grid whose unit is the thread block, with 512 × 1 threads configured per block; the block configuration adjusts automatically to the thread configuration, as shown in fig. 9(b), the thread-allocation schematic of Kernel1-Kernel8. Thread indices idx and idy are established inside the kernel, corresponding to the range index and the azimuth index respectively. According to the kernel configuration, idx is determined by the thread x-offset, the block x-offset and the block size, and is computed as idx = blockIdx.x * blockDim.x + threadIdx.x over Init_Sig_StaVBlocks.x, the number of threads being a multiple of 16 and the index range [0, Nr], the range index; idy is computed as idy = blockIdx.y over Init_Sig_StaVBlocks.y, with blockDim.y and threadIdx.y defined analogously, the number of threads being a multiple of 16 and the index range [0, Na], the azimuth index. The variable address spaces in GPU video memory are traversed with these indices, and the elements of the arrays allocated in GPU video memory are assigned and computed.
Its function is to complete the module, shown in fig. 6 in the specific operation flow of BP imaging, that selects the effective area of BP imaging. The array variables allocated for the kernel function are Sig_staV, LenNa, Sig_start and Tr, where Sig_staV is the size of the azimuth imaging area with allocated space sizeof(double) * LenNa, LenNa is the selected effective azimuth length, Sig_start is the start index of azimuth imaging, and Tr is the radar pulse period; the selection of the effective azimuth imaging area is completed by this kernel function.
The Kernel5 kernel function is named Init_Grdx; the threads configured within a block are Init_GrdxThread(512, 1), and the intra-grid block configuration Init_GrdxBlocks [grid-size formula image in the original] creates a thread grid whose unit is the thread block, with 512 × 1 threads configured per block; the block configuration adjusts automatically to the thread configuration, as shown in fig. 9(b), the thread-allocation schematic of Kernel1-Kernel8. A thread index idx corresponding to the range index is established inside the kernel; according to the kernel configuration it is determined by the thread x-offset, the block x-offset and the block size, and is computed as idx = blockIdx.x * blockDim.x + threadIdx.x over Init_GrdxBlocks.x, the number of threads being a multiple of 16 and the index range [0, 2 * Hf_az]. The variable address spaces in GPU video memory are traversed with this index, and the elements of the arrays allocated in GPU video memory are assigned and computed.
This kernel implements the module of fig. 6 that partitions the ground grid in the azimuth (X) direction. The array variables passed to the kernel function are Grd_x, Cen_pos2, Hf_az and Pix_az. Grd_x is the created azimuth grid, allocated sizeof(double) * 2 * Hf_az bytes, where Hf_az is the number of azimuth grid points; Cen_pos2 is the azimuth distance from the grid centre point to the radar; Pix_az is the azimuth resolution.
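The launch-configuration and flat thread-index computation described for these kernels can be sketched as follows; Hf_az here is a hypothetical example value, and the out-of-bounds guard mentioned in the comment is the usual CUDA idiom rather than a detail stated in the patent.

```python
import math

def launch_config(n_elements, threads_per_block=512):
    # Grid sizing as described for Kernel5: 512 threads per block and
    # enough blocks to cover an index range of n_elements entries.
    blocks = math.ceil(n_elements / threads_per_block)
    return blocks, threads_per_block

def global_indices(blocks, threads_per_block):
    # Emulates idx = blockIdx.x * blockDim.x + threadIdx.x for every thread.
    return [b * threads_per_block + t
            for b in range(blocks)
            for t in range(threads_per_block)]

# Hypothetical azimuth grid of 2 * Hf_az points (Hf_az chosen for illustration).
Hf_az = 1000
blocks, tpb = launch_config(2 * Hf_az)
ids = global_indices(blocks, tpb)
# Threads whose idx >= 2 * Hf_az would simply return in the real kernel
# (the usual out-of-bounds guard), since the grid over-covers the array.
```

Note that with 512 threads per block the last block is partially idle whenever 2 × Hf_az is not a multiple of 512, which is why the guard matters.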
The Kernel6 kernel function is named Init_Grd_Slt_y; its per-block thread configuration is Init_Grd_Slt_yThread(512, 1), and the grid is configured with ceil(2 × Hf_rg / 512) blocks. That is, a thread grid of ceil(2 × Hf_rg / 512) thread blocks is created, each block holding 512 × 1 threads; the block count adjusts automatically to the thread configuration, as shown in fig. 9(b), the thread-allocation diagram for Kernel1-Kernel8. Inside the kernel a thread index idx is established, corresponding to the range grid index; under this configuration idx is determined by the thread offset in X, the block offset in X, and the block size, and is computed as idx = blockIdx.x * blockDim.x + threadIdx.x, where blockIdx.x is the index of the thread block within the grid, blockDim.x is the number of threads per block, and threadIdx.x is the thread index within its block; the number of warps (thread bundles) is a multiple of 16, and the index range is [0, 2 × Hf_rg]. The index is used to traverse the variable address space in GPU video memory, assigning and computing values for the elements of the arrays allocated there.
This kernel implements the module of fig. 6 that partitions the ground grid in the range (Y) direction. The array variables passed to the kernel function Init_Grd_Slt_y are Grd_Slt, Grd_y, Sqrt_temp, Hf_rg, Pix_rg and H, with the following meanings: Grd_Slt is the slant-plane grid and Grd_y the ground-range grid, each allocated sizeof(double) * 2 * Hf_rg bytes; Hf_rg is the number of range grid points, Pix_rg the range resolution, and H the radar height above the ground. The kernel functions Init_Grdx and Init_Grd_Slt_y together complete the division of the ground grid.
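A minimal sketch of such a grid-building kernel's per-element work is given below. The centred uniform-spacing formula is an illustrative assumption; the patent does not reproduce the exact expressions used inside Init_Grdx and Init_Grd_Slt_y.

```python
def make_centered_grid(n_half, pix, center=0.0):
    # Hypothetical ground-grid builder (cf. Grd_x / Grd_y): 2 * n_half
    # points spaced by the resolution 'pix' and centred on 'center'.
    # In the CUDA kernels each thread idx would compute one element:
    #   grd[idx] = center + (idx - n_half) * pix
    return [center + (i - n_half) * pix for i in range(2 * n_half)]

# Illustrative azimuth grid: 8 points at 1.5 m spacing around 10.0 m.
grd_x = make_centered_grid(n_half=4, pix=1.5, center=10.0)
```

Each array element depends only on its own index, which is what makes this step embarrassingly parallel on the GPU.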
The Kernel7 kernel function is named Init_SenPos; its per-block thread configuration is Init_SenPosThread(512, 1), and the grid is configured with ceil(2 × len / 512) blocks. That is, a thread grid of ceil(2 × len / 512) thread blocks is created, each block holding 512 × 1 threads; the block count adjusts automatically to the thread configuration, as shown in fig. 9(b), the thread-allocation diagram for Kernel1-Kernel8. Inside the kernel a thread index idx is established, corresponding to the azimuth position index; under this configuration idx is determined by the thread offset in X, the block offset in X, and the block size, and is computed as idx = blockIdx.x * blockDim.x + threadIdx.x, where blockIdx.x is the index of the thread block within the grid, blockDim.x is the number of threads per block, and threadIdx.x is the thread index within its block; the number of warps (thread bundles) is a multiple of 16, and the index range is [0, 2 × len]. The index is used to traverse the variable address space in GPU video memory, assigning and computing values for the elements of the arrays allocated there.
This kernel implements the module of fig. 6 that selects the effective area of the starting radar position. The array variables passed to the kernel function are SensPos, sen_sor_pos, StaEchoAz, len and Na; SensPos and sen_sor_pos are each allocated sizeof(double) * 3 * len bytes and hold the radar position at each scan instant, StaEchoAz is the effective azimuth start of the data, len is the selected effective azimuth length, and Na is the azimuth length of the echo data. The kernel function completes the generation of the effective imaging coordinate system.
The Kernel8 kernel function is named Init_AzSE; its per-block thread configuration is Init_AzSEThread(512, 1), and the grid is configured with ceil(2 × len / 512) blocks; the block count adjusts automatically to the thread configuration, as shown in fig. 9(b), the thread-allocation diagram for Kernel1-Kernel8. Inside the kernel a thread index idx is established, corresponding to the azimuth position index; under this configuration idx is determined by the thread offset in X, the block offset in X, and the block size, and is computed as idx = blockIdx.x * blockDim.x + threadIdx.x, where blockIdx.x is the index of the thread block within the grid, blockDim.x is the number of threads per block, and threadIdx.x is the thread index within its block; the number of warps (thread bundles) is a multiple of 16, and the index range is [0, 2 × len]. The index is used to traverse the variable address space in GPU video memory, assigning and computing values for the elements of the arrays allocated there.
This kernel implements the module of fig. 6 that selects the effective aperture. The array variables passed to the kernel function Init_AzSE are Apt_StaAz, Apt_EndAz, SensPos, Rc, Ant_Squint, La_hf and len: len is the selected effective azimuth length; Apt_StaAz and Apt_EndAz are the starting and ending apertures, respectively; SensPos holds the radar position at each scan instant and is allocated sizeof(double) * 3 * len bytes; Ant_Squint is the radar squint angle; La_hf is half the synthetic-aperture antenna length; Rc is the scene centre distance. The kernel functions Init_SenPos and Init_AzSE together complete the selection of the effective aperture.
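A hedged sketch of what such an aperture-selection step could compute from these parameters follows. The squint-shifted window model is an illustrative assumption, not the patent's exact computation inside Init_AzSE.

```python
import math

def aperture_window(target_az, Rc, squint_rad, La_hf):
    # Hypothetical effective-aperture selection (cf. Init_AzSE): the radar
    # positions assumed to contribute to a target lie within half an
    # aperture length La_hf of the squint-shifted beam-centre crossing.
    centre = target_az + Rc * math.tan(squint_rad)  # beam-centre crossing
    return centre - La_hf, centre + La_hf           # (Apt_StaAz, Apt_EndAz)

# Broadside geometry (zero squint): the aperture is centred on the target.
sta, end = aperture_window(target_az=0.0, Rc=800e3, squint_rad=0.0, La_hf=6.0)
```

With a non-zero squint angle the window shifts along-track by roughly Rc · tan(squint), which is why both Rc and Ant_Squint appear in the kernel's argument list.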
The final key step completes the GPU-device-side BP imaging part shown in the detailed operation flow diagram of fig. 6. The Kernel9 kernel function is named ProcessPoint; its per-block thread configuration is ProcessPointThread(512, 1, 1), and a three-dimensional thread grid is created with ceil(2 × len / 512) blocks along X (the Y and Z block counts, given as equation images in the original, are not reproduced here); each block holds 512 × 1 × 1 threads, and the block counts adjust automatically to the thread configuration, as shown in fig. 9(c), the thread-allocation diagram for Kernel9. Thread indexes idx, idy and idz are established inside the kernel, each determined by the thread offset, block offset, and block size of its dimension; they are computed as idx = blockIdx.x * blockDim.x + threadIdx.x, idy = blockIdx.y, idz = blockIdx.z + offset_Na, with idx ranging over [0, 2 × len], where len is the selected effective azimuth length. Here blockIdx.x, blockIdx.y and blockIdx.z are the block indexes of the grid's three dimensions, blockDim.x and blockDim.y are the corresponding block dimensions, and threadIdx.x and threadIdx.y are the thread indexes within a block. The indexes are used to traverse the variable address space in GPU video memory, assigning and computing values for the elements of the arrays allocated there.
The function of this kernel is to perform sinc interpolation (up-sampling) of the radar echo data, compute the instantaneous slant range between the radar and the target, apply the range phase compensation, and finally complete BP imaging.
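The per-pixel work just described can be sketched as follows; every parameter value is illustrative, and nearest-neighbour sampling stands in for the sinc interpolation performed by ProcessPoint. The phase term exp(+j·4π·fc·R/c) is the standard BP range-phase compensation for two-way propagation.

```python
import cmath
import math

C = 3e8  # speed of light, m/s

def bp_pixel(pulses, sensor_pos, fc, fs, r0, px, py, pz=0.0):
    # Back-projection of one ground pixel, mirroring the per-thread work of
    # ProcessPoint: for every pulse, compute the instantaneous slant range,
    # pick the range-compressed sample at the matching delay, apply the
    # phase compensation exp(+j*4*pi*fc*R/c), and accumulate coherently.
    acc = 0.0 + 0.0j
    for samples, (sx, sy, sz) in zip(pulses, sensor_pos):
        R = math.sqrt((sx - px) ** 2 + (sy - py) ** 2 + (sz - pz) ** 2)
        bin_idx = round((2.0 * (R - r0) / C) * fs)  # fast-time sample index
        if 0 <= bin_idx < len(samples):
            acc += samples[bin_idx] * cmath.exp(4j * math.pi * fc * R / C)
    return acc
```

In the GPU version each thread evaluates one (pixel, pulse-block) combination of this loop, and the coherent sum over pulses is the accumulation step that the patent says is "repeatedly executed" on the device ends.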
The BP imaging flow of the SAR image is analysed to determine the relations among tasks, and the tasks are divided according to their dependency and independence. Dependency means a causal relation exists between two tasks, with some or all data passed between them; independence means the two tasks can exchange processing order without affecting the result.
The radar echo data after BP imaging processing is read back from the GPU device end to the CPU host end, the result is written to a data file on the CPU end, and the data is read and displayed in MATLAB. Fig. 11(a) shows the radar echo signal obtained by the satellite-borne SAR in the invention, and fig. 11(b) shows the image obtained by processing that radar echo signal.
Compared with serial processing of the radar echo data on the CPU host end, parallel processing on the GPU device end must account for the fact that the GPU and CPU host do not share memory, and the GPU cannot efficiently read large amounts of CPU memory directly; the radar echo data therefore has to be transferred from the CPU host end into GPU device memory, and when multiple GPU device ends are present, the simulation parameters must likewise be transferred as needed from the CPU host into each GPU's global memory. During the echo computation, GPU kernel functions are launched and a large number of device-side threads are allocated to them; each thread corresponds to the echo of one radar target point at one scan instant, and the threads compute in parallel inside the kernel to complete the processing of the bulk of the data. To prevent memory conflicts on the GPU device end while processing the SAR echo data, the CPU host end additionally configures kernel functions that invoke multiple threads to complete the related device-side computations.
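The azimuth-wise data split across servers and GPUs described here can be sketched host-side as follows. The even, contiguous partition is an illustrative choice; the real system would additionally respect each GPU's memory budget (cf. the 0.3 × global-memory limit in claim 4).

```python
def partition_azimuth(Na, n_servers, gpus_per_server):
    # Sketch of the azimuth-wise split: Na echo lines are divided first
    # across server CPU hosts and then across each host's GPUs.
    # Returns {(server, gpu): (start, stop)} half-open line ranges.
    workers = n_servers * gpus_per_server
    base, extra = divmod(Na, workers)
    ranges, start = {}, 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)  # spread the remainder
        ranges[(w // gpus_per_server, w % gpus_per_server)] = (start, start + size)
        start += size
    return ranges

# 1000 azimuth lines over 2 servers with 4 GPUs each: 125 lines per GPU.
split = partition_azimuth(Na=1000, n_servers=2, gpus_per_server=4)
```

Each (server, gpu) pair then copies only its own line range to device memory, which is what keeps the per-GPU transfers and memory footprint bounded.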
Simulation comparison: fig. 10 shows the imaging result of the measured data without the multi-server multi-GPU based satellite-borne SAR imaging real-time processing method of the invention; compared with fig. 11(b), there is no significant difference in the images obtained from the radar echo signals. For SAR echo data simulation with the same number of points, serial BP imaging takes 5434.18 s while the multi-server multi-GPU version takes 4 s, an acceleration ratio of 1358.5×, effectively reducing simulation time.
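The quoted acceleration ratio follows directly from the two timings reported above:

```python
serial_s = 5434.18   # serial BP imaging time, seconds (from the text)
parallel_s = 4.0     # multi-server multi-GPU time, seconds (from the text)
speedup = serial_s / parallel_s  # 5434.18 / 4 = 1358.545, i.e. ~1358.5x
```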
Although the present invention has been described in detail in this specification with reference to specific illustrative embodiments, it will be apparent to those skilled in the art that modifications and improvements can be made on the basis of the invention; such modifications and improvements are intended to fall within the scope of the invention as claimed.

Claims (7)

1. A multi-server multi-GPU based satellite-borne SAR imaging real-time processing method is characterized by comprising the following specific steps:
step 1, reading satellite-borne SAR (synthetic aperture radar) echo data on any one server CPU (central processing unit) host end, and distributing the radar echo signal data among the plurality of server CPU host ends using secret-free communication among them;
step 2, reading corresponding radar echo signal data parameters and corresponding information of a plurality of GPU equipment ends at each server CPU host end, and setting the thread number of the server CPU host end;
step 3, each GPU device end reads from its corresponding CPU host end the satellite-borne SAR echo data divided for it, and GPU device memory is allocated;
step 4, each server CPU host end transmits the complex echo of the radar echo signal to its corresponding GPU device end;
step 5, each GPU equipment end realizes pulse compression of the distance direction of the complex echoes;
step 6, realizing BP imaging on a GPU device end controlled by each server CPU host end;
and 7, acquiring the BP images on all GPU device ends corresponding to all the server CPU host ends, writing the BP images into a radar echo data processing file, reading the processed radar echo data with MATLAB, plotting it, and analysing the imaging result.
2. The multi-server multi-GPU based satellite-borne SAR imaging real-time processing method according to claim 1, wherein in step 1, the secret-free communication between any one server CPU host end and the other server CPU host ends is realized through MPI (message passing interface).
3. The multi-server multi-GPU based spaceborne SAR imaging real-time processing method according to claim 2, wherein in step 2, the corresponding information of each GPU device end read by each server CPU host end comprises: the graphics-card model, the device compute capability, the total global memory, the upper and lower limits of the device's grid/block/thread division, and the number of points Nr * Na of the satellite-borne SAR echo data to be processed (Nr and Na denote the range and azimuth dimensions of the echo data, respectively, and * denotes multiplication); the server CPU host end sets up as many host-end threads as there are GPU device ends according to the read GPU device information, i.e. one server CPU host thread controls one GPU device end, and OpenMP allocates the radar echo signal data processing tasks and the parameter calculation tasks of the radar back-projection (BP) imaging algorithm to different CPU host threads, realizing task-level parallelism.
4. The multi-server multi-GPU based spaceborne SAR imaging real-time processing method according to claim 3, wherein in step 3, the server CPU host end reads the corresponding GPU device information and sets the actually usable GPU memory to 0.3 times the total global memory given in the GPU device information.
5. The multi-server multi-GPU based spaceborne SAR imaging real-time processing method according to claim 4, wherein in step 4, each server CPU host end processes the received radar echo signal data according to its type; the radar echo signal data arrives at the server CPU host end in one of two cases, one with the real and imaginary parts transmitted separately and one with a complete complex signal transmitted: (1) if the radar echo signals are transmitted with real and imaginary parts separated, the server CPU host end adaptively allocates memory on the GPU device end according to the size of the transmitted satellite-borne SAR echo signal data, transmits the real part and the imaginary part to the GPU device end respectively, and then completes their splicing on the GPU device end using the kernel function GnrEcho; (2) if the radar echo signal data is already a complete complex echo signal, this step is skipped.
6. The multi-server multi-GPU based satellite-borne SAR imaging real-time processing method according to claim 5, wherein in step 5, the GPU device end calls the fast Fourier transform (FFT) related functions of the NVIDIA CUDA library, performs a range-direction FFT on the complex echo data read by the GPU device end so that the radar echo complex data is transformed to the range-frequency, azimuth-time domain, and multiplies the radar echo signal data in the range frequency domain by a matched filter to realize radar echo signal pulse compression:
specifically, based on the radar echo signal data division in step 3, OpenMP is used at the server CPU host end to implement CPU multithread control, and within each server CPU host thread the multiple GPU device ends are further controlled to perform the related parallel computation: (1) the CUDA library function cuFFT is called on the GPU device end to perform a range-direction FFT on the azimuth-blocked radar echo signal data read from the server CPU host, transforming the echo signal to the range frequency domain in preparation for the subsequent pulse compression; (2) according to the radar simulation parameters, matched filtering is performed on the echo data in the azimuth-time, range-frequency domain, i.e. a matched filter is generated for each range bin nrn (the index along the range dimension Nr) by the formula exp(j * pi * f^2 * Tp / Bp), with f = nrn / Nr * fr, where f is the range frequency, Bp is the signal bandwidth, Tp is the pulse duration, and fr is the range sampling frequency (Bp / Tp being the chirp rate); (3) finally, the radar echo data from operation (1) is complex-multiplied by the matched filter to finish range pulse compression, which is realized on the GPU device end by the kernel function Pul_Press.
7. The multi-server multi-GPU based satellite-borne SAR imaging real-time processing method according to claim 6, wherein in step 6, during BP imaging the parameters required for range pulse compression of the satellite-borne SAR echo data and for BP imaging are analysed, up-sampling of the echo pulses is realized by interpolation, and the dependencies before and after the BP imaging processing are identified; operation flows with data dependency and large for-loop volume are accelerated with multithreading on the CPU host end, whereas operation flows without data dependency and with large cycle size are accelerated with CUDA on the GPU device end:
specifically, up-sampling of echo pulses is realized by utilizing sinc interpolation before BP imaging, a CPU host end distributes each parameter calculation result and the up-sampled echo pulses to different GPU equipment ends, and the projection and accumulation are repeatedly executed through the GPU equipment ends to realize the BP imaging;
in addition, for BP imaging and for the data-independent operations in the parameter calculation preceding it, the operations are realized by the following kernel functions on the GPU device end:
Kernel1: GnrEcho: if the satellite-borne SAR radar echo data arrives with real and imaginary parts separated, this kernel function combines them;
Kernel2: Pul_Press: multiplies the range-FFT'd radar echo data by the matched filter to realize range pulse compression of the radar echo data;
Kernel3: InitSensorPos: calculates the radar position information and constructs the radar-target coordinate system;
Kernel4: Init_Sig_stav: selects the effective areas of the initial position and radar position;
Kernel5, Kernel6: Init_Grdx and Init_Grd_Slt_y: divide the ground grid;
Kernel7, Kernel8: Init_SenPos and Init_AzSE: select the effective aperture;
Kernel9: ProcessPoint: completes the sinc interpolation of the echo data to realize up-sampling, computes the instantaneous slant range between the radar antenna phase centre and the target, applies the range phase compensation, and finally completes BP imaging;
since the data processed by Kernel1, Kernel2, Kernel3 and Kernel4 have data dependency, the data is distributed to the GPU device end and the kernel functions are run sequentially, one by one;
the input data of Kernel5 and Kernel6 depend on the data processed by the four kernel functions Kernel1, Kernel2, Kernel3 and Kernel4; after those four kernels have run, their results remain on the GPU device end. Kernel5 and Kernel6 must wait for the Kernel1-Kernel4 operations to complete before executing, so a synchronization barrier is added before the Kernel5 kernel is launched;
Kernel7 and Kernel8 need to wait for Kernel3 to finish before executing; Kernel3, Kernel7 and Kernel8 can be placed in the same work stream, reducing as much as possible the time spent on data transfers between the GPU device end and the CPU host end;
after the above kernel function operations are completed, BP imaging is completed using Kernel9;
considering memory consumption and overlapped computation among the GPU device ends, data migration based on azimuth-major storage, including the migration of rows and columns, can be computed, realizing multi-GPU parallel processing and further improving algorithm efficiency.
CN202111096539.8A 2021-09-18 2021-09-18 Multi-server multi-GPU-based satellite-borne SAR imaging real-time processing method Pending CN113885028A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111096539.8A CN113885028A (en) 2021-09-18 2021-09-18 Multi-server multi-GPU-based satellite-borne SAR imaging real-time processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111096539.8A CN113885028A (en) 2021-09-18 2021-09-18 Multi-server multi-GPU-based satellite-borne SAR imaging real-time processing method

Publications (1)

Publication Number Publication Date
CN113885028A true CN113885028A (en) 2022-01-04

Family

ID=79009905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111096539.8A Pending CN113885028A (en) 2021-09-18 2021-09-18 Multi-server multi-GPU-based satellite-borne SAR imaging real-time processing method

Country Status (1)

Country Link
CN (1) CN113885028A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116299463A (en) * 2023-05-16 2023-06-23 四川天府新区北理工创新装备研究院 Small sar imaging system and method based on rear end of general computing device
CN116299463B (en) * 2023-05-16 2023-08-08 四川天府新区北理工创新装备研究院 Small sar imaging system and method based on rear end of general computing device

Similar Documents

Publication Publication Date Title
CN107229051B (en) GPU-based video SAR echo simulation parallel implementation method
Ogata et al. An efficient, model-based CPU-GPU heterogeneous FFT library
US9741160B2 (en) Shadowing method for ray tracing based on geometrical stencils
US20170236325A1 (en) Device and process for improving efficiency of image rendering
CN110515053B (en) CPU and multi-GPU based heterogeneous platform SAR echo simulation parallel method
CN102609978A (en) Method for accelerating cone-beam CT (computerized tomography) image reconstruction by using GPU (graphics processing unit) based on CUDA (compute unified device architecture) architecture
DE102018114929A1 (en) System and method for near-field light field rendering for an interactive three-dimensional computer graphics with a wide field of view
CN111160545A (en) Artificial neural network processing system and data processing method thereof
US8212825B1 (en) System and method for geometry shading
US8130223B1 (en) System and method for structuring an A-buffer to support multi-sample anti-aliasing
CN105911532B (en) Synthetic aperture radar echo Parallel Simulation method based on depth collaboration
Montani et al. Parallel volume visualization on a hypercube architecture
Fang et al. CPU/GPU near real-time preprocessing for ZY-3 satellite images: Relative radiometric correction, MTF compensation, and geocorrection
US9007372B2 (en) System for primary ray shooting having geometrical stencils
Sugimoto et al. Improving cache locality for GPU-based volume rendering
CN112258378A (en) Real-time three-dimensional measurement system and method based on GPU acceleration
CN113885028A (en) Multi-server multi-GPU-based satellite-borne SAR imaging real-time processing method
CN113406572B (en) Radar parallel processing system and method, storage medium and terminal
Rahman et al. Parallel implementation of a spatio-temporal visual saliency model
CN113359134B (en) SAR data distributed real-time imaging processing system and method based on embedded GPU
CN113870089A (en) GPU parallelization-based phase-dependent SAR image dense registration method and system
Edgar et al. Enabling a high throughput real time data pipeline for a large radio telescope array with GPUs
DE112019001978T5 (en) IMPROVING THE REALISM OF SCENES WITH WATER SURFACES DURING RENDERING
CN106484532B (en) GPGPU parallel calculating method towards SPH fluid simulation
CN112614207A (en) Contour line drawing method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination