CN113283505A - Radar data AP clustering method based on GPU - Google Patents

Radar data AP clustering method based on GPU

Info

Publication number
CN113283505A
Authority
CN
China
Prior art keywords
matrix
gpu
data
threads
maximum value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110571635.7A
Other languages
Chinese (zh)
Inventor
Li Yunjie (李云杰)
Li Yan (李岩)
Liu Bowen (刘博文)
Zhang Zilin (张滋林)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202110571635.7A priority Critical patent/CN113283505A/en
Publication of CN113283505A publication Critical patent/CN113283505A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/54 - Interprogram communication
    • G06F9/544 - Buffers; Shared memory; Pipes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 - General purpose image data processing
    • G06T1/20 - Processor architectures; Processor configuration, e.g. pipelining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a GPU-based parallelization method for AP (Affinity Propagation) clustering. The method analyzes the overall flow of the AP clustering algorithm, plans which parts of the algorithm the CPU and the GPU processor are each responsible for, analyzes the time consumption of the AP clustering algorithm, and applies GPU parallel optimization to the most time-consuming part of the algorithm; the key techniques involved are grouping, finding maxima, reduction summation, and shared memory. Compared with an implementation of the AP clustering algorithm on the CPU, the running time is greatly reduced; the time consumption and the degree of optimization of the algorithm are tested and analyzed, and the actual test results also demonstrate the high performance of the method.

Description

Radar data AP clustering method based on GPU
Technical Field
The invention relates to the intersection of radar electronic reconnaissance, artificial intelligence and GPU (graphics processing unit) parallel computing, and in particular to a GPU-based radar data AP (Affinity Propagation) clustering method.
Background
Radar electronic countermeasures are of great importance in modern warfare. The main task in the field of radar electronic warfare is to perceive and sort intercepted radar signals so that subsequent tasks such as interference identification can be carried out. With the development of radar technology, radar signals have become increasingly complex, and traditional radiation-source sorting algorithms based on histogram statistics are no longer suited to current scenarios. By virtue of their real-time performance, multi-dimensional parameters and intelligence, clustering algorithms have gradually become the mainstream radar sorting method.
The unsupervised and highly adaptive nature of the Affinity Propagation (AP) algorithm makes it an effective clustering algorithm for radar signal sorting. Because the AP algorithm must update the attraction (responsibility) value and attribution (availability) value of every data point in each iteration, its complexity is high and its running time is long for large data volumes. The hardware structure of a traditional CPU entails high multi-threading overhead and makes parallel processing difficult to realize, so the real-time performance of intelligent algorithms implemented on it is too low to meet the radar signal processing tasks of highly complex scenarios.
Disclosure of Invention
In view of this, the present invention provides a GPU-based radar data AP clustering method that can greatly reduce the running time.
A GPU-based radar data AP clustering method: first, a similarity matrix S formed by the similarities between the data objects in a radar data set is calculated, and an attraction matrix R and an attribution matrix A with the same dimensions as matrix S are initialized; then matrices R and A are updated continuously through iteration, matrix R and matrix A are added to obtain a matrix E, and the cluster centers are found from matrix E; when the cluster centers meet the set requirement, the iteration exits, and finally the data objects in the data set are assigned to each cluster center, completing the clustering. The process of updating matrix R and matrix A comprises the following steps:
step (1), adding matrix A and matrix S to obtain a matrix S1;
step (2), dividing the N data in each row of matrix S1 into i groups of j data each, where N = i × j and i and j are chosen to be as close in value as possible; allocating i × N threads in the GPU so that each thread is responsible for finding the maximum value, the second-largest value and the index of the maximum value within its group; then using N threads, one per row, to find the maximum value, the second-largest value and the index of the maximum value across the groups of each of the N rows, recorded as S1m, S1s and S1ma respectively;
step (3), subtracting the maximum value S1m from the S matrix row by row, and replacing the element value at the index position S1ma of each row of the resulting matrix with the value of S1s, to obtain a matrix R1;
step (4), updating matrix R according to the formula R = a·R + (1-a)·R1, where a is a weight taking a value between 0 and 1; the larger the value, the fewer the final cluster centers, so it is set according to requirements;
step (5), allocating N × N threads in the GPU, each thread being responsible for replacing one off-diagonal element of the R matrix with the larger of that element and 0, the resulting matrix being denoted RP;
step (6), summing each column of matrix RP to obtain a vector RPsum; subtracting, column by column, the element values of the corresponding column of RP from each entry of RPsum, to obtain a difference matrix of the same size as RP; and finally replacing every off-diagonal element of the difference matrix with the smaller of that element and 0, to obtain a matrix A1;
step (7), updating matrix A according to the formula A = a·A + (1-a)·A1;
step (8), repeating steps (1) to (7) until the cluster centers found from matrix E meet the set requirement, thereby completing the updating of matrix R and matrix A.
Preferably, the radar data includes radar PDW data, radar I/Q waveform data, and radar time-frequency diagram data.
Preferably, the similarity matrix S is computed using negative Euclidean distances.
Preferably, the updating process for matrix R and matrix A is implemented in the global memory of the GPU.
Preferably, the updating process for matrix R and matrix A is implemented in a shared memory of the GPU.
Preferably, in step (6): N × N threads are allocated in the GPU; first N × N / 2 threads are used for the reduction summation, then N × N threads perform the matrix subtraction, and finally N × N threads replace the data units in the matrix.
Preferably, in step (7): N × N threads are allocated in the GPU, each responsible for the scalar multiplication and addition of one data unit of matrix A.
Preferably, in step (4), N × N threads are allocated in the GPU, each responsible for the scalar multiplication and addition of one data unit of matrix R.
Preferably, in step (1), N × N threads are allocated in the GPU to add matrix A and matrix S to obtain matrix S1, so that each thread is responsible for the addition of one pair of data units and the N × N threads operate simultaneously; N is the number of data objects in the radar data set.
Preferably, in step (3), N × N threads are allocated in the GPU, each thread first subtracting the maximum value S1m from the S matrix row by row; then N threads are reused to replace the element value at the S1ma index position of each row of the resulting matrix with the value of S1s, to obtain the matrix R1.
The invention has the following beneficial effects:
The invention provides a GPU-based parallelization method for AP clustering. The method analyzes the overall flow of the AP clustering algorithm, plans which parts of the algorithm the CPU and the GPU processor are each responsible for, analyzes the time consumption of the AP clustering algorithm, and applies GPU parallel optimization to the most time-consuming part of the algorithm; the key techniques involved are grouping, finding maxima, reduction summation, and shared memory. Compared with an implementation of the AP clustering algorithm on the CPU, the running time is greatly reduced; the time consumption and the degree of optimization of the algorithm are tested and analyzed, and the actual test results also demonstrate the high performance of the method.
Drawings
Fig. 1 is a general flowchart of the AP clustering algorithm according to an embodiment of the present invention;
Fig. 2 is a decomposition diagram of the most time-consuming part of the algorithm according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the grouped max-finding optimization according to an embodiment of the present invention;
Fig. 4 is a reduction-summation diagram according to an embodiment of the present invention.
Detailed Description
The invention is described in detail below by way of example with reference to the accompanying drawings.
The technical scheme of the invention is as follows: a GPU-based radar data AP clustering method specifically comprises the following steps:
Step one: overall flow of the AP clustering algorithm. First, the similarities between the data objects of the radar data set are calculated to form an N × N similarity matrix S (N being the number of data objects in the set), and an attraction matrix R and an attribution matrix A with the same dimensions as S are initialized; then matrices R and A are updated according to the formulas; then R and A are added to obtain matrix E, and the cluster centers are found; when the cluster centers no longer change or the specified number of iterations is reached, the iteration exits, otherwise the update of R and A is repeated; finally, the data objects of the data set are assigned to each cluster center, completing the clustering. The radar data includes PDW data, radar I/Q waveform data, radar time-frequency diagrams, and other data.
Step two: time-consumption analysis of the AP clustering algorithm. The overall processing flow for implementing the AP clustering algorithm on the GPU is arranged as follows: first, the data set is transferred from the CPU to the GPU, and the similarity matrix S is computed in the GPU; then matrices R and A are initialized in the GPU; then matrices R, A and E are iteratively updated in the GPU and the cluster centers are calculated; the cluster centers are transferred from the GPU to the CPU, where the exit condition is judged and the iteration exits; finally, the data set is partitioned in the CPU according to the data objects and the cluster centers.
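To make this CPU/GPU division of work concrete, the following host-side sketch is offered purely as an illustration (it is not part of the patent text): the function and kernel names and the float data type are assumptions, error checking is omitted, and the kernel launches appear only as comments because the device code is sketched separately further below. The point it shows is that the data set is copied to the GPU once, the iterative updates stay on the device, and only the small cluster-center vector is copied back.

```cuda
// Illustrative host-side sketch of the CPU/GPU work split (assumed names).
#include <cuda_runtime.h>
#include <vector>

void apClusterOnGpu(const std::vector<float>& hostData, int n, int d,
                    int iterations, std::vector<int>& hostCenters) {
    float *dData, *dS, *dR, *dA;
    int *dCenters;
    cudaMalloc(&dData, n * d * sizeof(float));
    cudaMalloc(&dS, n * n * sizeof(float));
    cudaMalloc(&dR, n * n * sizeof(float));
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dCenters, n * sizeof(int));

    // 1) copy the radar data set from the CPU to the GPU once
    cudaMemcpy(dData, hostData.data(), n * d * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemset(dR, 0, n * n * sizeof(float));   // R and A start at zero
    cudaMemset(dA, 0, n * n * sizeof(float));

    // 2) build the similarity matrix S on the GPU
    //    e.g. similarityKernel<<<grid, block>>>(dData, dS, n, d);

    // 3) iterate the R / A updates entirely on the GPU
    for (int it = 0; it < iterations; ++it) {
        // launch the update kernels for R and A here (sketched below)
    }

    // 4) mark the cluster centers on the GPU and copy only that small
    //    flag vector back to the CPU for the final partitioning step
    //    e.g. markCentersKernel<<<(n + 255) / 256, 256>>>(dR, dA, dCenters, n);
    hostCenters.resize(n);
    cudaMemcpy(hostCenters.data(), dCenters, n * sizeof(int),
               cudaMemcpyDeviceToHost);

    cudaFree(dData); cudaFree(dS); cudaFree(dR);
    cudaFree(dA); cudaFree(dCenters);
}
```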
The most time-consuming part of the algorithm is the iterative updating of matrices R and A. Because the update formulas for R and A are complex and this step runs through many iterations, it takes the longest time during the running of the algorithm, accounting for more than 90% of the total run time.
Step three: decomposing the most time-consuming part of the AP clustering algorithm and optimizing it in parallel, i.e. performing parallel optimization on the part that iteratively updates matrices R and A.
Step four: using the shared memory in the GPU for the AP clustering algorithm to reduce data access time. GPU memory is divided into shared memory and global memory, and shared memory is faster to access than global memory. Therefore, the parallel-optimized AP clustering algorithm of step three can be implemented in global memory, and it can also be implemented in shared memory, which further improves the running efficiency.
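As a minimal illustration of the shared-versus-global-memory point (a sketch under assumed names and float data, not the patent's implementation): each thread block stages one row of an N × N matrix into on-chip shared memory once, and every subsequent access by the block's threads then hits the faster on-chip storage instead of global memory.

```cuda
// Illustrative sketch only: 'out' is assumed to be zero-initialized
// (e.g. with cudaMemset) before the launch.
__global__ void rowSumShared(const float* mat, float* out, int n) {
    extern __shared__ float row[];               // n floats of shared memory
    int r = blockIdx.x;                          // one block per row
    for (int c = threadIdx.x; c < n; c += blockDim.x)
        row[c] = mat[r * n + c];                 // cooperative load of the row
    __syncthreads();                             // row now resides on-chip
    float acc = 0.0f;                            // reuse the row from shared
    for (int c = threadIdx.x; c < n; c += blockDim.x)
        acc += row[c];
    atomicAdd(&out[r], acc);                     // per-row partial sums
}
// launch: rowSumShared<<<N, 256, N * sizeof(float)>>>(d_mat, d_rowSums, N);
```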
Step five: overall time-consumption analysis. Through the GPU parallel optimization of the most time-consuming part of the AP clustering algorithm and the use of shared memory, the overall run time of the algorithm is reduced by a factor of more than 40 compared with running it on a CPU.
In the AP clustering algorithm flow described in step one, the similarity matrix S is computed using the negative Euclidean distance. The similarity matrix S is first constructed from the data; for any two samples i and j, its value is calculated as S(i, j) = -||x_i - x_j||^2.
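A sketch of this similarity computation as a CUDA kernel, with one thread per (i, j) entry, is shown below; the kernel name, the float type and the 2-D launch geometry are illustrative assumptions (the embodiment later stores S as double).

```cuda
// Illustrative sketch: one thread per (i, j) entry computes the negative
// squared Euclidean distance between samples i and j of a row-major
// N x D data set, filling the N x N similarity matrix S.
__global__ void similarityKernel(const float* data, float* S, int n, int d) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n || j >= n) return;
    float dist2 = 0.0f;
    for (int k = 0; k < d; ++k) {
        float diff = data[i * d + k] - data[j * d + k];
        dist2 += diff * diff;
    }
    S[i * n + j] = -dist2;                       // S(i, j) = -||xi - xj||^2
}
// launch for N = 1024, D = 6:
//   dim3 block(16, 16), grid((1024 + 15) / 16, (1024 + 15) / 16);
//   similarityKernel<<<grid, block>>>(d_data, d_S, 1024, 6);
```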
The AP algorithm then constructs and iteratively updates the attraction matrix R and the attribution matrix A from the similarity matrix S, and performs the clustering.
When the preset number of iterations is reached, matrices R and A are summed; the samples corresponding to the positions of the diagonal elements of the resulting matrix that are greater than 0 are the cluster centers.
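The cluster-center test can be done with one thread per sample, as in the following illustrative sketch (kernel name and float/int types are assumptions): each thread checks the diagonal entry of E = R + A and writes a flag that can be copied back to the CPU for the final assignment of data objects to centers.

```cuda
// Illustrative sketch: flag sample k as a cluster center if E(k, k) > 0.
__global__ void markCentersKernel(const float* R, const float* A,
                                  int* isCenter, int n) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n) return;
    float e_kk = R[k * n + k] + A[k * n + k];    // diagonal of E = R + A
    isCenter[k] = (e_kk > 0.0f) ? 1 : 0;
}
```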
In the third step, parallel optimization is performed on the most time-consuming part of the AP clustering algorithm, and the method specifically comprises the following steps:
(1) Add matrix A and matrix S to obtain matrix S1. N × N threads are allocated in the GPU, each responsible for the addition of one pair of data units, and the N × N threads execute simultaneously, so that the matrix addition, which takes O(N × N) time when computed serially on a CPU, is completed in a single parallel pass.
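A minimal sketch of this element-wise addition, assuming row-major float matrices and one thread per element (kernel name and launch configuration are illustrative):

```cuda
// Illustrative sketch of step (1): S1 = S + A in one parallel pass.
__global__ void addKernel(const float* S, const float* A, float* S1, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n * n)
        S1[idx] = S[idx] + A[idx];
}
// launch: addKernel<<<(N * N + 255) / 256, 256>>>(d_S, d_A, d_S1, N);
```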
(2) Find the maximum value S1m of each row, the second-largest value S1s of each row, and the index S1ma of each row's maximum in the S1 matrix. The maximum, the second-largest value and the index of the maximum are found simultaneously in the GPU using a grouped max-finding method. The N data in each row of the matrix are divided into i groups of j data each, with N = i × j, where i and j are chosen to be as close in value as possible. i × N threads are allocated in the GPU and made responsible for the search within each group, and then N threads are reused, one per row, for the search across the groups of the N rows of data.
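The grouped max-finding can be sketched as the two kernels below. The auxiliary per-group buffers gMax, gSec and gArg (of size N × i) and the float type are illustrative assumptions, not identifiers from the patent: the first kernel finds each group's maximum, second-largest value and maximum index, and the second combines the group results row by row into S1m, S1s and S1ma.

```cuda
// Illustrative two-phase sketch of the grouped max-finding.
#include <cfloat>

__global__ void groupMaxKernel(const float* S1, float* gMax, float* gSec,
                               int* gArg, int n, int groups, int j) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // one per (row, group)
    if (idx >= n * groups) return;
    int row = idx / groups, grp = idx % groups;
    float best = -FLT_MAX, second = -FLT_MAX;
    int bestCol = -1;
    for (int t = 0; t < j; ++t) {                      // scan this group
        int col = grp * j + t;
        if (col >= n) break;
        float v = S1[row * n + col];
        if (v > best)        { second = best; best = v; bestCol = col; }
        else if (v > second) { second = v; }
    }
    gMax[idx] = best; gSec[idx] = second; gArg[idx] = bestCol;
}

__global__ void rowMaxKernel(const float* gMax, const float* gSec,
                             const int* gArg, float* S1m, float* S1s,
                             int* S1ma, int n, int groups) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per row
    if (row >= n) return;
    float best = -FLT_MAX, second = -FLT_MAX;
    int bestCol = -1;
    for (int g = 0; g < groups; ++g) {                 // combine group results
        int idx = row * groups + g;
        if (gMax[idx] > best) {
            second = fmaxf(best, gSec[idx]);
            best = gMax[idx]; bestCol = gArg[idx];
        } else {
            second = fmaxf(second, gMax[idx]);
        }
    }
    S1m[row] = best; S1s[row] = second; S1ma[row] = bestCol;
}
```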
(3) Subtract the maximum value S1m from the S matrix row by row, and replace the element values at the S1ma index positions of the resulting matrix with the values of S1s, obtaining the R1 matrix. N × N threads are allocated in the GPU; each thread first performs the subtraction for one data unit, and then N threads are reused for the replacement at the maximum-index position of each row.
(4) Update the R matrix according to the formula R = a·R + (1-a)·R1. As in (1), the N × N threads allocated in the GPU are each responsible for the scalar multiplication and addition of one data unit. Here a is a weight between 0 and 1; the larger its value, the fewer the final cluster centers, so the weight is set according to requirements.
(5) Replace every element of the R matrix except the diagonal elements with the larger of that element and 0, obtaining the RP matrix. As in (1), N × N threads are allocated in the GPU, each responsible for the replacement of one data unit.
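Both of these passes are purely element-wise, so each maps one thread to one matrix entry. The following sketch (illustrative, float assumed, kernel names invented here) covers the damped update of step (4) and the off-diagonal max(·, 0) replacement of step (5).

```cuda
// Illustrative sketch of steps (4) and (5); 'a' is the damping weight in (0, 1).
__global__ void dampedUpdateKernel(float* R, const float* R1, float a, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n * n)
        R[idx] = a * R[idx] + (1.0f - a) * R1[idx];    // R = a*R + (1-a)*R1
}

__global__ void clampPositiveKernel(const float* R, float* RP, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n * n) return;
    int row = idx / n, col = idx % n;
    // off-diagonal entries become max(R, 0); diagonal entries are kept as-is
    RP[idx] = (row == col) ? R[idx] : fmaxf(R[idx], 0.0f);
}
```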
(6) Sum each column of the RP matrix to obtain a vector RPsum; subtract, column by column, the element values of the corresponding column of RP from each entry of RPsum, obtaining a difference matrix of the same size as RP; finally, replace every off-diagonal element of the resulting matrix with the smaller of that element and 0 to obtain the A1 matrix. The vector RPsum is calculated in the GPU by reduction summation. N × N threads are allocated in the GPU; first N × N / 2 threads perform the reduction summation, then N × N threads perform the matrix subtraction, and finally N × N threads replace the data units of the matrix.
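A sketch of the column-wise reduction summation, assuming one thread block per column and a power-of-two block size (e.g. 512 for N = 1024, which matches the N × N / 2 threads of the first reduction pass mentioned above); the kernel name and float type are illustrative assumptions.

```cuda
// Illustrative sketch: strided pre-sum, then a shared-memory tree reduction.
__global__ void columnSumKernel(const float* RP, float* RPsum, int n) {
    extern __shared__ float buf[];
    int col = blockIdx.x;                        // one block per column
    int tid = threadIdx.x;
    float acc = 0.0f;                            // strided pre-sum of the column
    for (int row = tid; row < n; row += blockDim.x)
        acc += RP[row * n + col];
    buf[tid] = acc;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();                         // tree reduction step
    }
    if (tid == 0) RPsum[col] = buf[0];
}
// launch: columnSumKernel<<<N, 512, 512 * sizeof(float)>>>(d_RP, d_RPsum, N);
```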
(7) Update the A matrix according to the formula A = a·A + (1-a)·A1. As in (1), the N × N threads allocated in the GPU are each responsible for the scalar multiplication and addition of one data unit.
Embodiment:
(1) Experimental scenario setup
In the experiment, the scenario is set as radar pulse description word (PDW) data containing 6-dimensional parameters, and the time consumption of the algorithm is tested with an input data volume of N × 6. The value of N is varied and the test is repeated multiple times to obtain more comprehensive results. The actual running platform of the algorithm is shown in Table 1: the processor is an NVIDIA Jetson AGX Xavier; the CPU is an 8-core ARM v8.2 64-bit processor with a clock frequency of 2.26 GHz, an 8 MB L2 cache and a 4 MB L3 cache; the GPU is a 512-core Volta-architecture GPU with a clock frequency of 1.37 GHz; the memory is 32 GB of 256-bit LPDDR4; the hard disk is a Samsung 970 EVO SSD with a capacity of 2 TB; the platform operating system is Ubuntu 18.04.
Table 1. Test hardware platform configuration
Processor         NVIDIA Jetson AGX Xavier
CPU               8-core ARM v8.2 64-bit, 2.26 GHz, 8 MB L2 cache + 4 MB L3 cache
GPU               512-core Volta architecture, 1.37 GHz
Memory            32 GB 256-bit LPDDR4
Storage           Samsung 970 EVO SSD, 2 TB
Operating system  Ubuntu 18.04
(2) Evaluation index setting
Experiments evaluate the time consumption of the AP clustering algorithm in different processor environments, comparing the time consumed on the CPU with that on the GPU, and the comparison is analyzed under different data volumes by changing the size of the data sample.
(3) Experimental procedure
The general flowchart of the AP clustering algorithm implemented on the GPU is shown in Fig. 1. Specifically, the CPU transmits the PDW data set to the GPU, the data are processed by the AP clustering algorithm in the GPU to obtain the cluster centers, and the cluster centers are returned to the CPU.
The method mainly targets the parallel optimization of the most time-consuming part of the AP clustering algorithm, namely the update-iteration part for matrices R and A shown in Fig. 2. The specific experimental procedure is as follows:
Step one: taking actual PDW data of size 1024 × 6 as an example, the resulting similarity matrix S is a 1024 × 1024 matrix of type double. Shared memory is allocated in the GPU and the S matrix is stored in it; matrices A and R are initialized with dimensions 1024 × 1024 and all values 0, and are likewise stored in shared memory.
Step two: compute S1 = S + A. A shared memory space of size 1024 × 1024 is initialized in the GPU for storing the S1 matrix; 1024 × 1024 threads are allocated in the GPU, each responsible for adding one data unit of matrix S and matrix A, and the result is written to the corresponding position of matrix S1.
Step three: find the per-row maximum vector S1m, the per-row second-largest vector S1s, and the per-row maximum-index vector S1ma of the S1 matrix. Three shared memory spaces of size 1024 are initialized in the GPU for storing the vectors S1m, S1s and S1ma respectively. The maximum, the second-largest value and the index of the maximum are found simultaneously in the GPU with the grouped max-finding method, as shown in Fig. 3. The 1024 data in each row of the matrix are divided into 32 groups of 32 data each, giving 1024 × 32 groups in total. 1024 × 32 threads are allocated in the GPU to find the within-group maximum, second-largest value and maximum index of each group; then 1024 threads are reused to find the maximum and second-largest value across the 32 groups of each row of the matrix and to record the index of the maximum, and the results are written to the corresponding positions of the vectors S1m, S1s and S1ma.
Step four: subtract the S1m vector from the S matrix row by row, and replace the elements at the index positions stored in the S1ma vector of the resulting matrix with the values of the S1s vector, obtaining the R1 matrix. A shared memory space of size 1024 × 1024 is initialized in the GPU for storing the R1 matrix. 1024 × 1024 threads are allocated in the GPU; each thread first performs the subtraction between one data unit of the S matrix and the S1m vector, and then 1024 threads are reused to replace the values at the maximum-index positions of each row of the result matrix with the S1s vector data; the results are written to the corresponding positions of the R1 matrix.
Step five: update the R matrix according to the formula R = a·R + (1-a)·R1. 1024 × 1024 threads are allocated in the GPU, each responsible for the scalar multiplication and addition of one data unit of the matrices R and R1, and the result is written back to the corresponding position of the R matrix.
Step six: replace every off-diagonal element of the R matrix with the larger of that element and 0 to obtain the RP matrix. A shared memory space of size 1024 × 1024 is initialized in the GPU for storing the RP matrix. 1024 × 1024 threads are allocated in the GPU, each responsible for the comparison and replacement of one data unit of the R matrix, and the result is written to the corresponding position of the RP matrix.
Step seven: sum each column of the RP matrix to obtain a vector RPsum; subtract the RP matrix from RPsum column by column; finally, replace every off-diagonal element of the resulting matrix with the smaller of that element and 0 to obtain the A1 matrix. A shared memory space of size 1024 is initialized in the GPU for storing the RPsum vector, and a shared memory space of size 1024 × 1024 for storing the A1 matrix. The vector RPsum is calculated in the GPU by reduction summation. 1024 × 1024 threads are allocated in the GPU; first 1024 × 512 threads perform the reduction summation, as in Fig. 4; then 1024 × 1024 threads perform the subtraction between the vector RPsum and the RP matrix; finally, 1024 × 1024 threads compare and replace the data units of the result matrix, and the result is written to the corresponding position of the A1 matrix.
Step eight: update the A matrix according to the formula A = a·A + (1-a)·A1. 1024 × 1024 threads are allocated in the GPU, each responsible for the scalar multiplication and addition of one data unit of the matrices A and A1, and the result is written back to the corresponding position of the A matrix.
Step nine: the above steps are run iteratively; the number of iterations is set to 200, after which the GPU kernel function exits, completing the GPU-parallelized part of the processing.
(4) Analysis of results
Table 2 shows the optimization achieved by the parallelization method of the present invention on the most time-consuming part of the AP clustering algorithm.
Table 2. Comparison of experimental run times
Sample size    GPU time (ms)    CPU time (ms)    Speedup
6K 0.54 56.23 104.13
12K 1.42 139.68 98.37
24K 5.98 563.63 94.25
48K 29.50 2113.37 71.64
96K 169.34 7793.39 46.02
It can be seen that the parallel GPU implementation of the algorithm is between 46 and 104 times faster than the CPU implementation. As the number of test samples increases, the speedup decreases, but it remains a 46-fold improvement, so the method maintains high performance even for large data volumes.
In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A GPU-based radar data AP clustering method, comprising: first calculating a similarity matrix S formed by the similarities between the data objects in a radar data set, and initializing an attraction matrix R and an attribution matrix A with the same dimensions as matrix S; then continuously updating matrix R and matrix A through iteration, adding matrix R and matrix A to obtain a matrix E, and finding the cluster centers from matrix E; when the cluster centers meet the set requirement, exiting the iteration, finally assigning the data objects in the data set to each cluster center and completing the clustering; characterized in that the process of updating matrix R and matrix A comprises the following steps:
step (1), adding matrix A and matrix S to obtain a matrix S1;
step (2), dividing the N data in each row of matrix S1 into i groups of j data each, where N = i × j and i and j are chosen to be as close in value as possible; allocating i × N threads in the GPU so that each thread is responsible for finding the maximum value, the second-largest value and the index of the maximum value within its group; then using N threads, one per row, to find the maximum value, the second-largest value and the index of the maximum value across the groups of each of the N rows, recorded as S1m, S1s and S1ma respectively;
step (3), subtracting the maximum value S1m from the S matrix row by row, and replacing the element value at the index position S1ma of each row of the resulting matrix with the value of S1s, to obtain a matrix R1;
step (4), updating matrix R according to the formula R = a·R + (1-a)·R1, where a is a weight taking a value between 0 and 1; the larger the value, the fewer the final cluster centers, so it is set according to requirements;
step (5), allocating N × N threads in the GPU, each thread being responsible for replacing one off-diagonal element of the R matrix with the larger of that element and 0, the resulting matrix being denoted RP;
step (6), summing each column of matrix RP to obtain a vector RPsum; subtracting, column by column, the element values of the corresponding column of RP from each entry of RPsum, to obtain a difference matrix of the same size as RP; and finally replacing every off-diagonal element of the difference matrix with the smaller of that element and 0, to obtain a matrix A1;
step (7), updating matrix A according to the formula A = a·A + (1-a)·A1;
step (8), repeating steps (1) to (7) until the cluster centers found from matrix E meet the set requirement, thereby completing the updating of matrix R and matrix A.
2. The GPU-based radar data AP clustering method of claim 1, wherein the radar data comprises radar PDW data, radar I/Q waveform data and radar time-frequency diagram data.
3. The GPU-based radar data AP clustering method as recited in claim 1 or 2, characterized in that a negative Euclidean distance is adopted to solve the similarity matrix S.
4. The GPU-based radar data AP clustering method as recited in claim 1 or 2, wherein the updating process of the matrix R and the matrix A is implemented in a global memory of the GPU.
5. The GPU-based radar data AP clustering method as recited in claim 1 or 2, wherein the updating process of the matrix R and the matrix A is implemented in a shared memory of the GPU.
6. A GPU-based radar data AP clustering method according to claim 1 or 2, characterized in that in step (6): N × N threads are allocated in the GPU; first N × N / 2 threads are used for the reduction summation, then N × N threads perform the matrix subtraction, and finally N × N threads replace the data units in the matrix.
7. A GPU-based radar data AP clustering method according to claim 1 or 2, characterized in that in step (7): N × N threads are allocated in the GPU, each responsible for the scalar multiplication and addition of one data unit of matrix A.
8. A GPU-based radar data AP clustering method according to claim 1 or 2, characterized in that in step (4), N × N threads are allocated in the GPU, each responsible for the scalar multiplication and addition of one data unit of matrix R.
9. A GPU-based radar data AP clustering method according to claim 1 or 2, characterized in that in step (1), N × N threads are allocated in the GPU to add matrix A and matrix S to obtain a matrix S1, so that each thread is responsible for the addition of one pair of data units and the N × N threads operate simultaneously; where N is the number of data objects in the radar data set.
10. A GPU-based radar data AP clustering method according to claim 1 or 2, characterized in that in step (3), N × N threads are allocated in the GPU, each thread first subtracting the maximum value S1m from the S matrix row by row; then N threads are reused to replace the element value at the S1ma index position of each row of the resulting matrix with the value of S1s, to obtain the matrix R1.
CN202110571635.7A 2021-05-25 2021-05-25 Radar data AP clustering method based on GPU Pending CN113283505A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110571635.7A CN113283505A (en) 2021-05-25 2021-05-25 Radar data AP clustering method based on GPU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110571635.7A CN113283505A (en) 2021-05-25 2021-05-25 Radar data AP clustering method based on GPU

Publications (1)

Publication Number Publication Date
CN113283505A true CN113283505A (en) 2021-08-20

Family

ID=77281695

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110571635.7A Pending CN113283505A (en) 2021-05-25 2021-05-25 Radar data AP clustering method based on GPU

Country Status (1)

Country Link
CN (1) CN113283505A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105118049A (en) * 2015-07-22 2015-12-02 东南大学 Image segmentation method based on super pixel clustering
CN105894847A (en) * 2016-06-27 2016-08-24 华南理工大学 Unsupervised learning real-time public transport dynamic scheduling system and unsupervised learning real-time public transport dynamic scheduling method in cloud platform environment
CN111767941A (en) * 2020-05-15 2020-10-13 上海大学 Improved spectral clustering and parallelization method based on symmetric nonnegative matrix factorization

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105118049A (en) * 2015-07-22 2015-12-02 东南大学 Image segmentation method based on super pixel clustering
CN105894847A (en) * 2016-06-27 2016-08-24 华南理工大学 Unsupervised learning real-time public transport dynamic scheduling system and unsupervised learning real-time public transport dynamic scheduling method in cloud platform environment
CN111767941A (en) * 2020-05-15 2020-10-13 上海大学 Improved spectral clustering and parallelization method based on symmetric nonnegative matrix factorization

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
XUAN SHI: "Parallelizing Affinity Propagation Using Graphics Processing Units for Spatial Cluster Analysis over Big Geospatial Data", 《PROC ANNU CONF GEOCOMPUT》 *
LI YAN ET AL.: "Unknown radar state recognition method in cognitive radar countermeasures", 《JOURNAL OF TERAHERTZ SCIENCE AND ELECTRONIC INFORMATION TECHNOLOGY》 *
SHAO SHUAI: "Parallel design and implementation of an AP clustering algorithm based on CUDA", 《CHINA MASTER'S THESES FULL-TEXT DATABASE (ELECTRONIC JOURNAL), INFORMATION SCIENCE AND TECHNOLOGY SERIES》 *
CHEN YANYANG ET AL.: "GPU-accelerated adaptive affinity propagation clustering method", 《COMPUTER SYSTEMS & APPLICATIONS》 *

Similar Documents

Publication Publication Date Title
US8140585B2 (en) Method and apparatus for partitioning and sorting a data set on a multi-processor system
CN108170639B (en) Tensor CP decomposition implementation method based on distributed environment
US11836489B2 (en) Sparse matrix calculations utilizing tightly coupled memory and gather/scatter engine
JP7087825B2 (en) Learning device and learning method
JP7099163B2 (en) Learning device and learning method
Man et al. The approximate string matching on the hierarchical memory machine, with performance evaluation
CN113283505A (en) Radar data AP clustering method based on GPU
Ohno et al. SPH-based fluid simulation on GPU using verlet list and subdivided cell-linked list
JP7363145B2 (en) Learning device and learning method
Gallet et al. Leveraging GPU tensor cores for double precision Euclidean distance calculations
Chatterjee et al. Data structures and algorithms for counting problems on graphs using gpu
Slimani et al. K-MLIO: enabling k-means for large data-sets and memory constrained embedded systems
Giannoula et al. Accelerating Graph Neural Networks on Real Processing-In-Memory Systems
JP2020027451A (en) Learning device and learning method
US11989257B2 (en) Assigning processing threads for matrix-matrix multiplication
Sankaran et al. Performance comparison for scientific computations on the edge via relative performance
JP2022077510A (en) Cluster-connected neural network
CN114527435A (en) Interference resource allocation method based on interference vector and NSGA-II algorithm
Feldkamp et al. Explainable AI For Data Farming Output Analysis: A Use Case for Knowledge Generation Through Black-Box Classifiers
CN113986816A (en) Reconfigurable computing chip
Xu et al. Accelerating cryo-EM Reconstruction of RELION on the New Sunway Supercomputer
de Camargo A multi-GPU algorithm for communication in neuronal network simulations
US9600446B2 (en) Parallel multicolor incomplete LU factorization preconditioning processor and method of use thereof
Langr et al. Analysis of memory footprints of sparse matrices partitioned into uniformly-sized blocks
KR20180116996A (en) 3-dimensional face frontalization system and method

Legal Events

Code  Title / Description
PB01  Publication
SE01  Entry into force of request for substantive examination
RJ01  Rejection of invention patent application after publication (application publication date: 20210820)