CN105955825B - Method for optimizing astronomy software gridding - Google Patents


Info

Publication number
CN105955825B
CN105955825B
Authority
CN
China
Prior art keywords
mic
cpu
memory
grid
load balancing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610303402.8A
Other languages
Chinese (zh)
Other versions
CN105955825A (en)
Inventor
刘刚
郭聪
梁文栋
毛睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN201610303402.8A priority Critical patent/CN105955825B/en
Publication of CN105955825A publication Critical patent/CN105955825A/en
Application granted granted Critical
Publication of CN105955825B publication Critical patent/CN105955825B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a method for optimizing astronomy software gridding, which comprises the following steps: realizing memory-partitioned parallelism; and performing load balancing between the CPU and the MIC. The method of the embodiment of the invention greatly improves the running efficiency of gridding in the SKA project and is of important significance to the SKA project.

Description

Method for optimizing astronomy software gridding
Technical Field
The invention relates to the technical field of computer application, in particular to a method for optimizing astronomy software gridding.
Background
Intel Corporation proposed the Many Integrated Core (MIC) architecture, a general-purpose many-core coprocessor. Unlike other accelerator platforms, the MIC uses a shared-memory architecture, which makes it easy to program.
gridding is a very important step, arguably the most important step, in the data processing pipeline of the Square Kilometre Array (SKA) radio telescope. To generate a sky image, a series of operations must be performed on the data collected by the radio telescope; because the data produced by the telescope are irregularly sampled, they must first be mapped onto a regular two-dimensional grid before a Fourier transform can be applied. At present, the speed of the serial version of gridding falls short of what is required.
Disclosure of Invention
In order to solve the existing technical problem, the embodiment of the invention provides a method for optimizing gridding.
In order to achieve the above purpose, the technical solution of the embodiment of the present invention is realized as follows:
a method of optimizing astronomy software gridding, the method comprising:
memory partition parallelism is realized;
and carrying out load balancing processing on the CPU and the MIC.
The realizing of memory partition parallelism includes: dividing the two-dimensional grid into four regions representing different types, dividing the memory into a plurality of memory blocks, and allocating the memory blocks to the four regions respectively, wherein grid represents a two-dimensional vector whose size is a constant determined by the characteristics of the telescope; and establishing a plurality of threads mapped one-to-one to the memory blocks, so that each thread reads and writes exactly one memory block, and each memory block is read and written by exactly one thread.
An offload mode with the CPU as host and the MIC as coprocessor is selected to realize load balancing between the CPU and the MIC.
Preferably, the CPU handles 78% of the operations and the MIC handles 22%.
Wherein, the load balancing of the CPU and the MIC further comprises: the SSE instruction set is used for optimization and vectorization of data types.
Wherein, the load balancing of the CPU and the MIC further comprises: asynchronous transfer is implemented using transfer statements.
After the implementing of the memory partitioning parallel algorithm and before the load balancing of the CPU and the MIC, the method further includes: data preprocessing and heterogeneous processing.
The embodiment of the invention provides a MIC-based method for optimizing gridding, which optimizes the gridding program in the SKA project on top of the MIC. The optimization measures include the memory partitioning algorithm, optimization and vectorization of data structures and memory use, a parallel algorithm based on the MIC hardware, and compiler-level optimization based on the Intel compiler. The final average speedup reaches 40 times, greatly improving the running efficiency of gridding in the SKA project, which is of important significance to the SKA project.
Drawings
In the drawings, which are not necessarily drawn to scale, like reference numerals may describe similar components in different views. Like reference numerals having different letter suffixes may represent different examples of similar components. The drawings illustrate generally, by way of example, but not by way of limitation, various embodiments discussed herein.
FIG. 1 is a schematic flow chart of a gridding optimization method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating the function of the grid according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an example of memory partitioning according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating the data contention phenomenon arising in the memory allocation of FIG. 3.
Detailed Description
The main idea of the embodiment of the invention is as follows: based on the MIC, gridding is optimized in two aspects, memory-partitioned parallelism and CPU & MIC load balancing.
As shown in fig. 1, the MIC-based method for optimizing gridding in the embodiment of the present invention mainly includes the following steps:
step 101: memory partition parallelism is realized;
step 102: and carrying out load balancing processing on the CPU and the MIC.
In step 101, memory partitioning is performed for the gridding program, data contention is eliminated, the boundary effect is controlled, and parallel read-write by multiple threads is realized. The process of realizing memory partition parallelism in this step comprises: eliminating data contention, dividing redundant memory, and establishing the correspondence between threads and memory partitions, thereby realizing parallelism. Specifically, the two-dimensional grid is divided into four regions representing different types, the memory is divided into a plurality of memory blocks, and the memory blocks are allocated to the four regions respectively, where grid represents a two-dimensional vector whose size is a constant determined by the characteristics of the telescope; a plurality of threads are established and mapped one-to-one to the memory blocks, so that each thread reads and writes exactly one memory block, and each memory block is read and written by exactly one thread.
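As a minimal sketch of this partitioning scheme (the struct and function names are illustrative, not from the patent), the mapping of 16 threads onto 4 redundant grid copies times 4 quadrant regions, and the routing of a grid coordinate to its quadrant, can be expressed as:

```cpp
#include <cstddef>

// Illustrative sketch: the gSize x gSize grid is kept in four redundant
// copies (grid_1..grid_4); each copy is split into four quadrant regions,
// and thread t (1..16) owns exactly one (copy, region) pair, matching the
// layout of Table 2 in the text.
struct Block {
    int gridCopy;  // which redundant grid copy (1..4)
    int region;    // quadrant: 0 = upper left, 1 = upper right,
                   //           2 = lower left, 3 = lower right
};

// Threads 1-4 work on grid_1, threads 5-8 on grid_2, and so on.
Block blockOfThread(int thread /* 1-based */) {
    return Block{ (thread - 1) / 4 + 1, (thread - 1) % 4 };
}

// Quadrant that a grid coordinate falls into, used to route a sample to
// the thread owning the matching region of a grid copy.
int regionOf(std::size_t x, std::size_t y, std::size_t gSize) {
    int col = (x < gSize / 2) ? 0 : 1;
    int row = (y < gSize / 2) ? 0 : 1;
    return row * 2 + col;
}
```

Under this mapping no two threads ever touch the same (copy, region) pair, which is exactly the exclusivity property the text requires.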
In step 102, a heterogeneous CPU & MIC design is adopted, asynchronous transmission and data vectorization are performed on the MIC, and the load balance between the CPU and the MIC is controlled. This specifically comprises: using the offload mode of the MIC; realizing asynchronous transmission with the offload_transfer statement; and using the SSE instruction set for data type optimization and vectorization. In practical application, an offload mode with the CPU as host and the MIC as coprocessor is adopted to realize load balancing between the CPU and the MIC.
After step 101 and before step 102, the method may further include steps of data preprocessing and heterogeneous processing. Specifically, the preprocessing includes data type conversion, for example converting STL complex data into plain double data for transmission, and format conversion, for example converting an array into a pointer.
Further, the data preprocessing and the heterogeneous processing may further include: a data structure is established.
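The type-conversion part of the preprocessing can be sketched as follows. This assumes the samples are held as std::complex<double> and simply splits them into separate real and imaginary double arrays for transfer; it is a simplification of whatever layout the original program uses:

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Sketch of the preprocessing step (assumed data shape): split
// std::complex<double> values into two contiguous double arrays so the
// data can be transferred to the MIC and later treated as plain doubles.
void splitComplex(const std::vector<std::complex<double>>& in,
                  std::vector<double>& re, std::vector<double>& im) {
    re.resize(in.size());
    im.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        re[i] = in[i].real();  // real parts, contiguous
        im[i] = in[i].imag();  // imaginary parts, contiguous
    }
}
```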
The following is a detailed description of the specific implementation process for optimizing gridding according to an embodiment of the present invention.
The process of establishing the data structure in the embodiment of the invention is as follows:
The data structure may be set according to the following Table 1 and specifically includes five variables: samples, C, grid, samples.iu/iv/coffset, and samples.data. Here the cube C represents a five-dimensional vector, grid represents a two-dimensional vector of gSize × gSize, and the values of sSize, gSize, etc. are constants determined by the characteristic input of the telescope.
Variable                Type            Size
samples                 struct vector   nSamples
C                       double vector   C.size()
grid                    double vector   gSize * gSize
samples.iu/iv/coffset   integer         1
samples.data            double          1

TABLE 1
Here, a five-dimensional vector with an actual size of [129, 129, 8, 8, 33] is regarded as a cube C whose base is sSize × sSize (129 × 129) and whose height is 8 × 8 × 33. grid represents a two-dimensional vector of gSize × gSize, which can be viewed as a plane.
Two head pointers are set: gind (index into grid) and cind (index into C). gind is the head pointer of an sSize × sSize square region of grid, and cind is the head pointer of an sSize × sSize plane in the cube C. Through these two head pointers, each sSize × sSize memory plane of the cube C is multiplied by the sample value d and accumulated into the corresponding square memory region inside grid. Here grid is a regular two-dimensional grid, and its function is shown in fig. 2, where double *gptr = &grid[gind]; double *cptr = &C[cind]; gptr (pointer into grid) and cptr (pointer into C) are the working pointers. The values of sSize, gSize, etc. are constants determined by the characteristic input of the telescope.
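A minimal sketch of this accumulation (index handling simplified; the function name gridOneSample is illustrative, not from the patent): one sSize × sSize plane of C, scaled by the sample value d, is added into the square of grid that starts at gind:

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch of the per-sample accumulation: for one sample with
// value d, the sSize x sSize plane of C starting at cind is scaled by d
// and accumulated into the sSize x sSize square of grid starting at gind.
// Boundary handling and the real index computation are omitted.
void gridOneSample(std::vector<double>& grid, std::size_t gSize, std::size_t gind,
                   const std::vector<double>& C, std::size_t cind,
                   std::size_t sSize, double d) {
    for (std::size_t row = 0; row < sSize; ++row) {
        double* gptr = &grid[gind + row * gSize];     // row inside the grid square
        const double* cptr = &C[cind + row * sSize];  // matching row of the C plane
        for (std::size_t col = 0; col < sSize; ++col)
            gptr[col] += d * cptr[col];               // accumulate d * C into grid
    }
}
```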
The specific implementation process of the memory partitioning parallel algorithm in the embodiment of the invention is as follows:
In order to solve the problem of data contention, the basic idea of the memory partitioning parallel algorithm of the embodiment of the present invention is to divide the grid into four regions representing different types; assuming the memory is divided into 16 memory blocks, the memory blocks correspond to the four regions respectively, and the memory allocation is shown in fig. 3. Block 1 belongs to the first type. If the point gind falls near the right or lower boundary, part of the sSize × sSize square corresponding to that gind necessarily falls inside blocks 2 to 4, as shown in fig. 4. In that case the dark gray parts of blocks 2 to 4 must not be read or written; otherwise the same memory could be read and written by different threads at the same time, and data contention would occur.
To avoid data contention, the memory partitioning parallel algorithm of the embodiment of the present invention further maps each thread one-to-one to a unique memory block, so that each thread reads and writes exactly one memory block and each memory block is read and written by exactly one thread. For example, thread No. 1 reads and writes the upper-left region of the 1st memory block and thread No. 2 reads and writes its upper-right region, the corresponding grid being called the first type grid_1; in this way the memory redundancy algorithm is realized, and the reads and writes of the second, third and fourth types grid_2, grid_3 and grid_4 are likewise guaranteed. In practical application, memory of four times the grid size is preferably allocated.
An example of thread and memory allocation is shown in Table 2 below:
Thread No.   Grid type   Region        Memory block No.
1            grid_1      upper left    1
2            grid_1      upper right   1
3            grid_1      lower left    1
4            grid_1      lower right   1
5            grid_2      upper left    2
6            grid_2      upper right   2
7            grid_2      lower left    2
8            grid_2      lower right   2
9            grid_3      upper left    3
10           grid_3      upper right   3
11           grid_3      lower left    3
12           grid_3      lower right   3
13           grid_4      upper left    4
14           grid_4      upper right   4
15           grid_4      lower left    4
16           grid_4      lower right   4

TABLE 2
In this way multiple threads can read and write one grid without data contention, so parallelization is realized. In practical application, the memory partitioning method requires the correspondence between thread numbers and memory blocks to be established and stored in advance, and a final accumulation and combination step is needed.
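The final accumulation-and-combination step can be sketched as an element-wise sum of the redundant grid copies (a simplified stand-in for the actual merge; the function name is illustrative):

```cpp
#include <cstddef>
#include <vector>

// Illustrative merge step: once all threads have finished, the redundant
// copies grid_1..grid_4 are summed element-wise into the single output grid.
std::vector<double> mergeGrids(const std::vector<std::vector<double>>& copies) {
    std::vector<double> out(copies.front().size(), 0.0);
    for (const auto& g : copies)
        for (std::size_t i = 0; i < g.size(); ++i)
            out[i] += g[i];  // element-wise sum of the redundant grids
    return out;
}
```

Because each grid point was only ever written by one thread within each copy, this sum is the only point where contributions from different threads meet, so no locking is needed anywhere.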
In the embodiment of the invention, the steps of data preprocessing and heterogeneous processing are realized as follows: the original algorithm is ported to the MIC, and the offload_transfer statement is used for asynchronous transmission; part of the data to be transmitted to the MIC is processed in advance, and asynchronous transfer is thereby achieved. In practical application, the if clause of the offload (target) statement may be used to make the MIC and the CPU run heterogeneously.
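Since offload_transfer is specific to the Intel compiler, the following sketch only mimics the same overlap pattern with portable std::async: chunk k+1 is "transferred" (here merely copied) in the background while chunk k is processed. It is an analogy for the double-buffering idea, not the patent's actual offload code:

```cpp
#include <future>
#include <numeric>
#include <cstddef>
#include <vector>

// Analogy for asynchronous transfer: overlap the "transfer" of the next
// data chunk with the computation on the current one, as offload_transfer
// with signal/wait clauses would do on a real MIC card.
double processWithOverlap(const std::vector<std::vector<double>>& chunks) {
    double total = 0.0;
    auto prepare = [](std::vector<double> c) { return c; };  // stands in for the transfer
    std::future<std::vector<double>> next =
        std::async(std::launch::async, prepare, chunks[0]);
    for (std::size_t k = 0; k < chunks.size(); ++k) {
        std::vector<double> cur = next.get();      // wait for chunk k to arrive
        if (k + 1 < chunks.size())                 // start moving chunk k+1 now
            next = std::async(std::launch::async, prepare, chunks[k + 1]);
        total += std::accumulate(cur.begin(), cur.end(), 0.0);  // compute on chunk k
    }
    return total;
}
```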
In the embodiment of the invention, the specific implementation process of the CPU & MIC load balancing processing is as follows:
Through the offload mode with the CPU as host and the MIC as coprocessor, the MIC is used to improve parallel efficiency. Meanwhile, the optimal static load balance was determined through testing, so that the performance of both the CPU and the MIC is exploited to the greatest extent and the best optimization effect is obtained. The results are shown in Table 3.
CPU work share (%)   Time (s)   Rate (million grid points/s)
0                    11.15      4775.89
25                   9.51       5599.5
37.5                 8.77       6071.97
50                   7.75       6871.12
62.5                 6.96       7651.03
72                   6.17       8630.66
75                   5.99       8890.02
78                   5.61       9492.19
81                   5.91       9010.36
87.5                 6.36       8372.83
100                  6.17       8630.66

TABLE 3
Through repeated measurements, the static load-balance point is found to lie at about a 78% CPU share of the work. Therefore, in the embodiment of the present invention, it is preferable that the CPU handle 78% of the operations and the MIC 22% to achieve the optimal static load balance.
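The static 78/22 split can be sketched as a simple partition of the sample index range. Integer percentages are used here to avoid floating-point rounding at the boundary; the helper name is illustrative:

```cpp
#include <cstddef>

// Illustrative static split: the first cpuPercent% of the samples are
// processed on the CPU, the remainder are offloaded to the MIC.
std::size_t cpuShare(std::size_t nSamples, unsigned cpuPercent = 78) {
    return nSamples * cpuPercent / 100;  // samples [0, cpuShare) -> CPU
}
```

With nSamples = 3200000 this assigns 2496000 samples to the CPU and 704000 to the MIC, matching the 78/22 balance point found above.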
In the embodiment of the present invention, the data types must also be optimized and vectorized during load balancing. As can be seen from the data structure above, the main data type is complex, which makes vectorization exceptionally difficult: this special data structure defeats automatic vectorization, so when vectorizing with SIMD statements, the real part and the imaginary part of each complex number are first extracted and stored in corresponding double arrays, and then the vectorization containing 4 double multiplications is carried out. The SSE instruction set is preferably used for this optimization and vectorization of data types.
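A portable sketch of the split-array complex multiply follows, written as plain loops that a compiler can auto-vectorize with SSE rather than as explicit intrinsics. Each complex product costs the four double multiplications mentioned above; the function name is illustrative:

```cpp
#include <cstddef>
#include <vector>

// Illustrative split-array complex multiply: with real and imaginary parts
// in separate double arrays, each complex product (a * b) is expressed as
// the 4 double multiplies the text refers to, in a loop that SIMD units
// can process several elements at a time.
void mulSplitComplex(const std::vector<double>& ar, const std::vector<double>& ai,
                     const std::vector<double>& br, const std::vector<double>& bi,
                     std::vector<double>& cr, std::vector<double>& ci) {
    cr.resize(ar.size());
    ci.resize(ar.size());
    for (std::size_t i = 0; i < ar.size(); ++i) {
        cr[i] = ar[i] * br[i] - ai[i] * bi[i];  // real part: 2 multiplies
        ci[i] = ar[i] * bi[i] + ai[i] * br[i];  // imag part: 2 multiplies
    }
}
```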
The optimization process of the embodiment of the invention is calculated as follows:
nSamples = 3200000 (total number of samples, which determines the number of iterations of the outermost loop); wSize = 33; nChan = 1; gSize = 4096; baseline = 2000; cellSize = 5.000000. These variables affect the values of cind, gind and the other function variables; the first three may be modified arbitrarily.
Test result notation: Time(s) denotes the gridKernel execution time; Rate (million grid points/s) denotes the number of grid points processed per second, a higher rate indicating a faster program.
Correctness verification (L1 norm): the execution result of gridKernel is compared with the standard result grid_std.dat by the program verify; if the L1-norm difference is less than 1e-12, the result is accepted.
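The L1-norm check can be sketched as follows; verifyL1 and the default tolerance are illustrative, following the 1e-12 threshold stated above:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative correctness check: sum the absolute element-wise differences
// between the computed grid and the reference grid_std; accept the result
// when the L1 norm of the difference is below the tolerance.
bool verifyL1(const std::vector<double>& grid, const std::vector<double>& ref,
              double tol = 1e-12) {
    double l1 = 0.0;
    for (std::size_t i = 0; i < grid.size(); ++i)
        l1 += std::fabs(grid[i] - ref[i]);  // accumulate |difference|
    return l1 < tol;
}
```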
For the optimization results, the gridding operation was run on a random sample of size 3200000 on the experimental platform, the system time was measured, and the result was checked with the L1 norm to ensure the error stays within the allowed range. Through continuous improvement of the optimization algorithm, the following results were obtained, as shown in Table 4:
Version                         Time (s)   Rate (million grid points/s)   Speedup
Serial                          229.37     232.163                        1
Memory partitioning algorithm   26.17      2034.82                        8.76
MIC + CPU                       5.61       9492.19                        40.89

TABLE 4
The embodiment of the invention optimizes the gridding program in the SKA project on the basis of the MIC. It uses several optimization measures, including the memory partitioning algorithm, optimization and vectorization of data structures and memory use, a parallel algorithm based on the MIC hardware, and compiler-level optimization based on the Intel compiler, and finally performs MPI parallel processing across multiple nodes, so that the final average speedup reaches 40 times, which is of important significance to the SKA project.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims (6)

1. A method of optimizing astronomy software gridding, the method comprising:
memory partition parallelism is realized;
carrying out load balancing processing on a CPU and an integrated many-core architecture MIC;
the method for realizing the memory partition parallelism comprises the following steps:
dividing a two-dimensional grid into four regions representing different types, dividing a memory into a plurality of memory blocks, and respectively allocating the memory blocks to the four regions, wherein the grid represents a two-dimensional vector, and the size of the two-dimensional vector is a constant determined according to the characteristics of a telescope;
establishing a plurality of threads mapped one-to-one to the memory blocks, so that each thread reads and writes exactly one memory block, and each memory block is read and written by exactly one thread.
2. The method according to claim 1, wherein an offload mode with a CPU as a main mode and a MIC as an auxiliary mode is adopted to achieve load balancing between the CPU and the MIC.
3. The method of claim 2, wherein the CPU operation percentage is 78% and the MIC operation percentage is 22%.
4. The method according to any of claims 1-3, wherein load balancing the CPU and the MIC further comprises: the SSE instruction set is used for optimization and vectorization of data types.
5. The method according to any of claims 1-3, wherein load balancing the CPU and the MIC further comprises: asynchronous transfer is implemented using transfer statements.
6. The method of claim 1, after the implementing a memory partitioning parallelism algorithm and before the load balancing the CPU and the integrated many-core architecture MIC, the method further comprising: data preprocessing and heterogeneous processing.
CN201610303402.8A 2016-05-09 2016-05-09 Method for optimizing astronomy software gridding Active CN105955825B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610303402.8A CN105955825B (en) 2016-05-09 2016-05-09 Method for optimizing astronomy software gridding


Publications (2)

Publication Number Publication Date
CN105955825A CN105955825A (en) 2016-09-21
CN105955825B true CN105955825B (en) 2020-07-10

Family

ID=56914325

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610303402.8A Active CN105955825B (en) 2016-05-09 2016-05-09 Method for optimizing astronomy software gridding

Country Status (1)

Country Link
CN (1) CN105955825B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106598552A (en) * 2016-12-22 2017-04-26 郑州云海信息技术有限公司 Data point conversion method and device based on Gridding module

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279391A (en) * 2013-06-09 2013-09-04 浪潮电子信息产业股份有限公司 Load balancing optimization method based on CPU (central processing unit) and MIC (many integrated core) framework processor cooperative computing
CN103530132A (en) * 2013-10-29 2014-01-22 浪潮电子信息产业股份有限公司 Method for transplanting CPU (central processing unit) serial programs to MIC (microphone) platform
CN104375838A (en) * 2014-11-27 2015-02-25 浪潮电子信息产业股份有限公司 OpenMP (open mesh point protocol) -based astronomy software Griding optimization method


Also Published As

Publication number Publication date
CN105955825A (en) 2016-09-21


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant