CN105955825B - Method for optimizing astronomy software gridding - Google Patents
- Publication number
- CN105955825B (application CN201610303402.8A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5011—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
- G06F9/5016—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
Abstract
The invention discloses a method for optimizing astronomy software gridding, which comprises the following steps: realizing memory partition parallelism; and carrying out load balancing processing on the CPU and the MIC. The method of the embodiment of the invention greatly improves the running efficiency of gridding in the SKA project and is of important significance for the SKA project.
Description
Technical Field
The invention relates to the technical field of computer application, in particular to a method for optimizing astronomy software gridding.
Background
Intel corporation proposed the Many Integrated Core (MIC) architecture, a general-purpose many-core coprocessor. Unlike other accelerator platforms, the MIC uses a shared-memory architecture, which makes it comparatively easy to program.
Gridding is a very important step, arguably one of the most important, in the data processing pipeline of the Square Kilometre Array (SKA) radio telescope. To generate a sky image, a series of operations must be performed on the data collected by the radio telescope; because the data produced by the telescope are irregular, they must first be mapped onto a regular two-dimensional grid before a Fourier transform can be applied. At present, the serial version of gridding cannot reach the desired speed.
Disclosure of Invention
In order to solve the above technical problem, the embodiment of the invention provides a method for optimizing gridding.
In order to achieve the above purpose, the technical solution of the embodiment of the present invention is realized as follows:
a method of optimizing astronomy software gridding, the method comprising:
memory partition parallelism is realized;
and carrying out load balancing processing on the CPU and the MIC.
The realizing of memory partition parallelism includes: dividing the two-dimensional grid into four regions representing different types, dividing the memory into a plurality of memory blocks, and allocating the memory blocks to the four regions respectively, wherein the grid represents a two-dimensional vector whose size is a constant determined by the characteristics of the telescope; and establishing a plurality of threads in one-to-one correspondence with the memory blocks, so that each thread can read and write only one memory block, and each memory block can be read and written by one and only one thread.
The load balancing processing of the CPU and the MIC includes: selecting an offload mode with the CPU as host and the MIC as coprocessor, so as to realize load balance between the CPU and the MIC.
The CPU operation proportion is 78%, and the MIC operation proportion is 22%.
Wherein, the load balancing of the CPU and the MIC further comprises: the SSE instruction set is used for optimization and vectorization of data types.
Wherein, the load balancing of the CPU and the MIC further comprises: asynchronous transfer is implemented using offload_transfer statements.
After the implementing of the memory partitioning parallel algorithm and before the load balancing of the CPU and the MIC, the method further includes: data preprocessing and heterogeneous processing.
The embodiment of the invention provides a MIC-based method for optimizing gridding, which optimizes the gridding program in the SKA project on the basis of the MIC. The optimization measures include a memory partitioning algorithm, optimization and vectorization of the data structures and memory use, a parallel algorithm based on the MIC hardware, and compiler-level optimization based on the Intel compiler. The final average speedup reaches 40 times, which greatly improves the running efficiency of gridding in the SKA project and is of important significance for the SKA project.
Drawings
In the drawings, which are not necessarily drawn to scale, like reference numerals may describe similar components in different views. Like reference numerals having different letter suffixes may represent different examples of similar components. The drawings illustrate generally, by way of example, but not by way of limitation, various embodiments discussed herein.
FIG. 1 is a schematic flow chart of a gridding optimization method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating grid function functionality according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an example of memory partitioning according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating the data contention phenomenon in the partitioning of FIG. 3.
Detailed Description
The main idea of the embodiment of the invention is as follows: on the basis of the MIC, optimize gridding in two respects, namely memory partition parallelism and CPU and MIC load balancing.
As shown in FIG. 1, the MIC-based method for optimizing gridding in the embodiment of the present invention mainly includes the following steps:
step 101: memory partition parallelism is realized;
step 102: and carrying out load balancing processing on the CPU and the MIC.
In step 101, memory partitioning is performed for the gridding program: data contention is eliminated, the boundary effect is controlled, and parallel multi-threaded read-write processing is realized. The process of realizing memory partition parallelism in this step includes eliminating data contention, dividing redundant memory, and establishing the correspondence between threads and memory partitions, thereby achieving parallelism. Specifically, the two-dimensional grid is divided into four regions representing different types, the memory is divided into a plurality of memory blocks, and the memory blocks are allocated to the four regions respectively; the grid represents a two-dimensional vector whose size is a constant determined by the characteristics of the telescope. A plurality of threads are established in one-to-one correspondence with the memory blocks, so that each thread can read and write only one memory block, and each memory block can be read and written by one and only one thread.
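As a rough illustration, the thread-to-block mapping described above can be sketched as follows. The function and type names are ours, not from the patent; the patent only fixes the counts: four redundant grid copies, four regions, sixteen blocks and threads (see Table 2 below, which uses 1-based thread numbers).

```cpp
#include <cassert>

// Four redundant copies of the grid (grid_1..grid_4), each split into four
// quadrant regions, give 16 memory blocks that map one-to-one onto 16
// threads, so no two threads ever read or write the same block.
constexpr int kQuadrants = 4;  // upper left, upper right, lower left, lower right

struct BlockId {
    int copy;      // which redundant grid copy (0..3)
    int quadrant;  // which quadrant of that copy (0..3)
};

// Thread t (0..15) owns exactly one block: threads 0-3 work on grid_1,
// threads 4-7 on grid_2, and so on, one quadrant each.
BlockId blockForThread(int t) {
    return BlockId{t / kQuadrants, t % kQuadrants};
}
```

Each thread then touches only its own (copy, quadrant) block, and a final pass accumulates the four copies into the result grid.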
In step 102, a heterogeneous CPU and MIC design is adopted: asynchronous transfer and data vectorization are performed on the MIC, and the load balance between the CPU and the MIC is controlled. This specifically includes: using the offload mode of the MIC; realizing asynchronous transfer with offload_transfer statements; and using the SSE instruction set for data-type optimization and vectorization. In practical application, an offload mode with the CPU as host and the MIC as coprocessor is adopted to realize load balancing between the CPU and the MIC.
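A minimal sketch of this host/coprocessor split is shown below. The Intel offload pragmas are kept as comments because they are specific to the Intel compiler; without them the code simply runs both shares on the CPU. The 78%/22% split, the toy gridding loop, and all names (`cpuShare`, `gridAll`) are illustrative assumptions, not taken verbatim from the patent.

```cpp
#include <cassert>
#include <cstddef>

// Number of samples the CPU keeps under a static percentage split
// (78% CPU / 22% MIC is the balance point found later in the description).
std::size_t cpuShare(std::size_t n, unsigned cpuPercent = 78) {
    return n * cpuPercent / 100;
}

// Toy stand-in for the gridding kernel: the sample range is split, the tail
// share would be offloaded to the MIC asynchronously, and the CPU works on
// its own share in the meantime.
void gridAll(const double* samples, std::size_t n,
             double* grid, std::size_t gridLen) {
    std::size_t split = cpuShare(n);
    // #pragma offload target(mic:0) in(samples[split:n-split])
    //                 inout(grid[0:gridLen]) signal(samples)
    for (std::size_t i = split; i < n; ++i)   // MIC's share (22%)
        grid[i % gridLen] += samples[i];
    for (std::size_t i = 0; i < split; ++i)   // CPU's share (78%), concurrently
        grid[i % gridLen] += samples[i];
    // #pragma offload_wait target(mic:0) wait(samples)
}
```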
After step 101 and before step 102, the method may further include steps of data preprocessing and heterogeneous processing. Specifically, the preprocessing includes data type conversion, for example converting STL complex data into double data for transmission, and format conversion, for example converting an array into a pointer.
Further, the data preprocessing and heterogeneous processing may also include: establishing a data structure.
The following is a detailed description of a specific implementation process of the gridding optimization method according to an embodiment of the present invention.
The process of establishing the data structure in the embodiment of the invention is as follows:
The data structure may be set according to Table 1 below, and includes five variables: the sample set samples, the cube C, the two-dimensional grid (grid), and the per-sample index and data fields. Here cube C represents a five-dimensional vector, grid represents a two-dimensional vector of gSize × gSize, and the values of sSize, gSize, etc. are constants determined by the characteristic input of the telescope.
Variable | Type | Size
---|---|---
samples | Structure vector | nSample
C | Double vector | C.Size()
grid | Double vector | gSize*gSize
samples.iu/iv/… | Integer | 1
 | Double | 1

TABLE 1
Here, a five-dimensional vector with actual size [129, 129, 8, 8, 33] is regarded as a cube C whose bottom surface is sSize × sSize (129 × 129) and whose height is 8 × 8 × 33. grid represents a two-dimensional vector of gSize × gSize, which can be viewed as a plane.
Two head pointers are set, namely gind (index of grid) and cind (index of C): gind is the head pointer of an sSize × sSize square region of grid, and cind is the head pointer of an sSize × sSize plane in the cube C. Through these two head pointers, the square memory region of a plane in the cube C is multiplied by d and accumulated into the square memory region inside grid. Here, grid is a regular two-dimensional grid whose function is shown in FIG. 2, with double *gptr = &grid[gind]; double *cptr = &C[cind]; where gptr (pointer of grid) and cptr (pointer of C) are pointers.
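The accumulation through the two head pointers can be sketched as a small helper under the assumptions above: `gStride` is the row stride of grid (i.e. gSize), and the function name `accumulatePlane` is ours, not the patent's.

```cpp
#include <cassert>
#include <cstddef>

// Accumulate one sSize x sSize plane of the cube C, scaled by d, into the
// sSize x sSize square of grid that starts at flat index gind.
void accumulatePlane(double* grid, std::size_t gind, std::size_t gStride,
                     const double* C, std::size_t cind,
                     std::size_t sSize, double d) {
    double* gptr = &grid[gind];     // head pointer into grid
    const double* cptr = &C[cind];  // head pointer into the C plane
    for (std::size_t row = 0; row < sSize; ++row)
        for (std::size_t col = 0; col < sSize; ++col)
            gptr[row * gStride + col] += cptr[row * sSize + col] * d;
}
```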
The specific implementation process of the memory partitioning parallel algorithm in the embodiment of the invention is as follows:
In order to solve the problem of data contention, the basic idea of the memory partitioning parallel algorithm of the embodiment of the present invention is to divide grid into four regions representing different types; assume the memory is divided into 16 memory blocks corresponding to the four regions, with the memory allocation shown in FIG. 3. Block 1 is of the first type; if the point gind falls near the right or lower boundary, then part of the sSize × sSize region corresponding to that gind will necessarily fall within blocks 2-4, as shown in FIG. 4. The dark gray parts of blocks 2-4 must then not be read or written, as otherwise the same memory could be read and written by different threads at the same time, producing data contention.
In this embodiment of the present invention, to avoid data contention, the memory partitioning parallel algorithm further includes: placing each thread in one-to-one correspondence with a unique memory block, so that each thread can read and write only one memory block, and each memory block can be read and written by one and only one thread. For example, if thread No. 1 reads and writes the upper-left block shown in FIG. 3 and thread No. 2 reads and writes the upper-right block, the grid copy they share may be called the first type, grid_1; this realizes the memory redundancy algorithm, and reading and writing of the second, third, and fourth types grid_2, grid_3, and grid_4 can likewise be guaranteed. In practical application, memory of four times the grid size is preferably allocated.
An example of thread and memory allocation is shown in Table 2 below:
Thread number | grid type | Region | Read-write
---|---|---|---
1 | grid_1 | Upper left | 1
2 | grid_1 | Upper right | 1
3 | grid_1 | Lower left | 1
4 | grid_1 | Lower right | 1
5 | grid_2 | Upper left | 2
6 | grid_2 | Upper right | 2
7 | grid_2 | Lower left | 2
8 | grid_2 | Lower right | 2
9 | grid_3 | Upper left | 3
10 | grid_3 | Upper right | 3
11 | grid_3 | Lower left | 3
12 | grid_3 | Lower right | 3
13 | grid_4 | Upper left | 4
14 | grid_4 | Upper right | 4
15 | grid_4 | Lower left | 4
16 | grid_4 | Lower right | 4

TABLE 2
In the embodiment of the invention, multiple threads can thus read and write the grid while the problem of data contention is avoided, so that parallelization is realized. In practical application, the memory partitioning method needs to establish and store the correspondence between thread numbers and memory blocks in advance, and finally requires an accumulation and merging step.
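The final accumulation and merging step mentioned above might look like this minimal sketch: the four per-copy grids are summed element-wise into one result grid (`mergeGrids` is an illustrative name; all copies are assumed to have equal size).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum the redundant grid copies (grid_1..grid_4) element-wise into one grid.
std::vector<double> mergeGrids(const std::vector<std::vector<double>>& copies) {
    std::vector<double> result(copies.front().size(), 0.0);
    for (const auto& g : copies)
        for (std::size_t i = 0; i < result.size(); ++i)
            result[i] += g[i];
    return result;
}
```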
In the embodiment of the invention, the steps of data preprocessing and heterogeneous processing are realized as follows: the original algorithm is ported to the MIC, and at the same time the offload_transfer statement is used for asynchronous transfer; the part of the data to be transmitted to the MIC is processed in advance, and asynchronous transfer is then performed. In practical application, CPU and MIC parallelism may use the if clause of the offload target statement to let the MIC and the CPU run heterogeneously.
In the embodiment of the invention, the specific implementation process of the CPU & MIC load balancing processing is as follows:
Through the offload mode with the CPU as host and the MIC as coprocessor, the MIC is used to improve parallel efficiency. Meanwhile, the optimal static load balance is obtained through testing, so that the performance of the CPU and the MIC is exploited to the maximum extent and the best optimization effect is obtained. The results are shown in Table 3.
CPU operation rate (%) | Time (s) | Rate (million grid points/s)
---|---|---
0 | 11.15 | 4775.89
25 | 9.51 | 5599.5
37.5 | 8.77 | 6071.97
50 | 7.75 | 6871.12
62.5 | 6.96 | 7651.03
72 | 6.17 | 8630.66
75 | 5.99 | 8890.02
78 | 5.61 | 9492.19
81 | 5.91 | 9010.36
87.5 | 6.36 | 8372.83
100 | 6.17 | 8630.66

TABLE 3
Through repeated measurements, the static load balance point is found to lie at about 78% CPU share. Therefore, in the embodiment of the present invention, the CPU operation proportion is preferably 78% and the MIC operation proportion 22%, which achieves the optimal static load balance.
In the embodiment of the present invention, optimization and vectorization of data types are also required in the load balancing process. As can be seen from the data structure above, the main data type is complex, which makes vectorization exceptionally difficult. Because this special data structure defeats automatic vectorization, when SIMD statements are used the real and imaginary parts of each complex number must be extracted and stored in corresponding double arrays, after which vectorization containing 4 double multiplications is carried out. In the embodiment of the present invention, the SSE instruction set is preferably used for data-type optimization and vectorization.
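The split-and-vectorize idea can be sketched with SSE2 intrinsics as follows. This is an illustrative x86-only sketch, not the patent's actual kernel: the real and imaginary parts are assumed to be already extracted into separate double arrays, and the accumulation grid += C * d (with d real) is done two doubles per 128-bit register; the function name is ours.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2 double-precision intrinsics

// grid[i] += c[i] * d on split real/imaginary arrays, vectorized with SSE2.
void axpySplitComplex(double* gridRe, double* gridIm,
                      const double* cRe, const double* cIm,
                      double d, std::size_t n) {
    __m128d vd = _mm_set1_pd(d);
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {  // two doubles per SSE register
        _mm_storeu_pd(gridRe + i,
                      _mm_add_pd(_mm_loadu_pd(gridRe + i),
                                 _mm_mul_pd(_mm_loadu_pd(cRe + i), vd)));
        _mm_storeu_pd(gridIm + i,
                      _mm_add_pd(_mm_loadu_pd(gridIm + i),
                                 _mm_mul_pd(_mm_loadu_pd(cIm + i), vd)));
    }
    for (; i < n; ++i) {          // scalar tail
        gridRe[i] += cRe[i] * d;
        gridIm[i] += cIm[i] * d;
    }
}
```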
The optimization effect of the embodiment of the invention is evaluated as follows:
nSamples = 3200000 (total number of samples, determining the number of first-layer loop iterations); wSize = 33; nChan = 1; gSize = 4096; baseline = 2000; cellSize = 5.000000. These variables affect the values of cind, gind, etc., and the first three can be modified arbitrarily.
Test result representation: Time (s) denotes the gridKernel execution time; Rate (million grid points/s) denotes the number of grid points processed per second, so a higher rate indicates a faster program.
Correctness verification (L1 norm): the program verify compares the execution result of gridKernel with the standard result grid_std.dat; if the L1-norm difference is less than 1e-12, the result is accepted.
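The check described above amounts to the following sketch; the 1e-12 threshold is from the text, while the function names are ours.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Sum of absolute element-wise differences between the computed grid and the
// reference grid (the L1 norm of the difference).
double l1Norm(const double* grid, const double* gridStd, std::size_t n) {
    double norm = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        norm += std::fabs(grid[i] - gridStd[i]);
    return norm;
}

// A result is accepted when the L1-norm difference is below 1e-12.
bool acceptable(double norm) { return norm < 1e-12; }
```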
For the optimization results, a gridding operation was carried out on a random sample of size 3200000 on the experimental platform, the system time was measured, and L1-norm checking was performed on the results to ensure that the error is within the allowed range. Through continuous improvement of the optimization algorithm, the optimization results shown in Table 4 were obtained:
Version | Time (s) | Rate (million grid points/s) | Optimization multiple
---|---|---|---
Serial | 229.37 | 232.163 | 1
Memory partitioning algorithm | 26.17 | 2034.82 | 8.76
MIC+CPU | 5.61 | 9492.19 | 40.89

TABLE 4
The embodiment of the invention optimizes the gridding program in the SKA project on the basis of the MIC. It uses several optimization measures, including a memory partitioning algorithm, optimization and vectorization of the data structures and memory use, a parallel algorithm based on the MIC hardware, and compiler-level optimization based on the Intel compiler, and finally performs MPI parallel processing across multiple nodes, so that the final average speedup reaches 40 times, which is of important significance for the SKA project.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.
Claims (6)
1. A method of optimizing astronomy software gridding, the method comprising:
memory partition parallelism is realized;
carrying out load balancing processing on a CPU and an integrated many-core architecture MIC;
the method for realizing the memory partition parallelism comprises the following steps:
dividing a two-dimensional grid into four regions representing different types, dividing a memory into a plurality of memory blocks, and respectively allocating the memory blocks to the four regions, wherein the grid represents a two-dimensional vector, and the size of the two-dimensional vector is a constant determined according to the characteristics of a telescope;
establishing a plurality of threads in one-to-one correspondence with the memory blocks, so that each thread can read and write only one memory block, and each memory block can be read and written by one and only one thread.
2. The method according to claim 1, wherein an offload mode with a CPU as a main mode and a MIC as an auxiliary mode is adopted to achieve load balancing between the CPU and the MIC.
3. The method of claim 2, wherein the CPU operation percentage is 78% and the MIC operation percentage is 22%.
4. The method according to any of claims 1-3, wherein load balancing the CPU and the MIC further comprises: the SSE instruction set is used for optimization and vectorization of data types.
5. The method according to any of claims 1-3, wherein load balancing the CPU and the MIC further comprises: asynchronous transfer is implemented using transfer statements.
6. The method of claim 1, after the implementing a memory partitioning parallelism algorithm and before the load balancing the CPU and the integrated many-core architecture MIC, the method further comprising: data preprocessing and heterogeneous processing.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201610303402.8A | 2016-05-09 | 2016-05-09 | Method for optimizing astronomy software gridding
Publications (2)
Publication Number | Publication Date
---|---
CN105955825A | 2016-09-21
CN105955825B | 2020-07-10
Family
ID=56914325
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610303402.8A Active CN105955825B (en) | 2016-05-09 | 2016-05-09 | Method for optimizing astronomy software gridding |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105955825B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106598552A (en) * | 2016-12-22 | 2017-04-26 | 郑州云海信息技术有限公司 | Data point conversion method and device based on Gridding module |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103279391A (en) * | 2013-06-09 | 2013-09-04 | 浪潮电子信息产业股份有限公司 | Load balancing optimization method based on CPU (central processing unit) and MIC (many integrated core) framework processor cooperative computing |
CN103530132A (en) * | 2013-10-29 | 2014-01-22 | 浪潮电子信息产业股份有限公司 | Method for transplanting CPU (central processing unit) serial programs to MIC (microphone) platform |
CN104375838A (en) * | 2014-11-27 | 2015-02-25 | 浪潮电子信息产业股份有限公司 | OpenMP (open mesh point protocol) -based astronomy software Griding optimization method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||