CN105955825B - Method for optimizing astronomy software gridding - Google Patents


Info

Publication number
CN105955825B
CN105955825B
Authority
CN
China
Prior art keywords
mic
cpu
memory
grid
load balancing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610303402.8A
Other languages
Chinese (zh)
Other versions
CN105955825A (en)
Inventor
刘刚
郭聪
梁文栋
毛睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN201610303402.8A priority Critical patent/CN105955825B/en
Publication of CN105955825A publication Critical patent/CN105955825A/en
Application granted granted Critical
Publication of CN105955825B publication Critical patent/CN105955825B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a method for optimizing astronomy software gridding, which comprises the following steps: realizing memory-partitioned parallelism; and performing load balancing between the CPU and the MIC. The method of the embodiment of the invention greatly improves the running efficiency of gridding in the SKA project and is of important significance to the SKA project.

Description

Method for optimizing astronomy software gridding
Technical Field
The invention relates to the technical field of computer application, in particular to a method for optimizing astronomy software gridding.
Background
Intel Corporation proposed the Many Integrated Core (MIC) architecture, a general-purpose many-core coprocessor. Unlike other accelerator platforms, the MIC uses a shared-memory architecture, which makes it easy to program.
gridding is a very important step, arguably the most important step, in the data processing pipeline of the Square Kilometre Array (SKA) radio telescope. To generate a sky image, a series of operations must be performed on the data collected by the radio telescope; because the data produced by the telescope are irregularly sampled, they must first be mapped onto a regular two-dimensional grid before a Fourier transform can be applied. At present, the speed of the serial version of gridding falls short of what is required.
Disclosure of Invention
In order to solve the existing technical problem, the embodiment of the invention provides a method for optimizing gridding.
In order to achieve the above purpose, the technical solution of the embodiment of the present invention is realized as follows:
a method of optimizing astronomy software gridding, the method comprising:
memory partition parallelism is realized;
and carrying out load balancing processing on the CPU and the MIC.
The realizing of memory partition parallelism includes: dividing the two-dimensional grid into four regions representing different types, dividing the memory into a plurality of memory blocks, and allocating the memory blocks to the four regions respectively, wherein grid represents a two-dimensional vector whose size is a constant determined by the characteristics of the telescope; and establishing a plurality of threads mapped one-to-one to the memory blocks, so that each thread reads and writes exactly one memory block, and each memory block is read and written by exactly one thread.
An offload mode with the CPU as host and the MIC as coprocessor is selected to realize load balancing between the CPU and the MIC.
Preferably, the CPU handles 78% of the operations and the MIC handles 22%.
Wherein, the load balancing of the CPU and the MIC further comprises: the SSE instruction set is used for optimization and vectorization of data types.
Wherein, the load balancing of the CPU and the MIC further comprises: asynchronous transfer is implemented using transfer statements.
After the implementing of the memory partitioning parallel algorithm and before the load balancing of the CPU and the MIC, the method further includes: data preprocessing and heterogeneous processing.
The embodiment of the invention provides a MIC-based method for optimizing gridding, which optimizes the gridding program in the SKA project on top of the MIC. The optimization measures include the memory partitioning algorithm, optimization and vectorization of data structures and memory use, a parallel algorithm based on the MIC hardware, and compiler-level optimization based on the Intel compiler. The final average speedup reaches 40 times, greatly improving the running efficiency of gridding in the SKA project, which is of important significance to the SKA project.
Drawings
In the drawings, which are not necessarily drawn to scale, like reference numerals may describe similar components in different views. Like reference numerals having different letter suffixes may represent different examples of similar components. The drawings illustrate generally, by way of example, but not by way of limitation, various embodiments discussed herein.
FIG. 1 is a schematic flow chart of a gridding optimization method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating the function of the grid according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an example of memory partitioning according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating the data contention phenomenon arising in the memory allocation of FIG. 3.
Detailed Description
The main idea of the embodiment of the invention is as follows: based on the MIC, gridding is optimized in two aspects, memory-partitioned parallelism and CPU & MIC load balancing.
As shown in fig. 1, the MIC-based method for optimizing gridding in the embodiment of the present invention mainly includes the following steps:
step 101: memory partition parallelism is realized;
step 102: and carrying out load balancing processing on the CPU and the MIC.
In step 101, memory partitioning is performed for the gridding program, data contention is eliminated, the boundary effect is controlled, and parallel read-write by multiple threads is realized. The process of realizing memory partition parallelism in this step comprises: eliminating data contention, dividing redundant memory, and establishing the correspondence between threads and memory partitions, thereby realizing parallelism. Specifically, the two-dimensional grid is divided into four regions representing different types, the memory is divided into a plurality of memory blocks, and the memory blocks are allocated to the four regions respectively, where grid represents a two-dimensional vector whose size is a constant determined by the characteristics of the telescope; a plurality of threads are established and mapped one-to-one to the memory blocks, so that each thread reads and writes exactly one memory block, and each memory block is read and written by exactly one thread.
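As a minimal sketch of this partitioning scheme (the struct and function names are illustrative, not from the patent), the mapping of 16 threads onto 4 redundant grid copies times 4 quadrant regions, and the routing of a grid coordinate to its quadrant, can be expressed as:

```cpp
#include <cstddef>

// Illustrative sketch: the gSize x gSize grid is kept in four redundant
// copies (grid_1..grid_4); each copy is split into four quadrant regions,
// and thread t (1..16) owns exactly one (copy, region) pair, matching the
// layout of Table 2 in the text.
struct Block {
    int gridCopy;  // which redundant grid copy (1..4)
    int region;    // quadrant: 0 = upper left, 1 = upper right,
                   //           2 = lower left, 3 = lower right
};

// Threads 1-4 work on grid_1, threads 5-8 on grid_2, and so on.
Block blockOfThread(int thread /* 1-based */) {
    return Block{ (thread - 1) / 4 + 1, (thread - 1) % 4 };
}

// Quadrant that a grid coordinate falls into, used to route a sample to
// the thread owning the matching region of a grid copy.
int regionOf(std::size_t x, std::size_t y, std::size_t gSize) {
    int col = (x < gSize / 2) ? 0 : 1;
    int row = (y < gSize / 2) ? 0 : 1;
    return row * 2 + col;
}
```

Under this mapping no two threads ever touch the same (copy, region) pair, which is exactly the exclusivity property the text requires.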
In step 102, a heterogeneous CPU & MIC design is adopted, asynchronous transmission and data vectorization are performed on the MIC, and the load balance between the CPU and the MIC is controlled. This specifically comprises: using the offload mode of the MIC; realizing asynchronous transmission with the offload_transfer statement; and using the SSE instruction set for data type optimization and vectorization. In practical application, an offload mode with the CPU as host and the MIC as coprocessor is adopted to realize load balancing between the CPU and the MIC.
After step 101 and before step 102, the method may further include steps of data preprocessing and heterogeneous processing. Specifically, the preprocessing includes data type conversion, for example converting STL complex data into plain double data for transmission, and format conversion, for example converting an array into a pointer.
Further, the data preprocessing and the heterogeneous processing may further include: a data structure is established.
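The type-conversion part of the preprocessing can be sketched as follows. This assumes the samples are held as std::complex<double> and simply splits them into separate real and imaginary double arrays for transfer; it is a simplification of whatever layout the original program uses:

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Sketch of the preprocessing step (assumed data shape): split
// std::complex<double> values into two contiguous double arrays so the
// data can be transferred to the MIC and later treated as plain doubles.
void splitComplex(const std::vector<std::complex<double>>& in,
                  std::vector<double>& re, std::vector<double>& im) {
    re.resize(in.size());
    im.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        re[i] = in[i].real();  // real parts, contiguous
        im[i] = in[i].imag();  // imaginary parts, contiguous
    }
}
```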
The following is a detailed description of the specific implementation process for optimizing gridding according to an embodiment of the present invention.
The process of establishing the data structure in the embodiment of the invention is as follows:
The data structure may be set according to the following Table 1 and specifically includes five variables: samples, C, grid, samples.iu/iv/coffset, and samples.data. Here the cube C represents a five-dimensional vector, grid represents a two-dimensional vector of gSize × gSize, and the values of sSize, gSize, etc. are constants determined by the characteristic input of the telescope.
Variable                Type            Size
samples                 struct vector   nSamples
C                       double vector   C.size()
grid                    double vector   gSize * gSize
samples.iu/iv/coffset   integer         1
samples.data            double          1

TABLE 1
Here, a five-dimensional vector with an actual size of [129, 129, 8, 8, 33] is regarded as a cube C whose base is sSize × sSize (129 × 129) and whose height is 8 × 8 × 33. grid represents a two-dimensional vector of gSize × gSize, which can be viewed as a plane.
Two head pointers are set: gind (index into grid) and cind (index into C). gind is the head pointer of an sSize × sSize square region of grid, and cind is the head pointer of an sSize × sSize plane in the cube C. Through these two head pointers, each sSize × sSize memory plane of the cube C is multiplied by the sample value d and accumulated into the corresponding square memory region inside grid. Here grid is a regular two-dimensional grid, and its function is shown in fig. 2, where double *gptr = &grid[gind]; double *cptr = &C[cind]; gptr (pointer into grid) and cptr (pointer into C) are the working pointers. The values of sSize, gSize, etc. are constants determined by the characteristic input of the telescope.
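A minimal sketch of this accumulation (index handling simplified; the function name gridOneSample is illustrative, not from the patent): one sSize × sSize plane of C, scaled by the sample value d, is added into the square of grid that starts at gind:

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch of the per-sample accumulation: for one sample with
// value d, the sSize x sSize plane of C starting at cind is scaled by d
// and accumulated into the sSize x sSize square of grid starting at gind.
// Boundary handling and the real index computation are omitted.
void gridOneSample(std::vector<double>& grid, std::size_t gSize, std::size_t gind,
                   const std::vector<double>& C, std::size_t cind,
                   std::size_t sSize, double d) {
    for (std::size_t row = 0; row < sSize; ++row) {
        double* gptr = &grid[gind + row * gSize];     // row inside the grid square
        const double* cptr = &C[cind + row * sSize];  // matching row of the C plane
        for (std::size_t col = 0; col < sSize; ++col)
            gptr[col] += d * cptr[col];               // accumulate d * C into grid
    }
}
```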
The specific implementation process of the memory partitioning parallel algorithm in the embodiment of the invention is as follows:
In order to solve the problem of data contention, the basic idea of the memory partitioning parallel algorithm of the embodiment of the present invention is to divide the grid into four regions representing different types; assuming the memory is divided into 16 memory blocks, the memory blocks correspond to the four regions respectively, and the memory allocation is shown in fig. 3. Block 1 belongs to the first type. If the point gind falls near the right or lower boundary, part of the sSize × sSize square corresponding to that gind necessarily falls inside blocks 2 to 4, as shown in fig. 4. In that case the dark gray parts of blocks 2 to 4 must not be read or written; otherwise the same memory could be read and written by different threads at the same time, and data contention would occur.
To avoid data contention, the memory partitioning parallel algorithm of the embodiment of the present invention further maps each thread one-to-one to a unique memory block, so that each thread reads and writes exactly one memory block and each memory block is read and written by exactly one thread. For example, thread No. 1 reads and writes the upper-left region of the 1st memory block and thread No. 2 reads and writes its upper-right region, the corresponding grid being called the first type grid_1; in this way the memory redundancy algorithm is realized, and the reads and writes of the second, third and fourth types grid_2, grid_3 and grid_4 are likewise guaranteed. In practical application, memory of four times the grid size is preferably allocated.
An example of thread and memory allocation is shown in Table 2 below:
Thread No.   Grid type   Region        Memory block No.
1            grid_1      upper left    1
2            grid_1      upper right   1
3            grid_1      lower left    1
4            grid_1      lower right   1
5            grid_2      upper left    2
6            grid_2      upper right   2
7            grid_2      lower left    2
8            grid_2      lower right   2
9            grid_3      upper left    3
10           grid_3      upper right   3
11           grid_3      lower left    3
12           grid_3      lower right   3
13           grid_4      upper left    4
14           grid_4      upper right   4
15           grid_4      lower left    4
16           grid_4      lower right   4

TABLE 2
In this way multiple threads can read and write one grid without data contention, so parallelization is realized. In practical application, the memory partitioning method requires the correspondence between thread numbers and memory blocks to be established and stored in advance, and a final accumulation and combination step is needed.
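The final accumulation-and-combination step can be sketched as an element-wise sum of the redundant grid copies (a simplified stand-in for the actual merge; the function name is illustrative):

```cpp
#include <cstddef>
#include <vector>

// Illustrative merge step: once all threads have finished, the redundant
// copies grid_1..grid_4 are summed element-wise into the single output grid.
std::vector<double> mergeGrids(const std::vector<std::vector<double>>& copies) {
    std::vector<double> out(copies.front().size(), 0.0);
    for (const auto& g : copies)
        for (std::size_t i = 0; i < g.size(); ++i)
            out[i] += g[i];  // element-wise sum of the redundant grids
    return out;
}
```

Because each grid point was only ever written by one thread within each copy, this sum is the only point where contributions from different threads meet, so no locking is needed anywhere.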
In the embodiment of the invention, the steps of data preprocessing and heterogeneous processing are realized as follows: the original algorithm is ported to the MIC, and the offload_transfer statement is used for asynchronous transmission; part of the data to be transmitted to the MIC is processed in advance, and asynchronous transfer is thereby achieved. In practical application, the if clause of the offload (target) statement may be used to make the MIC and the CPU run heterogeneously.
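Since offload_transfer is specific to the Intel compiler, the following sketch only mimics the same overlap pattern with portable std::async: chunk k+1 is "transferred" (here merely copied) in the background while chunk k is processed. It is an analogy for the double-buffering idea, not the patent's actual offload code:

```cpp
#include <future>
#include <numeric>
#include <cstddef>
#include <vector>

// Analogy for asynchronous transfer: overlap the "transfer" of the next
// data chunk with the computation on the current one, as offload_transfer
// with signal/wait clauses would do on a real MIC card.
double processWithOverlap(const std::vector<std::vector<double>>& chunks) {
    double total = 0.0;
    auto prepare = [](std::vector<double> c) { return c; };  // stands in for the transfer
    std::future<std::vector<double>> next =
        std::async(std::launch::async, prepare, chunks[0]);
    for (std::size_t k = 0; k < chunks.size(); ++k) {
        std::vector<double> cur = next.get();      // wait for chunk k to arrive
        if (k + 1 < chunks.size())                 // start moving chunk k+1 now
            next = std::async(std::launch::async, prepare, chunks[k + 1]);
        total += std::accumulate(cur.begin(), cur.end(), 0.0);  // compute on chunk k
    }
    return total;
}
```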
In the embodiment of the invention, the specific implementation process of the CPU & MIC load balancing processing is as follows:
Through the offload mode with the CPU as host and the MIC as coprocessor, the MIC is used to improve parallel efficiency. Meanwhile, the optimal static load balance was determined through testing, so that the performance of both the CPU and the MIC is exploited to the greatest extent and the best optimization effect is obtained. The results are shown in Table 3.
CPU work share (%)   Time (s)   Rate (million grid points/s)
0                    11.15      4775.89
25                   9.51       5599.5
37.5                 8.77       6071.97
50                   7.75       6871.12
62.5                 6.96       7651.03
72                   6.17       8630.66
75                   5.99       8890.02
78                   5.61       9492.19
81                   5.91       9010.36
87.5                 6.36       8372.83
100                  6.17       8630.66

TABLE 3
Through repeated measurements, the static load-balance point is found to lie at about a 78% CPU share of the work. Therefore, in the embodiment of the present invention, it is preferable that the CPU handle 78% of the operations and the MIC 22% to achieve the optimal static load balance.
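The static 78/22 split can be sketched as a simple partition of the sample index range. Integer percentages are used here to avoid floating-point rounding at the boundary; the helper name is illustrative:

```cpp
#include <cstddef>

// Illustrative static split: the first cpuPercent% of the samples are
// processed on the CPU, the remainder are offloaded to the MIC.
std::size_t cpuShare(std::size_t nSamples, unsigned cpuPercent = 78) {
    return nSamples * cpuPercent / 100;  // samples [0, cpuShare) -> CPU
}
```

With nSamples = 3200000 this assigns 2496000 samples to the CPU and 704000 to the MIC, matching the 78/22 balance point found above.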
In the embodiment of the present invention, the data types must also be optimized and vectorized during load balancing. As can be seen from the data structure above, the main data type is complex, which makes vectorization exceptionally difficult: this special data structure defeats automatic vectorization, so when vectorizing with SIMD statements, the real part and the imaginary part of each complex number are first extracted and stored in corresponding double arrays, and then the vectorization containing 4 double multiplications is carried out. The SSE instruction set is preferably used for this optimization and vectorization of data types.
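A portable sketch of the split-array complex multiply follows, written as plain loops that a compiler can auto-vectorize with SSE rather than as explicit intrinsics. Each complex product costs the four double multiplications mentioned above; the function name is illustrative:

```cpp
#include <cstddef>
#include <vector>

// Illustrative split-array complex multiply: with real and imaginary parts
// in separate double arrays, each complex product (a * b) is expressed as
// the 4 double multiplies the text refers to, in a loop that SIMD units
// can process several elements at a time.
void mulSplitComplex(const std::vector<double>& ar, const std::vector<double>& ai,
                     const std::vector<double>& br, const std::vector<double>& bi,
                     std::vector<double>& cr, std::vector<double>& ci) {
    cr.resize(ar.size());
    ci.resize(ar.size());
    for (std::size_t i = 0; i < ar.size(); ++i) {
        cr[i] = ar[i] * br[i] - ai[i] * bi[i];  // real part: 2 multiplies
        ci[i] = ar[i] * bi[i] + ai[i] * br[i];  // imag part: 2 multiplies
    }
}
```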
The optimization process of the embodiment of the invention is calculated as follows:
nSamples = 3200000 (total number of samples, which determines the number of iterations of the outermost loop); wSize = 33; nChan = 1; gSize = 4096; baseline = 2000; cellSize = 5.000000. These variables affect the values of cind, gind and the other function variables; the first three may be modified arbitrarily.
Test result notation: Time(s) denotes the gridKernel execution time; Rate (million grid points/s) denotes the number of grid points processed per second, a higher rate indicating a faster program.
Correctness verification (L1 norm): the execution result of gridKernel is compared with the standard result grid_std.dat by the program verify; if the L1-norm difference is less than 1e-12, the result is accepted.
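The L1-norm check can be sketched as follows; verifyL1 and the default tolerance are illustrative, following the 1e-12 threshold stated above:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative correctness check: sum the absolute element-wise differences
// between the computed grid and the reference grid_std; accept the result
// when the L1 norm of the difference is below the tolerance.
bool verifyL1(const std::vector<double>& grid, const std::vector<double>& ref,
              double tol = 1e-12) {
    double l1 = 0.0;
    for (std::size_t i = 0; i < grid.size(); ++i)
        l1 += std::fabs(grid[i] - ref[i]);  // accumulate |difference|
    return l1 < tol;
}
```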
For the optimization results, the gridding operation was run on a random sample of size 3200000 on the experimental platform, the system time was measured, and the result was checked with the L1 norm to ensure the error stays within the allowed range. Through continuous improvement of the optimization algorithm, the following results were obtained, as shown in Table 4:
Version                         Time (s)   Rate (million grid points/s)   Speedup
Serial                          229.37     232.163                        1
Memory partitioning algorithm   26.17      2034.82                        8.76
MIC + CPU                       5.61       9492.19                        40.89

TABLE 4
The embodiment of the invention optimizes the gridding program in the SKA project on the basis of the MIC. It uses several optimization measures, including the memory partitioning algorithm, optimization and vectorization of data structures and memory use, a parallel algorithm based on the MIC hardware, and compiler-level optimization based on the Intel compiler, and finally performs MPI parallel processing across multiple nodes, so that the final average speedup reaches 40 times, which is of important significance to the SKA project.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims (6)

1. A method of optimizing astronomy software gridding, the method comprising:
memory partition parallelism is realized;
carrying out load balancing processing on a CPU and an integrated many-core architecture MIC;
the method for realizing the memory partition parallelism comprises the following steps:
dividing a two-dimensional grid into four regions representing different types, dividing a memory into a plurality of memory blocks, and respectively allocating the memory blocks to the four regions, wherein the grid represents a two-dimensional vector, and the size of the two-dimensional vector is a constant determined according to the characteristics of a telescope;
establishing a plurality of threads mapped one-to-one to the memory blocks, so that each thread reads and writes exactly one memory block, and each memory block is read and written by exactly one thread.
2. The method according to claim 1, wherein an offload mode with a CPU as a main mode and a MIC as an auxiliary mode is adopted to achieve load balancing between the CPU and the MIC.
3. The method of claim 2, wherein the CPU operation percentage is 78% and the MIC operation percentage is 22%.
4. The method according to any of claims 1-3, wherein load balancing the CPU and the MIC further comprises: the SSE instruction set is used for optimization and vectorization of data types.
5. The method according to any of claims 1-3, wherein load balancing the CPU and the MIC further comprises: asynchronous transfer is implemented using transfer statements.
6. The method of claim 1, after the implementing a memory partitioning parallelism algorithm and before the load balancing the CPU and the integrated many-core architecture MIC, the method further comprising: data preprocessing and heterogeneous processing.
CN201610303402.8A 2016-05-09 2016-05-09 Method for optimizing astronomy software gridding Active CN105955825B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610303402.8A CN105955825B (en) 2016-05-09 2016-05-09 Method for optimizing astronomy software gridding


Publications (2)

Publication Number Publication Date
CN105955825A CN105955825A (en) 2016-09-21
CN105955825B true CN105955825B (en) 2020-07-10

Family

ID=56914325

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610303402.8A Active CN105955825B (en) 2016-05-09 2016-05-09 Method for optimizing astronomy software gridding

Country Status (1)

Country Link
CN (1) CN105955825B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106598552A (en) * 2016-12-22 2017-04-26 郑州云海信息技术有限公司 Data point conversion method and device based on Gridding module

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279391A (en) * 2013-06-09 2013-09-04 浪潮电子信息产业股份有限公司 Load balancing optimization method based on CPU (central processing unit) and MIC (many integrated core) framework processor cooperative computing
CN103530132A (en) * 2013-10-29 2014-01-22 浪潮电子信息产业股份有限公司 Method for transplanting CPU (central processing unit) serial programs to MIC (microphone) platform
CN104375838A (en) * 2014-11-27 2015-02-25 浪潮电子信息产业股份有限公司 OpenMP (open mesh point protocol) -based astronomy software Griding optimization method


Also Published As

Publication number Publication date
CN105955825A (en) 2016-09-21


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant