CN108614735B

CN108614735B - Template calculation method and system based on spatial dense paving

Info

Publication number: CN108614735B
Application number: CN201810204889.3A
Authority: CN
Inventors: 张云泉; 袁泉; 黄珊; 郭鹏
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2018-03-13
Filing date: 2018-03-13
Publication date: 2021-03-05
Anticipated expiration: 2038-03-13
Also published as: CN108614735A

Abstract

The invention relates to a template (Stencil) calculation method and system based on spatial dense paving. The method comprises: densely paving the data space according to the calculation template so as to divide the data space into a plurality of first-class blocks, and updating the first-class blocks as natural blocks; taking the edge centers of the natural blocks as first-class center points, and dividing the data space into a plurality of second-class blocks by assigning each grid point in the data space to its closest first-class center point; updating the second-class blocks, taking the grid points whose update count is zero as second-class center points, and dividing the data space into a plurality of third-class blocks by assigning each grid point in the data space to its closest second-class center point; and updating the third-class blocks to complete the update of the data space. This space division method can improve the efficiency of updating the data space.

Description

Template calculation method and system based on spatial dense paving

Technical Field

The invention relates to the field of high-performance computing, and in particular to a parallel template (Stencil) calculation method and system based on spatial dense paving.

Background

Blocking (tiling) is one of the most effective transformation techniques for improving the data locality and parallelism of multi-level loop nests. Stencil computation is a class of computation in the field of high-performance computing in which every point in space is updated from its neighboring points, and every point uses the same computation template (Stencil). For example, in a two-dimensional box-type Stencil, updating a point at time t requires the values at time t-1 of 9 points in total: the point itself and its 8 neighbors. The shape of the first type of block is determined by the shape of the Stencil computation template (star or box); the shapes of the subsequent blocks are then determined by what remains to be updated after the preceding blocks have been updated.
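As an illustration of the dependency pattern just described, the following minimal Python sketch performs one time step of a 2D 9-point box-type Stencil. The averaging kernel, the name `box_step`, and the choice to leave boundary points unchanged are assumptions made for the example; they are not taken from the invention.

```python
def box_step(prev):
    """Return the grid at time t from the grid at time t-1 (2D 9-point box Stencil).

    Each interior point becomes the mean of itself and its 8 neighbors.
    Boundary points are copied unchanged (an assumed boundary condition).
    """
    rows, cols = len(prev), len(prev[0])
    new = [row[:] for row in prev]
    for x in range(1, rows - 1):
        for y in range(1, cols - 1):
            total = 0.0
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    total += prev[x + dx][y + dy]
            new[x][y] = total / 9.0
    return new


grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 9.0  # a single non-zero point in the middle
after = box_step(grid)
```

Every point within one cell of the non-zero point reads it at time t-1, which is the 9-point dependency that the blocking schemes discussed below have to respect.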

Hyper-rectangular tiling is commonly used in manually tuned implementations in the field of high-performance computing. When multiple time steps are computed, data dependencies between blocks can be resolved by redundant computation, i.e. overlapped tiling. Philips and Fatica implemented manually tuned 3.5D blocking code on GPUs, with temporal blocking built on 2.5D spatial blocking. Although the regular shape of hyper-rectangular blocks supports high concurrency and finer optimizations, the overhead of redundant computation may outweigh the performance gain. The present invention therefore focuses on schemes without redundant computation.

Time skewing produces blocks that are parallelograms in 2D, parallelepipeds in 3D, and hyper-parallelepipeds in higher dimensions. This approach eliminates redundant computation, but in most cases only one block shape is used, which results in pipelined start-up and limited concurrency. The dense paving scheme provided by the invention is fully parallel: all blocks between two synchronizations can be executed concurrently.

Bondhugula et al. first proposed diamond tiling for the 1D Stencil, and Bandishti et al. then extended it to higher-dimensional Stencils. Grosser et al. coarsened the tips of the diamonds, turning them into hexagons in 2D and octahedra in 3D. In essence, diamond tiling cuts and rotates the space with parallelepipeds to obtain maximum parallelism, and the wavefront of blocks executed in parallel forms a hyperplane perpendicular to the time dimension. The main advantage of this method is that it allows concurrent start-up.

Cache-oblivious algorithms can make full use of data locality without knowing the parameters of the memory hierarchy. Frigo and Strumpen proposed the first serial and parallel cache-oblivious Stencil algorithms. Their cache-oblivious parallelogram blocking divides the space and time dimensions simultaneously, and can be regarded as a cache-oblivious version of time skewing with wavefront parallelism. Pochoir uses a hyperspace cut method that cuts all spatial dimensions simultaneously; compared with the serial algorithm, it improves parallelism while keeping the same cache complexity.

Split tiling identifies independent sub-blocks within each block and executes them concurrently, then passes the results to their successors so that the remaining regions can also execute concurrently, which avoids the pipeline start-up overhead of wavefront parallelism. Grosser et al. proposed another cache-oblivious scheme similar to the hyperspace cut, in which the splitting can be applied recursively across all spatial dimensions.

Strzodka et al. and Malas et al. proposed the CATS algorithm and the MWD algorithm, which combine diamond tiling, parallelogram tiling, and pipelined execution. Grosser et al. described a hybrid hexagonal and parallelogram tiling algorithm. These algorithms decompose the iteration space into two parts, using hexagonal tiling or diamond tiling in the time dimension and one spatial dimension, and time-skewed tiling in the other spatial dimensions. Hexagonal tiling corresponds to diamond tiling stretched along the spatial dimension, so that each block in a higher-order Stencil depends on at most three preceding blocks; it is an extension of hybrid tiling. Hybrid split tiling is a combination of nested split tiling and time skewing.

Diamond tiling, hyperspace cutting (cache-oblivious), and nested split tiling are recently developed methods that can achieve maximum concurrency, but each has shortcomings. The dense paving algorithm of the invention is derived from a mathematical framework that describes the blocking scheme explicitly, and can therefore overcome the following problems.

Diamond tiling is a compiler transformation technique based on the polyhedral model. Traditional code generation methods require the block size to be fixed at compile time, which limits efficient autotuning and code portability. Bertolacci et al. proposed parameterized versions of diamond tiling for 2D Stencils, but did not address automatic code generation. Another disadvantage of diamond tiling is that the small pieces of data at the tips of the diamonds must be processed. Grosser et al. observed that diamond tiling can be extended to hexagonal tiling in 2D, and chose truncated octahedra as the extension of the diamond in 3D, but they did not describe a block coarsening scheme for 3D or higher dimensions. In addition, diamond tiling fills the entire iteration space with skewed hyper-rectangles, which makes it difficult to find a block size that guarantees concurrent start-up. Extending it directly to multi-level diamond tiling is also difficult.

Common cache-oblivious algorithms are limited by the overhead of recursion and can achieve only limited speedups. Hyperspace cutting interleaves cuts in time and space, and the divide-and-conquer recursion of the algorithm introduces pseudo-dependencies between sub-blocks. The cache-oblivious wavefront algorithm proposed by Tang et al. can eliminate these pseudo-dependencies, improve parallelism, and shorten the critical path of the algorithm, but it requires checking the wavefront data structure, which adds overhead at execution time.

The biggest problem of nested split tiling is its excessive synchronization overhead: for a d-dimensional Stencil, nested split tiling needs to synchronize 2^d times. Tang et al. proved that their algorithm has the same synchronization overhead as the present algorithm. However, a careful study of their code shows that the implementation, which is based on the common cache-oblivious recursive framework, adopts the same strategy as nested split tiling and needs to synchronize 2^d times for a d-dimensional Stencil. Although dynamic queues can reduce the synchronization overhead, this gain is offset by the scheduling cost at run time.

In studying methods for optimizing parallel Stencil computation, the inventors found that the defects of the prior art arise because it considers only simple block shapes and the efficiency of the recursive framework, and does not consider limited concurrency, pseudo-dependencies between blocks, or high synchronization overhead. These defects can be resolved by designing separate dense paving schemes over the data space for the box-type Stencil and the star-type Stencil: the corresponding natural block is obtained according to the Stencil shape; on the basis of the natural block, the center points of the densely paved blocks of the next stage are determined by a specific method, yielding new densely paved blocks; and with this new two-layer dense paving method, a parallel Stencil algorithm with maximum concurrent execution and no redundant computation is realized.

Disclosure of Invention

The invention aims to solve problems of the prior art such as redundant computation, limited concurrency, and pseudo-dependencies, and provides a novel two-layer dense paving parallel algorithm that maximizes concurrent execution, involves no redundant computation, has concise loop conditions, and adapts to Stencils of different sizes, shapes, orders, and boundary conditions. Here, parallel means that after the data space is partitioned, the data in each block does not depend on other blocks, so all blocks can be computed independently and simultaneously.

Specifically, the invention discloses a template calculation method based on spatial dense paving, which comprises the following steps:

step 1, obtaining an array containing data to be updated and taking the array as the data space to be updated, wherein the storage position of each datum to be updated in the data space is called a grid point; densely paving the data space according to a box-type or star-type calculation template so as to divide the data space into a plurality of first-type blocks; and updating the first-type blocks as natural blocks, wherein the first-type blocks are updated in parallel;

step 2, taking the edge center of the natural block as a first-class central point, and dividing the data space into a plurality of second-class blocks by dividing the grid points in the data space to the first-class central point closest to the grid points;

step 3, updating the second type block, taking a grid point with zero data updating times in the data space as a second central point, dividing the data space into a plurality of third blocks by dividing the grid point in the data space to the second central point closest to the grid point, wherein the updating of the second type block is parallel updating;

step 4, updating the third blocks to complete the updating of the data space, wherein the third blocks are updated in parallel.

The template calculation method based on the spatial dense paving is characterized in that the array is a two-dimensional or multi-dimensional array.

The template calculation method based on the spatial dense paving is characterized in that the step 1 further comprises the following steps: and when the calculation template is in a box shape, the data space is densely paved by adopting a square block shape, and when the calculation template is in a star shape, the data space is densely paved by adopting a diamond block shape.

The template calculation method based on the spatial dense paving is characterized in that the natural block is the minimum set of grid points required for updating the first-type block t times, wherein t is the target number of updates for all grid points in the data space.

The template calculation method based on the spatial dense paving is characterized in that the step 3 further comprises the following steps: updating the second type of block by adopting a maximum updating strategy; the step 4 further comprises: updating the third block by adopting a maximum updating strategy;

wherein the maximum update strategy is:

given a block in the data space, updating the grid points in the block along the time dimension until a grid point has been updated t times while its neighbor grid points have been updated t-2 times, wherein t is the target number of updates for all grid points in the data space.

The invention also provides a template computing system based on spatial dense paving, which comprises the following components:

the space densely paving module is used for acquiring an array containing data to be updated, taking the array as a data space to be updated, wherein the storage position of the data to be updated in the data space is called a lattice point, densely paving the data space by adopting a box-type or star-type calculation template so as to divide the data space into a plurality of first type blocks, and updating the first type blocks into natural blocks;

the first space dividing module is used for dividing the data space into a plurality of second-class blocks by taking the edge center of the natural block as a first-class center point and dividing the grid point in the data space to the first-class center point closest to the grid point;

the second space division module is used for updating the second type of block, taking a grid point with zero data updating times in the data space as a second central point, and dividing the data space into a plurality of third blocks by dividing the grid point in the data space to the second central point closest to the grid point;

and the updating module is used for updating the third block so as to finish updating the data space.

The spatial tiling-based template computing system, wherein the array is a two-dimensional or multi-dimensional array.

The spatial-tiling-based template computing system, wherein the spatial tiling module further comprises: and when the calculation template is in a box shape, the data space is densely paved by adopting a square block shape, and when the calculation template is in a star shape, the data space is densely paved by adopting a diamond block shape.

The parallel template computing system based on the spatial dense paving is characterized in that the natural block is the minimum grid point set required by the first type block for updating t times, wherein t is the target updating times of all grid points in the data space.

The space-based tiling template computing system, wherein the second space partitioning module further comprises: updating the second type of block by adopting a maximum updating strategy; the update module further comprises: updating the third block by adopting a maximum updating strategy;

wherein the maximum update strategy is:

The experimental environment was a machine equipped with two 2.70 GHz Intel Xeon E5-2670 processors, and tests were run on 1 to 24 cores. Each core has L1 and L2 caches of 32 KB and 256 KB, respectively, and the 12 cores on one socket share a 30 MB L3 cache. The ICC version is 16.0.1, and the optimization options are '-O3 -openmp'.

The box-type dense paving algorithm was compared with two other highly parallel schemes, namely Pluto and Pochoir. Pluto uses diamond tiling, and Pochoir uses hyperspace cut tiling. The tests included 4 star-type Stencils (Heat-1D 3-point, Heat-1D 5-point, Heat-2D 5-point and Heat-3D 7-point) and 2 box-type Stencils (Heat-2D 9-point and Heat-3D 27-point). Problem sizes were set according to those used in the referenced tests, so that the data scale reflects the real performance of the algorithms without making the tests excessively long. Since the code of the dense paving scheme contains more tunable parameters, the other parameters were set to half or twice the block size, so that the blocks are similar to those in Pluto and Pochoir. For the 2D 9-point Stencil, Pluto and Pochoir showed the same trends as for the 5-point Stencil, but the average performance of the scheme of the invention was 14% and 20% higher than that of Pluto and Pochoir, respectively. For the 3D 27-point Stencil, the code of the invention clearly outperformed Pluto and Pochoir, with maximum improvements of 74% and 100% and average improvements of 30% and 99%, respectively.

The star-type dense paving algorithm was compared with the Girih and Pluto schemes. For the 3D7P star-type Stencil with a data size of 256³, the scheme of the invention achieved an improvement of 1.24 times, and of 1.19 times on average; its performance on higher-order Stencils was also higher than that of the other two schemes.

Drawings

FIG. 1 is a schematic view of a 2D9P box-type Stencil;

FIG. 2 is a schematic view of a 2D5P star Stencil;

FIG. 3 is a schematic spatial view of densely-laid 2D box type Stencil data;

FIG. 4 is a schematic diagram of a densely paved 2D box type Stencil iteration space;

FIG. 5 is a schematic spatial view of densely paved 2D star-shaped Stencil data;

FIG. 6 is a schematic diagram of a densely paved 2D star-shaped Stencil iteration space;

FIGS. 7 and 8 are schematic diagrams of the double updating technique.

Detailed Description

The invention discloses a template calculation method based on spatial dense paving, which comprises the following steps:

step 1, obtaining an array containing data to be updated, taking the array as a data space to be updated, wherein the storage position of the data to be updated in the data space is called a grid point, and densely paving the data space by adopting a box-type or star-type calculation template to divide the data space into a plurality of first-type blocks (corresponding to B₁ in the following embodiment), and updating the first-type blocks as natural blocks;

step 2, taking the edge centers of the natural blocks as first-class center points, and dividing the data space into a plurality of second-class blocks (corresponding to B₂ in the following embodiment) by assigning each grid point in the data space to the first-class center point closest to it;

Step 3, updating the second type block, taking a grid point with zero data updating times in the data space as a second central point, and dividing the data space into a plurality of third blocks by dividing the grid point in the data space to the second central point closest to the grid point;

step 4, updating the third blocks to finish updating the data space.

wherein the maximum update strategy is:

In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.

The invention aims to update all grid points in a data space (two-dimensional in this embodiment) t times, i.e. to update them along the time dimension. The dimensions of the data space plus the time dimension are the dimensions of the iteration space; the data space is two-dimensional or three-dimensional, as determined by the coordinate dimension of the grid points in the Stencil calculation template, so the iteration space has one more dimension (the time dimension) than the data space. When the time slice size is b, all grid points are updated b time steps in each time slice.

First, in order to distinguish different Stencil shapes, the concept of a natural block needs to be introduced. Each coordinate point in the data space is called a grid point. When all grid points in the data space are at the same time step, the minimum set of grid points needed to update one grid point by t time steps is called the natural block for a t-step update. Because the set is minimal, the boundary of the resulting natural block contains at most one ring of grid points with update count 0, never several rings; the time step of every grid point in the data space is initially 0. Updating a datum once requires the datum itself and its surrounding data, i.e. 9 values for the box type and 5 values for the star type; the outermost data of the natural block therefore have an update count of 0, because they cannot be updated without enough surrounding data. FIGS. 1 and 2 show the natural blocks of a 4-step update for the 2D9P box-type Stencil and the 2D5P star-type Stencil, respectively. The invention uses B0 to denote a natural block in the data space, and the densely paved block in the iteration space obtained after each grid point in B0 has been updated by its corresponding number of time steps is marked as

Another concept similar to the natural block is the maximum update: given a block B in the data space, each point is updated along the time dimension until the dependency defined by the Stencil is no longer satisfied, and the maximum number of time steps reached in this way is the maximum update. The dependency means that computing the value of a grid point at time step t requires the values of its neighbor grid points at time step t-1; the dependency is not satisfied when a neighbor grid point has not yet been updated to time step t-1. Natural blocks and maximum updates are in a sense dual concepts: given the same set of grid points B0, both mechanisms produce the same B0. The two concepts also complement each other: the maximum update does not provide a specific shape for partitioning the data space, while the natural block B0 alone cannot densely pave the entire iteration space, so the maximum update is also required to generate the subsequent blocks B1 and B2 in the iteration space.
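The duality between the two concepts can be checked on a small case. The Python sketch below computes the update counts of a natural block from the distance to its center (Chebyshev distance for the box type, Manhattan distance for the star type) and then reproduces the same counts with a maximum update. The function names, the dictionary representation, and the convention that the outer ring and any point missing from the dictionary are frozen at count 0 are assumptions of the sketch, not details given by the invention.

```python
def natural_block_counts(t, shape):
    """Update counts in the natural block of a t-step update, centered at (0, 0).

    A point at distance k from the center must be updated t - k times.
    """
    counts = {}
    for dx in range(-t, t + 1):
        for dy in range(-t, t + 1):
            k = max(abs(dx), abs(dy)) if shape == "box" else abs(dx) + abs(dy)
            if k <= t:
                counts[(dx, dy)] = t - k
    return counts


def neighbor_offsets(shape):
    """8 neighbors for the box Stencil, 4 axis neighbors for the star Stencil."""
    box = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]
    return box if shape == "box" else [(dx, dy) for dx, dy in box if dx == 0 or dy == 0]


def maximum_update(counts, block, shape, t):
    """Advance the points of `block` until no further update is allowed.

    A point with count c may move to c + 1 only if every Stencil neighbor has
    count >= c. Points outside `block` are frozen; counts are capped at t.
    """
    counts = dict(counts)
    offsets = neighbor_offsets(shape)
    changed = True
    while changed:
        changed = False
        for (x, y) in block:
            c = counts[(x, y)]
            if c < t and all(counts.get((x + dx, y + dy), 0) >= c for dx, dy in offsets):
                counts[(x, y)] = c + 1
                changed = True
    return counts


def check_duality(t, shape):
    """True if the maximum update reproduces the natural block counts."""
    natural = natural_block_counts(t, shape)
    start = {p: 0 for p in natural}                   # nothing updated yet
    interior = [p for p, c in natural.items() if c > 0]
    return maximum_update(start, interior, shape, t) == natural


box_ok = check_duality(4, "box")
star_ok = check_duality(4, "star")
```

For t = 4 the box-type natural block is a 9 x 9 square and the star-type natural block is a diamond of 41 points; in both cases the outermost ring keeps count 0.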

The three-dimensional (3D) iteration space is divided into a plurality of time slices along the time dimension. The 3 dense paving stages of the box-type and star-type Stencil iteration space within each time slice are implemented as follows for t = 4, where t is the size of the time slice; t can be any other positive integer, and the steps are the same.

In the first stage, the data space is densely paved with squares (box type) or diamonds (star type) B0, and B0 is used as the first type of block, as shown in FIG. 3 (a) and FIG. 5 (a); the size of the data space is determined in advance by the scale of the problem to be solved. After the update of the natural blocks B0, the points whose update count is 0 in the box-type and star-type Stencil are the boundaries (sides) of the squares and of the diamonds, respectively, as shown in FIG. 4 (a) and FIG. 6 (a). Then, following the idea of natural blocks, the center point of each convex set of grid points whose update count is 0 after the update of B0 is taken as the center point of a block B1 (the center point always lies on the boundary of the previous block). A set S is called a convex set of grid points if every point on the line connecting any two points of S is also in S. Under this convexity requirement, the four sides are four point sets, the four boxed points are the center points of the subsequent blocks, and each point in B0 is assigned to the B1 block whose center point is nearest. The conversion from B0 to the second-type block B1 is shown by the dotted lines in FIG. 3 (b) and FIG. 5 (b). The iteration space is thus divided into a plurality of time slices, each rhombus is divided into two sub-triangles of equal size by a horizontal line parallel to the x-axis, and the iteration space is densely paved with upright and inverted triangles. Note that B0 is the first block used to densely pave the data space, which is why it is called the first type of block above; whether B0 is a square or a diamond is decided by the type of the Stencil calculation template (box or star), and B0 must satisfy the definition of a natural block. B0 is therefore both a natural block and a first-type block.
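The assignment of grid points to the nearest center point can be sketched as follows. The text says only "nearest", so the use of squared Euclidean distance is an assumption of this example, as are the function names and the choice to list a tied point under every nearest center, which is one way to model blocks that share a boundary.

```python
def nearest_centers(point, centers):
    """Return every center at minimal squared Euclidean distance from `point`."""
    def dist2(c):
        return (point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2

    best = min(dist2(c) for c in centers)
    return [c for c in centers if dist2(c) == best]


def partition(points, centers):
    """Group grid points by nearest center; tied points appear in several groups."""
    blocks = {c: [] for c in centers}
    for p in points:
        for c in nearest_centers(p, centers):
            blocks[c].append(p)
    return blocks


# One square B0 with corners (0, 0) and (4, 4), and the centers of its four edges.
side = 4
edge_centers = [(2, 0), (0, 2), (4, 2), (2, 4)]
points = [(x, y) for x in range(side + 1) for y in range(side + 1)]
blocks = partition(points, edge_centers)
```

In this small case the point (2, 1) goes only to the bottom edge center (2, 0), while the point (1, 1) on the diagonal is equidistant from (2, 0) and (0, 2) and is listed under both, so the diagonals of the square become the boundaries of the new blocks.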

In the second stage, the data space is densely paved with diamonds or squares B1, as shown in FIG. 3 (c) and FIG. 5 (c), and the blocks are then updated using the maximum update strategy. The number of updates that B1 applies to each point of the block in the iteration space is shown in FIG. 4 (b) and FIG. 6 (b), and the number of updates of each point after B0 and B1 is shown in FIG. 4 (c) and FIG. 6 (c); the boxed grid points are the center points of the subsequent blocks, i.e. the third blocks B2. Each point in B1 is then assigned to the B2 block whose center point is nearest. The conversion from B1 to B2 is shown by the dotted lines in FIG. 3 (d) and FIG. 5 (d).

In the third stage, the data space is densely paved with squares or diamonds B2, as shown in FIG. 3 (e) and FIG. 5 (e), and the blocks are then updated using the maximum update strategy. The number of updates that B2 applies to each point of the block in the iteration space is shown in FIG. 4 (d) and FIG. 6 (d), and the number of updates of each point after B0, B1 and B2 is shown in FIG. 4 (e) and FIG. 6 (e): all points have been updated 4 times (t = 4), and the update of one time slice is complete.
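For a 2D box-type Stencil, the three stages can be sketched in Python as follows. This is an illustration under several assumptions that are not part of the invention: periodic boundaries, a grid side that is a multiple of 2b, square blocks B0 of half-width b, a made-up integer 9-point kernel, and per-stage loop bounds derived by hand from the maximum update rule for this particular setup. All time levels are kept in memory so that reading a value that has not been computed yet fails immediately. The blocks of one stage are visited sequentially here, although they do not depend on one another.

```python
from itertools import product

MOD = 1_000_003  # keeps the toy integer kernel bounded


def kernel(prev, x, y, n):
    """Toy 9-point box update with periodic wrap-around (assumed kernel)."""
    return sum(prev[(x + dx) % n][(y + dy) % n]
               for dx in (-1, 0, 1) for dy in (-1, 0, 1)) % MOD


def naive(grid, b):
    """Reference result: b full sweeps over the whole grid."""
    n = len(grid)
    for _ in range(b):
        grid = [[kernel(grid, x, y, n) for y in range(n)] for x in range(n)]
    return grid


def tiled_time_slice(grid, b):
    """Advance every grid point by b steps using three tiling stages."""
    n = len(grid)
    assert n % (2 * b) == 0, "grid side must be a multiple of 2b"
    # levels[tau][x][y] is the value after tau updates, None while unknown.
    levels = [grid] + [[[None] * n for _ in range(n)] for _ in range(b)]
    # In each dimension a block is either "upright" (shrinks with time and is
    # centered in the middle of a B0 square) or "inverted" (grows with time
    # and is centered on a B0 boundary).
    centers = {False: range(b, n, 2 * b), True: range(0, n, 2 * b)}
    for stage in range(3):  # stage 0: B0, stage 1: B1, stage 2: B2
        for inverted in product((False, True), repeat=2):
            if sum(inverted) != stage:
                continue
            for cx, cy in product(centers[inverted[0]], centers[inverted[1]]):
                for tau in range(1, b + 1):
                    rx = tau - 1 if inverted[0] else b - tau
                    ry = tau - 1 if inverted[1] else b - tau
                    for dx, dy in product(range(-rx, rx + 1), range(-ry, ry + 1)):
                        x, y = (cx + dx) % n, (cy + dy) % n
                        assert levels[tau][x][y] is None  # no redundant work
                        levels[tau][x][y] = kernel(levels[tau - 1], x, y, n)
    return levels[b]


start = [[(3 * x + 5 * y * y + 1) % 17 for y in range(12)] for x in range(12)]
result = tiled_time_slice(start, 3)
reference = naive(start, 3)
```

In this sketch the stage-0 and stage-2 blocks touch square sets of grid points and the stage-1 blocks touch diamond-shaped sets, which matches the block shapes named above for the box type; `result` and `reference` hold the same values, and every value at every time level is computed exactly once.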

It should be noted that, because the blocks are densely paved, adjacent blocks share one row or one column of data on their boundary. Each grid point of the data space inside a block is labeled with the number of updates the block applies to it along the time dimension. The box-type dense paving algorithm can be extended to a d-dimensional data space, and the transformation from Bi to Bi+1 can be expressed simply as: connect the center point of Bi to the (d-2)-dimensional boundaries of all the (d-1)-dimensional boundaries of Bi, as shown by the dotted lines in FIGS. 3 and 5, and then remove the original boundaries of Bi (solid lines in FIGS. 3 and 5).

A d-dimensional Stencil has d+1 dense paving stages. After the data space has been densely paved with the d+1 kinds of blocks, the (d+1)-dimensional blocks obtained by extending these blocks along the time dimension form a time slice in which all grid points have been updated b steps, so that the iteration space is densely paved, for example by a series of blocks such as d-dimensional hypercubes. Here a d-dimensional Stencil means that the grid points computed by the Stencil are points in a d-dimensional data space; in two dimensions, for example, the value of the grid point with coordinates (x, y) at time step t is denoted A[t][x][y].

As shown in the second row of FIG. 7, the ordinary update of a B1 block proceeds from the starting dimension (main diagonal) to the ending dimension (minor diagonal), and B1 is decomposed by time step into 4 matrices, each of which represents the grid point positions to be updated in the data space at one time step. Each step, superimposed on B0, gives the result indicated by the arrow "→" in the first row. The star-type dense paving algorithm can update the same grid point twice in succession; this is the double updating technique, which the invention introduces in the update of B1. The grid points at one time step are updated along the same time dimension, so the order in which they are updated within a time step can be arbitrary. The middle two matrices of the update steps in the second row of FIG. 7 can therefore be decomposed into the form shown in FIG. 8 and executed in the decomposed order. The matrices so decomposed can then be merged with the leftmost and rightmost matrices of the update steps in the second row of FIG. 7, respectively, to form the third row of FIG. 7, in which each point in a matrix is updated 2 times.

The following is a system embodiment corresponding to the above method embodiment, and the two can be implemented in cooperation. The technical details mentioned in the above embodiment remain valid in this system embodiment and are not repeated here; conversely, the technical details mentioned in this system embodiment also apply to the above embodiment.

The space-based tiling template computing system, wherein the second space partitioning module further comprises: updating the second type of block by adopting a maximum updating strategy; the update module further comprises: and updating the third block by adopting a maximum updating strategy.

Although the present invention has been described in terms of the above embodiments, the embodiments are merely illustrative, and not restrictive, and various changes and modifications may be made by those skilled in the art without departing from the spirit and scope of the invention, and the scope of the invention is defined by the appended claims.

Claims

1. A template calculation method based on spatial dense paving is characterized by comprising the following steps:

step 1, obtaining an array containing data to be updated, taking the array as a data space to be updated, wherein the storage position of the data to be updated in the data space is called a grid point, densely paving the data space by adopting a box-type or star-type calculation template to divide the data space into a plurality of first-type blocks, updating the first-type blocks into natural blocks, wherein the natural blocks are a minimum grid point set required by updating the first-type blocks for t times, and t is the target updating times of all grid points in the data space;

step 4, updating the third blocks to finish updating the data space.

2. The spatial-tiling-based template computation method of claim 1, wherein the array is a two-dimensional or multi-dimensional array.

3. The spatial-tiling-based template calculation method according to claim 1, wherein the step 1 further comprises: and when the calculation template is in a box shape, the data space is densely paved by adopting a square block shape, and when the calculation template is in a star shape, the data space is densely paved by adopting a diamond block shape.

4. The spatial-tiling-based template calculation method according to claim 1, wherein the step 3 further comprises: updating the second type of block by adopting a maximum updating strategy; the step 4 further comprises: updating the third block by adopting a maximum updating strategy;

wherein the maximum update strategy is:

5. A spatial tiling-based template computing system, comprising:

the space densely paving module is used for acquiring an array containing data to be updated, taking the array as a data space to be updated, wherein the storage position of the data to be updated in the data space is called a grid point, densely paving the data space by adopting a box-type or star-type calculation template so as to divide the data space into a plurality of first type blocks, updating the first type blocks into natural blocks, wherein the natural blocks are a minimum grid point set required by updating the first type blocks for t times, and t is the target updating times of all grid points in the data space;

6. The spatial-tiling-based template computing system of claim 5, wherein the array is a two-dimensional or multi-dimensional array.

7. The spatial-tiling-based template computing system of claim 5, wherein the spatial tiling module further comprises: and when the calculation template is in a box shape, the data space is densely paved by adopting a square block shape, and when the calculation template is in a star shape, the data space is densely paved by adopting a diamond block shape.

8. The space-based tiling template computing system of claim 5, wherein the second space partitioning module further comprises: updating the second type of block by adopting a maximum updating strategy; the update module further comprises: updating the third block by adopting a maximum updating strategy;

wherein the maximum update strategy is: