CN102096744A

CN102096744A - Irregular iteration parallelization method

Info

Publication number: CN102096744A
Application number: CN2011100537959A
Authority: CN
Inventors: 张纪林; 徐向华; 万健; 蒋从锋; 张伟; 任永坚
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2011-03-07
Filing date: 2011-03-07
Publication date: 2011-06-15

Abstract

The invention relates to a parallelization method for irregular iteration. In the initialization stage, the data access pattern is analyzed and data blocks and sub-blocks are generated according to a generation strategy, which improves the data locality and parallelism of the irregular iterative computation. In the execution stage, the newly generated scheduling strategy and the transformed code are executed, which further improves data locality and parallelism. During actual execution, an automatic performance tuner for the irregular iterative computation is constructed; it finds the parameter combination with the best efficiency by exhaustive search and fixes those parameter values, so that the method runs with optimal efficiency on the given system architecture. The method offers good parallel efficiency and scalability.

Description

An irregular iteration parallel method

Technical field

The invention belongs to the field of numerical reservoir simulation and relates to an irregular iteration parallel method.

Background art

In numerical reservoir simulation, finite element analysis discretizes the physical domain into an irregular grid, and solving for variables such as the pressure in each grid cell ultimately reduces to efficiently solving a large sparse linear system with an iterative algorithm such as successive over-relaxation (SOR) or Gauss-Seidel (GS). The pressure equation in numerical reservoir simulation is written $Ax = b$, where $x$ is the vector of pressure unknowns, $A$ is the large sparse coefficient matrix, and $b$ is the constant vector. In this field, the nonzero elements of $A$ usually occupy only a small fraction of the full matrix, so compressed row storage effectively reduces both storage space and computation time. In practice, however, retrieving the value of a nonzero element then requires indirect references between arrays. Such indirect references are one form of irregular iterative computation: they make it hard for the compiler to reason about program behavior, prevent it from recognizing the parallelism of the irregular iterative computation, and make the locality of the irregular computation hard to optimize. Improving the data locality and parallelism of irregular iterative computation is therefore the key to improving its performance.
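The indirect referencing described above can be illustrated with a minimal Python sketch of one Gauss-Seidel sweep over a matrix held in compressed row storage. The array names `row_ptr`, `col_idx`, and `values` and the 3x3 example system are assumptions made for illustration; they do not come from the patent text.

```python
def gauss_seidel_sweep(row_ptr, col_idx, values, b, x):
    """One in-place Gauss-Seidel sweep over a matrix in compressed row storage."""
    for i in range(len(b)):
        acc = b[i]
        diag = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]  # indirect reference: the column is only known at run time
            if j == i:
                diag = values[k]
            else:
                acc -= values[k] * x[j]
        x[i] = acc / diag
    return x


# Illustrative system [[4,-1,0],[-1,4,-1],[0,-1,4]] x = [3,2,3], whose solution is [1,1,1].
row_ptr = [0, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 2]
values = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
b = [3.0, 2.0, 3.0]
x = [0.0, 0.0, 0.0]
for _ in range(30):
    gauss_seidel_sweep(row_ptr, col_idx, values, b, x)
```

Because `x[j]` is addressed through `col_idx[k]`, the access pattern depends on the matrix contents rather than on the loop indices, which is what keeps a compiler from analyzing the loop statically.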

Parallel models for irregular iterative computation include the PW (Post/Wait) model, the SE (Speculative Executor) model, the IE (Inspector/Executor) model, and the EIE (Extend Inspector/Executor) model, of which the IE model is the core model. The irregular data access pattern caused by compressed storage of the large sparse coefficient matrix $A$ reduces both the spatial and the temporal locality of the data. Temporal locality can be improved with data reuse techniques, and spatial locality with vertex reordering. In addition, traditional loop tiling is an important optimization for improving the parallelism and data locality of iterative methods: it greatly improves data locality, and reordering the tiles can increase parallelism and reduce communication overhead. Traditional tiling research, however, mostly targets data traversal over regular tiles. In irregular iterative computation, the array subscripts of the sparse coefficient matrix $A$ cannot be determined at compile time, so earlier approaches do not apply to this class of problems.

Summary of the invention

The present invention proposes a parallel method for irregular iterative computation that targets distributed clusters and has an automatic tuning capability. It improves the parallel execution performance of irregular iterative computation by applying a runtime parallel interleaved tiling strategy.

The technical solution adopted by the invention is as follows.

In the initialization stage, the invention analyzes the data access pattern and generates data blocks and sub-blocks according to a generation strategy, improving the data locality and parallelism of the irregular iterative computation. In the execution stage, it executes the newly generated scheduling strategy and the transformed code, further improving data locality and parallelism. During actual execution, it constructs an automatic performance tuner for the irregular iterative computation method, which finds the parameter combination with the best efficiency by exhaustive search and fixes those parameter values, so that the method runs with optimal efficiency on the given architecture. The specific steps are:

1. Define the initialization matrix: describe the structurally symmetric coefficient matrix $A$ by an adjacency graph $G = (V, E)$, where $v_i \in V$ denotes a vertex, $a_{ij}$ denotes an element of the coefficient matrix $A$, and $\langle v_i, v_j \rangle \in E$ denotes an edge of the adjacency graph $G$.

2. Primary partition of the coefficient matrix $A$: the graph is partitioned with the K-way method of the graph partitioning library Metis, so that closely associated vertices are placed in the same subgraph. The number of data blocks $p$ is determined by the number of compute nodes in the distributed cluster. The primary partition establishes the following mapping between each vertex $v_i$ and the data blocks it produces: $v_i \mapsto \mathrm{block}_k$, where $\mathrm{block}_k$ denotes the $k$-th data block after graph partitioning and $p$ is the number of compute nodes in the distributed cluster.

The partial-order constraints of the vertices are stored in the NodeDepence data structure, defined as:

$\mathrm{NodeDepence} \mathrel{+}= \{\langle v_i, v_j \rangle \mid (\mathrm{Tile}(0, v_i) < \mathrm{Tile}(0, v_j)) \cap (\langle v_i, v_j \rangle \in E \cup \langle v_j, v_i \rangle \in E)\}$

where $\mathrm{Tile}(0, v_i)$ is the number of the data block containing vertex $v_i$ at iteration 0, and $\mathrm{Tile}(0, v_j)$ is the number of the data block containing vertex $v_j$ at iteration 0.

3. Secondary partition of the coefficient matrix $A$: the primary partition produces $p$ data blocks, and the secondary partition applies the K-way method of the Metis graph partitioning library again to each data block $\mathrm{block}_k$ produced by the primary partition. The sub-block size parameter is determined by the capacity of the system's memory unit, and each data block is divided into $m$ sub-blocks (sub-block-k). The partial-order constraints of the vertices are stored in the Sub_NodeDepence data structure, defined as:

$\mathrm{Sub\_NodeDepence}[m] \mathrel{+}= \{\langle v_i, v_j \rangle \mid (\mathrm{Tile}(0, v_i) < \mathrm{Tile}(0, v_j)) \cap (\langle v_i, v_j \rangle \in E \cup \langle v_j, v_i \rangle \in E)\}$

4. Apply inner-iteration boundary correction to the sub-blocks, as follows.

For each vertex, the number of the data block containing vertex $v_i$ at iteration iter is updated with the formulas below. The inner iteration count here equals the number of boundary corrections.

$\mathrm{Tile}(\mathrm{iter}, v_i) = \mathrm{MAX}(\mathrm{Tile}(\mathrm{iter}, v_i), \mathrm{Tile}(\mathrm{iter}-1, v_j))$

$\mathrm{Tile}(\mathrm{iter}, v_j) = \mathrm{MAX}(\mathrm{Tile}(\mathrm{iter}, v_i), \mathrm{Tile}(\mathrm{iter}, v_j))$

5. Vertex reordering: the unknowns $x$ of the linear system $Ax = b$ are reordered and mapped to new unknowns $x'$. The order of $x$ is rearranged subject to the partial order of the vertices: within a data block, vertices are arranged according to the partial order; across data blocks, vertices are arranged in data-block order. This produces the new unknown vector $x'$.

6. Sub-block reordering: the sub-blocks on each compute node of the distributed cluster are reordered with the following strategy.

(1) Within compute node 0, sub-blocks that have no dependence on other nodes are moved to the front of the sub-block order; the remaining sub-blocks keep their original order in the execution sequence.

(2) Within a compute node q other than node 0, sub-blocks that depend on a compute node x (x < q) are moved to the front of the sub-block order; the remaining sub-blocks keep their original order in the execution sequence.

7. Take the iterative computation of one sub-block as a sample for benchmarking. By exhaustive search, set the sub-block size parameter to multiples of the level-1 cache size of a compute node in the distributed cluster and the inner iteration count to 3 to 10, run the iterative computation for each setting, and select the optimal combination, that is, the data block size parameter and inner iteration count with the shortest iterative computation time.

8. Execute the irregular iteration parallel method: according to the reordered vertex order, the row order and column order of matrix $A$ are changed to match the vertex order one to one. The iterative computation based on data blocks and sub-blocks is then carried out as follows.

The data blocks produced by graph partitioning are distributed to the compute nodes, and a three-level loop is executed. The outer loop is the convergence iteration and traverses the data blocks. The middle loop traverses the sub-blocks of each data block in turn; after the middle loop has been traversed, boundary data are obtained and sent. The inner loop is then executed, performing the iterative computation on the vertices contained in the sub-block.
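The three-level loop of step 8 can be sketched in a single process as follows. This is only an illustration under stated assumptions: Gauss-Seidel is used as the vertex update, the matrix is held in compressed row storage with the hypothetical array names `row_ptr`, `col_idx`, and `values`, the block layout is made up, and the exchange of boundary data between compute nodes is omitted entirely.

```python
def tiled_gauss_seidel(row_ptr, col_idx, values, b, x, blocks, inner_iters, outer_iters):
    """blocks is a list of data blocks; each data block is a list of sub-blocks;
    each sub-block is a list of vertex (row) indices.
    Single-process sketch: no boundary data are sent or received."""
    for _outer in range(outer_iters):            # outer loop: convergence iterations
        for block in blocks:                     # traverse the data blocks
            for sub_block in block:              # middle loop: sub-blocks in order
                for _inner in range(inner_iters):  # repeated updates of one sub-block
                    for i in sub_block:          # inner loop: vertices of the sub-block
                        acc, diag = b[i], 0.0
                        for k in range(row_ptr[i], row_ptr[i + 1]):
                            j = col_idx[k]
                            if j == i:
                                diag = values[k]
                            else:
                                acc -= values[k] * x[j]
                        x[i] = acc / diag
    return x


# Illustrative 4x4 tridiagonal system (diagonal 4, off-diagonal -1); solution is [1,1,1,1].
row_ptr = [0, 2, 5, 8, 10]
col_idx = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
values = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
b = [3.0, 2.0, 2.0, 3.0]
blocks = [[[0, 1]], [[2], [3]]]  # hypothetical layout: 2 data blocks, 3 sub-blocks
x = tiled_gauss_seidel(row_ptr, col_idx, values, b, [0.0] * 4, blocks,
                       inner_iters=3, outer_iters=30)
```

Running several inner iterations on one sub-block before moving on is what keeps its data in cache; in the distributed setting each compute node would run this loop on its own data blocks.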

The invention is characterized as follows. The irregular tiled iteration parallel method described here targets distributed cluster systems, so communication optimization must be considered along with locality optimization. Graph partitioning is applied twice; the initial graph partition ensures load balance across the processor nodes, so that communication and synchronization overhead are reduced while the locality of the irregular iterative computation is optimized. In addition, the method includes an automatic tuner that finds the best-performing parameter combination.

The proposed irregular iteration parallel method has good parallel efficiency and scalability. In addition, an automatic tuner is designed for parameters such as the sub-block size and the inner iteration count; it selects and fixes the best parameters for each architecture, so that later invocations of the method run with optimal efficiency.
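The exhaustive search performed by the automatic tuner can be sketched as below. The function name `autotune`, its parameters, and the injectable `timer` are hypothetical names chosen for this illustration; `run_sample` stands for the sample sub-block computation, which the patent does not specify in code.

```python
import itertools
import time


def autotune(run_sample, l1_cache_bytes, size_multiples,
             inner_iter_range=range(3, 11), timer=time.perf_counter):
    """Try every (block size, inner iteration count) pair and keep the fastest.

    Block sizes are multiples of the L1 cache size; inner iteration counts
    run from 3 to 10, as described in the text.
    """
    best = None
    for mult, inner in itertools.product(size_multiples, inner_iter_range):
        block_size = mult * l1_cache_bytes
        start = timer()
        run_sample(block_size, inner)      # sample sub-block iterative computation
        elapsed = timer() - start
        if best is None or elapsed < best[0]:
            best = (elapsed, block_size, inner)
    return best[1], best[2]                # parameters to fix for later runs
```

A caller would pass its own sample routine, for example `autotune(my_sample, 32 * 1024, [1, 2, 4])`, and store the returned pair for later invocations on the same architecture. A single timing per setting is noisy, so a real tuner would probably repeat each measurement.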

Description of drawings

Fig. 1 is a flowchart of the irregular iteration parallel method with automatic performance tuning.

Fig. 2 is a schematic diagram of the initialization of a structurally symmetric matrix.

Fig. 3 shows the graph partition of the matrix graph in Fig. 2.

Fig. 4 is a schematic diagram of the secondary partition of the matrix graph.

Fig. 5 is a schematic diagram of the sub-block boundary correction process.

Fig. 6 shows the serial execution of the irregular iterative computation method.

Fig. 7 is a top view of the sub-blocks in the irregular iteration space after the sub-block order has been updated.

Embodiment:

The method is described in further detail below with reference to the accompanying drawings.

In the initialization stage, the method analyzes the data access pattern and generates data blocks and sub-blocks according to a generation strategy, improving the data locality and parallelism of the irregular iterative computation. In the execution stage, it executes the newly generated scheduling strategy and the transformed code, further improving data locality and parallelism. During actual execution, it constructs an automatic performance tuner for the irregular iterative computation method, which finds the parameter combination with the best efficiency by exhaustive search and fixes those parameter values, so that the method runs with optimal efficiency on the given architecture. Fig. 1 is the flowchart of the method.

Initialization matrix: as shown in Fig. 2, the structurally symmetric coefficient matrix $A$ is described by an adjacency graph $G = (V, E)$, where $v_i \in V$ denotes a vertex, $a_{ij}$ denotes an element of the coefficient matrix $A$, and $\langle v_i, v_j \rangle \in E$ denotes an edge of the adjacency graph $G$.
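As a rough illustration of this initialization, the sketch below builds the adjacency graph from the sparsity pattern of a matrix in compressed row storage. It assumes the usual convention that an off-diagonal nonzero $a_{ij}$ gives an edge between $v_i$ and $v_j$; the function names and the example pattern are hypothetical.

```python
def adjacency_from_csr(row_ptr, col_idx):
    """Adjacency lists of the graph of a structurally symmetric sparse matrix.

    Assumption: every off-diagonal nonzero a_ij contributes the edge <v_i, v_j>.
    """
    n = len(row_ptr) - 1
    adj = [[] for _ in range(n)]
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j != i:          # the diagonal does not produce an edge
                adj[i].append(j)
    return adj


def undirected_edges(adj):
    """Each edge once, as a sorted list of (smaller, larger) vertex pairs."""
    return sorted({(min(i, j), max(i, j)) for i, nbrs in enumerate(adj) for j in nbrs})


# Sparsity pattern of a 4x4 tridiagonal matrix.
row_ptr = [0, 2, 5, 8, 10]
col_idx = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
adj = adjacency_from_csr(row_ptr, col_idx)
edges = undirected_edges(adj)
```

The adjacency lists are the form that graph partitioning libraries typically take as input, which is why the matrix is converted before the partition step.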

The graph is partitioned with the K-way method of the graph partitioning library Metis, so that closely associated vertices are placed in the same subgraph. The number-of-data-blocks parameter $p$ is determined by the number of compute nodes in the distributed cluster. The partial-order constraints of the vertices are stored in the NodeDepence data structure. The primary partition establishes the following mapping between each vertex $v_i$ and the data blocks it produces: $v_i \mapsto \mathrm{block}_k$, where $\mathrm{block}_k$ denotes the $k$-th data block after graph partitioning and $p$ is the number of compute nodes in the distributed cluster. Fig. 3 shows the result of the first partition of the matrix graph from step 1.
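Given a partition result, the NodeDepence set can be assembled as in the following sketch. The partition itself is not computed here: `tile0` is a hypothetical membership array of the kind a K-way partitioner returns (vertex to data block number), and `adj` is a made-up symmetric adjacency list.

```python
def build_node_depence(adj, tile0):
    """Collect pairs <v_i, v_j> with Tile(0, v_i) < Tile(0, v_j) that share an edge.

    adj   : symmetric adjacency lists, so an edge in either direction is seen.
    tile0 : tile0[v] is the data block number of vertex v at iteration 0.
    """
    deps = set()
    for vi, nbrs in enumerate(adj):
        for vj in nbrs:
            if tile0[vi] < tile0[vj]:
                deps.add((vi, vj))
    return sorted(deps)


# Path graph 0-1-2-3 split into two data blocks (hypothetical partition result).
adj = [[1], [0, 2], [1, 3], [2]]
tile0 = [0, 0, 1, 1]
node_depence = build_node_depence(adj, tile0)
```

Only edges that cross a block boundary end up in the set, so its size is a direct measure of how much ordering and communication the partition imposes.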

The primary partition produces $p$ data blocks (block). The graph partitioning library Metis is then called again to partition each data block a second time, dividing each into $m$ sub-blocks (sub-block). The secondary partition is thus simply a further graph partition of each data block. The sub-block size is determined by the capacity of the system's memory unit (for example, the L1 cache of a compute node). Fig. 4 shows the result of the secondary partition of the adjacency graph corresponding to the sparse coefficient matrix: block1 and block2 are each divided into two sub-blocks, namely sub-block1-1, sub-block1-2, sub-block2-1, and sub-block2-2.

In a traditional iterative algorithm, a compute node must traverse and update all vertices to complete one iteration, so data locality is poor when the data volume grows and indirect references are present. For this reason, serial iteration is performed on each sub-block, with the boundary corrected at every inner iteration. By dividing the data block into sub-blocks along the time axis in each iteration, the same data are updated over several successive iteration steps, which improves data locality within the sub-block without changing the properties of the serial iterative algorithm.

Boundary correction is applied to the sub-blocks using the time-skewing technique: the boundary of each sub-block is corrected at every iteration, and the curved boundaries in Fig. 5 show the corrected boundaries. A directed graph is defined to store the relations between adjacent sub-blocks. If vertex $v_i$ in one sub-block is connected across the boundary to vertex $v_j$ in another sub-block and $v_i < v_j$, then $\langle v_i, v_j \rangle \in E$. A function of $(v_i, v_j, k)$ is defined as the boundary data that belong to data block $v_i$ in the $k$-th iteration and are adjacent to data block $v_j$.

The data block boundary correction algorithm of the irregular iteration technique applies inner-iteration boundary correction to the sub-blocks. For each vertex, the numbers of the data blocks containing vertices $v_i$ and $v_j$ at iteration iter are updated with the formulas below. The inner iteration count here equals the number of boundary corrections.

$\mathrm{Tile}(\mathrm{iter}, v_i) = \mathrm{MAX}(\mathrm{Tile}(\mathrm{iter}, v_i), \mathrm{Tile}(\mathrm{iter}-1, v_j))$

$\mathrm{Tile}(\mathrm{iter}, v_j) = \mathrm{MAX}(\mathrm{Tile}(\mathrm{iter}, v_i), \mathrm{Tile}(\mathrm{iter}, v_j))$
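A literal reading of the two update formulas can be sketched as follows. Several points are assumptions, since the text does not state them: the assignment at each iteration starts as a copy of the previous one, the updates are applied once per dependence pair in the order given, and the function and variable names are made up.

```python
def correct_boundaries(tile0, deps, inner_iters):
    """Apply the Tile update formulas for iterations 1..inner_iters.

    tile0 : block number of every vertex at iteration 0.
    deps  : ordered pairs (v_i, v_j), e.g. the stored partial-order constraints.
    Returns tiles, where tiles[it][v] is Tile(it, v).
    """
    tiles = [list(tile0)]
    for it in range(1, inner_iters + 1):
        prev = tiles[it - 1]
        cur = list(prev)  # assumption: start from the previous iteration's assignment
        for vi, vj in deps:
            cur[vi] = max(cur[vi], prev[vj])  # Tile(iter, v_i)
            cur[vj] = max(cur[vi], cur[vj])   # Tile(iter, v_j)
        tiles.append(cur)
    return tiles


# Path graph 0-1-2-3, blocks [0, 0, 1, 1], one cross-block dependence <1, 2>.
tiles = correct_boundaries([0, 0, 1, 1], [(1, 2)], inner_iters=2)
```

In this small example vertex 1 moves from block 0 to block 1 at the first inner iteration, which is the kind of boundary shift the curved lines in Fig. 5 depict; the result depends on the order in which pairs are processed.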

The unknowns $x$ of the linear system $Ax = b$ are reordered and mapped to new unknowns $x'$. The order of $x$ is rearranged subject to the partial order of the vertices, by the following rule: within a data block, vertices are arranged according to the partial order; across data blocks, vertices are arranged in data-block order. This produces the new unknown vector $x'$.
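The reordering rule can be sketched as a sort followed by a symmetric permutation of the system. The sketch assumes that a per-vertex rank consistent with the partial order is already available (by default the original index is used), and it works on a small dense matrix for readability; all names are hypothetical.

```python
def reorder_vertices(block_of, rank_in_block=None):
    """New vertex order: by data block first, then by rank inside the block.

    rank_in_block is assumed to be consistent with the vertex partial order;
    if omitted, the original vertex index is used as the rank.
    """
    n = len(block_of)
    if rank_in_block is None:
        rank_in_block = list(range(n))
    return sorted(range(n), key=lambda v: (block_of[v], rank_in_block[v]))


def permute_system(A, b, order):
    """Apply the same permutation to the rows and columns of A and to b."""
    A_new = [[A[i][j] for j in order] for i in order]
    b_new = [b[i] for i in order]
    return A_new, b_new


block_of = [1, 0, 1, 0]                      # hypothetical partition result
order = reorder_vertices(block_of)           # vertices of block 0 come first
A = [[10 * i + j for j in range(4)] for i in range(4)]
b = [0, 1, 2, 3]
A_new, b_new = permute_system(A, b, order)
```

After the permutation the vertices of each data block occupy a contiguous index range, so a compute node can address its block as one slice of the unknown vector.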

The sub-blocks on each compute node of the distributed cluster are reordered with the sub-block reordering strategy, which is as follows. Within compute node 0, sub-blocks that have no dependence on other nodes are moved to the front of the sub-block order; the remaining sub-blocks keep their original order in the execution sequence.

Within a compute node q other than node 0, sub-blocks that depend on a compute node x (x < q) are moved to the front of the sub-block order; the remaining sub-blocks keep their original order in the execution sequence.

The serial execution of the irregular iterative computation method is shown in Fig. 6: the sub-blocks differ in the shape and number of vertices they contain, and the lines between blocks denote data dependences. Sub-blocks 1 to 5 result from partitioning the first data block and execute on the first compute node; sub-blocks 6 to 10 result from partitioning the second data block and execute on the second compute node. As Fig. 6 shows, the sub-blocks depend on one another in sequence, so the iteration over the sub-block data must run serially. The data in sub-block 1 are computed first; after the vertices in sub-block 1 have gone through T inner iterations, sub-block 2 reads the values of the boundary vertices it is connected to in sub-block 1 and performs its own iterative computation, and so on until the iterative computation of sub-block 10 finishes.

To execute the irregular iterative computation of the sub-blocks in parallel while preserving their partial order, the sub-block order must be rearranged. The sub-blocks on each compute node of the distributed cluster are reordered with the sub-block reordering strategy, which is as follows. Within compute node 0, sub-blocks that have no dependence on other nodes are moved to the front of the sub-block order; the remaining sub-blocks keep their original order in the execution sequence. Within a compute node q other than node 0, sub-blocks that depend on a compute node x (x < q) are moved to the front of the sub-block order; the remaining sub-blocks keep their original order in the execution sequence.
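The reordering strategy can be written down as a short function. The representation is an assumption made for this sketch: `depends_on` maps a sub-block to the set of compute nodes it depends on, and the sub-block numbers in the example are hypothetical rather than those of Fig. 6 or Fig. 7.

```python
def reorder_subblocks(node_id, subblocks, depends_on):
    """Reorder the sub-blocks of one compute node.

    depends_on maps a sub-block to the set of compute nodes it depends on.
    Node 0: sub-blocks with no dependence on other nodes go first.
    Node q > 0: sub-blocks depending on some node x < q go first.
    The remaining sub-blocks keep their original relative order.
    """
    if node_id == 0:
        front = [s for s in subblocks
                 if not (depends_on.get(s, set()) - {node_id})]
    else:
        front = [s for s in subblocks
                 if any(x < node_id for x in depends_on.get(s, set()))]
    rest = [s for s in subblocks if s not in front]
    return front + rest


# Hypothetical example with two compute nodes.
order_node0 = reorder_subblocks(0, [1, 2, 3, 4, 5], {2: {1}})
order_node1 = reorder_subblocks(1, [6, 7, 8, 9, 10], {9: {0}, 7: {2}})
```

On node 0 the independent sub-blocks can start immediately, while on later nodes the sub-blocks that wait on earlier nodes are scheduled first so that their boundary exchange overlaps with the rest of the work.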

Fig. 7 shows the order of the sub-blocks after rearrangement in the irregular iterative computation: after reordering, once sub-block 2 and sub-block 9 have obtained and sent their boundary data, sub-blocks 1, 3, 5, 7, 9 and sub-blocks 2, 4, 6, 8, 10 can execute the iterative computation in parallel on the two compute nodes.

The irregular iteration parallel method proposed by the invention is a parallel method with automatic tuning that targets distributed clusters. It improves the parallel execution performance of irregular iteration by applying a runtime data block boundary correction strategy. An automatic performance tuner for the irregular iterative computation method is constructed; it finds the parameter combination with the best efficiency by exhaustive search and fixes those parameter values, so that the irregular iteration parallel method runs with optimal efficiency.

Claims

1. An irregular iteration parallel method, characterized in that the method comprises the following steps:

Step 1. Define the initialization matrix: describe the structurally symmetric coefficient matrix $A$ by an adjacency graph $G = (V, E)$, where $v_i \in V$ denotes a vertex, $a_{ij}$ denotes an element of the coefficient matrix $A$, and $\langle v_i, v_j \rangle \in E$ denotes an edge of the adjacency graph $G$;

Step 2. Primary partition of the coefficient matrix $A$: the graph is partitioned with the K-way method of the graph partitioning library Metis, so that closely associated vertices are placed in the same subgraph; the number of data blocks $p$ is determined by the number of compute nodes in the distributed cluster; the primary partition establishes the following mapping between each vertex $v_i$ and the data blocks it produces: $v_i \mapsto \mathrm{block}_k$, where $\mathrm{block}_k$ denotes the $k$-th data block after graph partitioning and $p$ is the number of compute nodes in the distributed cluster;

the partial-order constraints of the vertices are stored in the NodeDepence data structure, defined as:

$\mathrm{NodeDepence} \mathrel{+}= \{\langle v_i, v_j \rangle \mid (\mathrm{Tile}(0, v_i) < \mathrm{Tile}(0, v_j)) \cap (\langle v_i, v_j \rangle \in E \cup \langle v_j, v_i \rangle \in E)\}$

where $\mathrm{Tile}(0, v_i)$ is the number of the data block containing vertex $v_i$ at iteration 0, and $\mathrm{Tile}(0, v_j)$ is the number of the data block containing vertex $v_j$ at iteration 0;

Step 3. Secondary partition of the coefficient matrix $A$: each data block $\mathrm{block}_k$ produced by the primary partition is partitioned again with the K-way method of the Metis graph partitioning library; the sub-block size parameter is determined by the capacity of the system's memory unit, and each data block is divided into $m$ sub-blocks; the partial-order constraints of the vertices are stored in the Sub_NodeDepence data structure, defined as:

$\mathrm{Sub\_NodeDepence}[m] \mathrel{+}= \{\langle v_i, v_j \rangle \mid (\mathrm{Tile}(0, v_i) < \mathrm{Tile}(0, v_j)) \cap (\langle v_i, v_j \rangle \in E \cup \langle v_j, v_i \rangle \in E)\}$

Step 4. Apply inner-iteration boundary correction to the sub-blocks, as follows:

for each vertex, the number of the data block containing vertex $v_i$ at iteration iter is updated with the formulas below, where the inner iteration count equals the number of boundary corrections;

$\mathrm{Tile}(\mathrm{iter}, v_i) = \mathrm{MAX}(\mathrm{Tile}(\mathrm{iter}, v_i), \mathrm{Tile}(\mathrm{iter}-1, v_j))$

$\mathrm{Tile}(\mathrm{iter}, v_j) = \mathrm{MAX}(\mathrm{Tile}(\mathrm{iter}, v_i), \mathrm{Tile}(\mathrm{iter}, v_j))$

Step 5. Vertex reordering: the unknowns $x$ of the linear system $Ax = b$ are reordered and mapped to new unknowns $x'$; the order of $x$ is rearranged subject to the partial order of the vertices: within a data block, vertices are arranged according to the partial order, and across data blocks, vertices are arranged in data-block order, producing the new unknown vector $x'$;

Step 6. Sub-block reordering: the sub-blocks on each compute node of the distributed cluster are reordered with the sub-block reordering strategy, which is as follows:

A. within compute node 0, sub-blocks that have no dependence on other nodes are moved to the front of the sub-block order, and the remaining sub-blocks keep their original order in the execution sequence;

B. within a compute node q other than node 0, sub-blocks that depend on a compute node x, where x < q, are moved to the front of the sub-block order, and the remaining sub-blocks keep their original order in the execution sequence;

Step 7. Take the iterative computation of one sub-block as a sample for benchmarking; by exhaustive search, set the sub-block size parameter to multiples of the level-1 cache size of a compute node in the distributed cluster and the inner iteration count to 3 to 10, run the iterative computation for each setting, and select the data block size parameter and inner iteration count with the shortest iterative computation time;

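Step 8. Execute the irregular iteration parallel method: according to the reordered vertex order, the row order and column order of matrix $A$ are changed to match the vertex order one to one; the iterative computation based on data blocks and sub-blocks is then carried out as follows:

the data blocks produced by graph partitioning are distributed to the compute nodes, and a three-level loop is executed; the outer loop is the convergence iteration and traverses the data blocks; the middle loop traverses the sub-blocks of each data block in turn; after the middle loop has been traversed, boundary data are obtained and sent; the inner loop is then executed, performing the iterative computation on the vertices contained in the sub-block.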