CN101944014B

CN101944014B - Method for realizing automatic pipeline parallelism

Info

Publication number: CN101944014B
Application number: CN 201010281797
Authority: CN
Inventors: 杨克峤; 李弋; 臧斌宇
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2010-09-15
Filing date: 2010-09-15
Publication date: 2013-08-21
Anticipated expiration: 2030-09-15
Also published as: CN101944014A

Abstract

The invention belongs to the technical field of program compilation and relates in particular to a method for realizing automatic pipeline parallelism. The method mainly comprises the following steps: (1) identification of pipeline parallelism, namely identifying loop structures that have cross-iteration dependences whose dependence distance vector is constant; (2) synchronization among threads, namely inserting synchronization according to the dependence distance vector and deleting redundant synchronization with the same distance vector; and (3) thread scheduling with a static step length, namely defining a custom thread scheduling strategy that balances the workload of each thread and reduces communication overhead. Identification of the loop type relies on conventional array data-flow analysis and dependence tests, and pipeline parallelism handles only regular loop structures with backward cross-iteration dependences. Because the synchronization overhead of pipeline parallelism is high, it is applied only to the outermost level of a nested loop. The benefit of pipeline parallelism depends on the program: the larger the number of loop iterations and the longer the dependence distance, the greater the performance improvement. The method extends the capability of automatic parallel optimization and helps further improve the performance of scientific computing programs.

Description

A method for realizing automatic pipeline parallelism

Technical field

The invention belongs to the technical field of program compilation and relates in particular to a method for realizing automatic pipeline parallelism.

Background technology

Modern computer architecture is developing toward multi-core, multi-threaded designs, which poses new challenges for automatic program parallelization. The relative lag of parallelizing compilation technology has, to some extent, restricted the efficient use and wider adoption of high-performance computers. Automatically compiling serial programs into parallel programs is of great significance for exploiting the advantages of multi-core hardware, improving program performance, and promoting the development of parallel computing systems.

Loops are the program structure richest in parallelism, and they are often also the most time-consuming part of program execution. Automatic parallel optimization of a serial program is, in essence, parallelism analysis of loop bodies. Loop parallelism can be divided into two kinds: full parallelism and pipeline parallelism. Full parallelism applies to loop structures with no data dependence between loop iterations, which is also the loop structure permitted by the OpenMP programming model. Dependences between loop iterations are divided into forward dependences and backward dependences. In a forward dependence, i.e. a flow dependence, an array read in the current iteration depends on a write in a previous iteration; such loops can be made fully parallel through loop transformation. In a backward dependence, an array read in the current iteration must occur before the array write of a subsequent iteration. For a backward dependence, if the dependence distance vector is constant, parallel execution can be achieved through pipeline parallelism. Pipeline parallelism requires data synchronization between iterations and scheduling of the parallel threads to form a pipeline across threads, and is therefore considerably harder to handle than full parallelism. Domestic parallelization research has focused more on fully parallel loop structures.

The present invention proposes a method for realizing automatic pipeline parallelism. It uses spin locks to implement data synchronization among threads and a custom loop scheduling strategy to apply pipeline parallelism to loop structures that contain cross-iteration data dependences with constant dependence vectors. The invention not only increases the loop types covered by automatic parallel optimization but also breaks a foreign technical monopoly, and it is significant for domestic high-performance computing.

Summary of the invention

The object of the present invention is to provide a method for realizing automatic pipeline parallelism that balances load and optimizes performance.

The method for realizing automatic pipeline parallelism proposed by the present invention comprises the following steps: (1) identification of pipeline parallelism; (2) synchronization among threads; (3) thread scheduling with a static step length. Specifically:

Identification of pipeline parallelism determines, from the features of the loop structure, whether pipeline parallelism can be applied to a loop: a loop structure whose cross-iteration dependence is a backward dependence with a constant dependence distance vector can be pipelined. A parallel overhead model is then used to assess whether applying pipeline parallelism is worthwhile;

Synchronization among threads computes the synchronization points of the pipeline from the length of the dependence distance vector, thereby synchronizing the threads;

Thread scheduling with a static step length uses a custom thread scheduling strategy defined by the present invention to keep the workload of each thread balanced.

In the present invention, automatic pipeline parallelism automatically parallelizes loop structures with constant backward cross-iteration dependences, extending the capability of automatic parallel optimization.

In the present invention, spin locks are used to set up and drain the multi-threaded pipeline, and synchronization among threads is derived from the dependence distance vector.

In the present invention, the custom static step-length scheduling strategy reduces communication overhead and balances load among threads.

The present invention supplements full parallelism in the automatic parallel optimization process: when a loop structure cannot be fully parallelized, it is then checked for pipeline parallelism. Pipeline parallelism requires synchronization among threads, so its parallel overhead is larger than that of full parallelism. The larger the number of loop iterations, or the longer the data dependence distance between adjacent loop iterations, the greater the benefit of pipeline parallelism, and the synchronization overhead becomes negligible. When the backward dependence distance between loop iterations is not constant, the loop is executed serially; the present invention cannot handle this case.

Description of drawings

Fig. 1 shows the position of automatic pipeline parallelism within the overall automatic parallel optimization.

Fig. 2 shows an example of pipeline-parallel execution.

Embodiment

The concrete operation steps of the present invention are further described below.

Fig. 1 shows the overall automatic parallelization flow of the present invention.

First, determine the dependence type of the loop structure

Automatic parallel optimization first transforms the source program into an intermediate representation. The intermediate representation abstracts the source program in a structured form, records the information collected and produced during program analysis and optimization, and provides the program information needed by each stage of analysis, transformation, and optimization. Based on the intermediate representation, the parallel type of each loop structure is identified and the loop is labeled as fully parallel, pipeline parallel, or serial.

1. Using conventional data dependence analysis and dependence tests, a loop that has no data dependence between loop iterations and whose structure is reasonably regular is labeled fully parallel. For a loop with dependences between iterations, the sign of the dependence distance vector determines whether the dependence is forward or backward.

Let Sk(I), where I = (i1, i2, ..., ik) is the loop induction vector, denote the instance of statement Sk in the particular loop iteration (I1 = i1, I2 = i2, ..., Ik = ik). The following definitions then apply:

Definition 1: If a dependence exists in which statement instance Sq(J) must wait for Sp(I) to complete, then Sp(I) is called the source (entrance) of the dependence and Sq(J) its sink (exit). If the dependence is a data dependence that crosses loop iterations, then Sp(I) and Sq(J) are in different loop iterations.

Definition 2: If the source of a dependence is Sp(I) and its sink is Sq(J), then the distance vector of the dependence is J-I = (j1-i1, j2-i2, ..., jc-ic), where c is the innermost loop shared by Sp and Sq.

If the distance vector is constant and negative, the dependence is a forward dependence; if it is constant and positive, the dependence is a backward dependence. If the dependence distance vector is not constant, the loop has an irregular dependence and is executed serially.
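The definitions and the classification rule above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation. The function names are hypothetical, and two points are assumptions not stated in the text: the sign of a multi-dimensional distance vector is taken from its first non-zero component, and "constant" is checked over an explicit list of observed source and sink iteration vectors.

```python
from typing import Iterable, Sequence, Tuple

Vector = Tuple[int, ...]


def distance_vector(source: Sequence[int], sink: Sequence[int]) -> Vector:
    """Return J - I for a dependence with source Sp(I) and sink Sq(J)."""
    if len(source) != len(sink):
        raise ValueError("iteration vectors must have the same length")
    return tuple(j - i for i, j in zip(source, sink))


def classify_dependence(instances: Iterable[Tuple[Sequence[int], Sequence[int]]]) -> str:
    """Classify one dependence from observed (source, sink) iteration pairs.

    Returns "forward", "backward", "irregular", or "loop-independent".
    """
    vectors = {distance_vector(i, j) for i, j in instances}
    if len(vectors) != 1:
        # The distance is not constant: irregular dependence, run serially.
        return "irregular"
    (vector,) = vectors
    # Assumption: the sign of the vector is the sign of its first
    # non-zero component.
    for component in vector:
        if component > 0:
            return "backward"
        if component < 0:
            return "forward"
    # All components are zero: the dependence does not cross iterations.
    return "loop-independent"
```

Under these assumptions, a dependence observed as [((1,), (3,)), ((2,), (4,))] has the constant distance vector (2,) and is classified as backward, which is the case the method marks as a pipeline candidate.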

Forward dependences are handled with loop transformation techniques such as loop distribution, loop fusion, and unimodular transformation. A loop whose dependence vector is constant and whose dependence is backward is labeled pipeline parallel. The present invention is concerned only with loop structures that can be pipelined.

2. A parallel overhead model is used to evaluate the performance of pipeline parallelism. If there is a performance benefit, pipeline parallelism is adopted; otherwise the pipeline-parallel label on the loop structure is deleted. Because the synchronization overhead of pipelining the inner levels of a nested loop is large, only the outermost pipeline parallelism of a nested loop is kept, and inner loops marked pipeline parallel are executed serially.
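The rule of keeping only the outermost pipeline label in a loop nest can be sketched as a small tree walk. The `Loop` class, its label strings, and the function name are hypothetical stand-ins for the intermediate representation, which the text does not describe in detail. The parallel overhead model itself is not specified in the source and is not modeled here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Loop:
    """Hypothetical loop node: label is "fully parallel", "pipeline", or "serial"."""
    name: str
    label: str
    children: List["Loop"] = field(default_factory=list)


def keep_outermost_pipeline(loop: Loop, inside_pipeline: bool = False) -> Loop:
    """Demote pipeline labels nested inside a pipelined loop to serial."""
    if loop.label == "pipeline":
        if inside_pipeline:
            # An enclosing loop is already pipelined, so run this one serially.
            loop.label = "serial"
        else:
            inside_pipeline = True
    for child in loop.children:
        keep_outermost_pipeline(child, inside_pipeline)
    return loop
```

In this sketch only pipeline labels are changed. How inner loops labeled fully parallel are treated inside a pipelined loop is not stated in the text, so they are left untouched.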

Second, insert synchronization according to the dependence vector

In the pipeline setup stage, the scope of the spin locks is determined and the locks are initialized according to the number of parallel threads. The spin locks guarantee that the threads enter the pipeline in order. In the pipeline drain stage, spin locks are likewise used to drain the pipeline as required. The spin lock is implemented as a loop that waits on a condition. Synchronization among threads is determined by the dependence distance vector and is implemented by calling a system built-in function, such as the Fortran built-in function synchronize(). Redundant synchronization for identical dependence distances can be deleted to reduce synchronization overhead. Fig. 2 is a schematic diagram of pipeline parallelism.
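The idea of a condition-waiting loop that synchronizes on the dependence distance can be illustrated with a small Python sketch. Everything concrete here is an assumption made for illustration: the loop body a[i] = a[i - d] + 1, the per-iteration `done` flags, the contiguous block assignment, and the use of Python threads in place of OpenMP threads. It shows the waiting pattern only and is not the patent's generated code.

```python
import threading
import time
from typing import List


def pipelined_increment(n: int, d: int, num_threads: int) -> List[int]:
    """Compute a[i] = a[i - d] + 1 for i in [d, n) with threads that
    spin-wait on the iteration at dependence distance d."""
    if not 0 < d < n:
        raise ValueError("need 0 < d < n")
    if num_threads < 1:
        raise ValueError("need at least one thread")
    a = [0] * n
    # done[i] becomes True once a[i] holds its final value.
    # The first d elements are inputs, so they are ready from the start.
    done = [i < d for i in range(n)]
    total = n - d
    block = -(-total // num_threads)  # ceiling division

    def worker(t: int) -> None:
        lo = d + t * block
        hi = min(lo + block, n)
        for i in range(lo, hi):
            # Condition-waiting loop: spin until the source iteration is done.
            while not done[i - d]:
                time.sleep(0)  # yield so the producing thread can run
            a[i] = a[i - d] + 1
            done[i] = True

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return a
```

Each thread walks its iterations in increasing order and only ever waits on a smaller iteration number, so the smallest unfinished iteration can always proceed and the sketch cannot deadlock. A compiled implementation would additionally need atomic or memory-fenced flag accesses, which Python hides here.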

Third, insert the custom thread scheduling function before the parallel loop structure

The OpenMP 3.0 programming standard allows user-defined thread scheduling algorithms. To balance load, exploit parallelism effectively, and reduce communication overhead, the present invention adopts a static step-length scheduling strategy. Let L and U be the lower and upper bounds of the loop, respectively, and let the number of threads T satisfy T < U-L+1; the block size assigned to each thread is then the ceiling of (U-L+1)/T.
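The block-size formula can be checked with a short Python sketch. The function name and the choice to assign blocks contiguously in thread order are assumptions; the text only gives the block size as the ceiling of (U-L+1)/T.

```python
from typing import List, Tuple


def static_blocks(lower: int, upper: int, num_threads: int) -> List[Tuple[int, int]]:
    """Split the inclusive range [lower, upper] into blocks of
    ceil((upper - lower + 1) / num_threads) iterations, one per thread."""
    count = upper - lower + 1
    if not 0 < num_threads < count:
        raise ValueError("need 0 < T < U - L + 1")
    block = -(-count // num_threads)  # ceiling of count / num_threads
    ranges = []
    for t in range(num_threads):
        lo = lower + t * block
        if lo > upper:
            break  # with ceiling blocks, trailing threads may get no work
        ranges.append((lo, min(lo + block - 1, upper)))
    return ranges
```

For L = 1, U = 10, and T = 4 the block size is ceil(10/4) = 3, giving the ranges (1, 3), (4, 6), (7, 9), and (10, 10). The last thread receives a shorter block, and for some combinations of bounds and thread count the final threads receive none.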

Fourth, translate into a parallel program with OpenMP annotations

Automatic parallel optimization generally uses source-to-source program transformation: after analysis and optimization, the source program is translated into a source program with OpenMP parallel directives. Following the OpenMP standard, the intermediate representation is converted into parallel source code, and the shared variables, privatized variables, and so on in the pipeline-parallel loop structure are identified.
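As a rough illustration of the last step, the sketch below builds the text of an OpenMP directive from lists of shared and private variable names. The function name is hypothetical, and emitting C-style `#pragma omp parallel` syntax is an assumption: the source does not say which OpenMP host language the translator targets, and it mentions Fortran elsewhere.

```python
from typing import Sequence


def omp_parallel_directive(shared: Sequence[str], private: Sequence[str]) -> str:
    """Build a C-style OpenMP parallel directive with data-sharing clauses."""
    parts = ["#pragma omp parallel"]
    if shared:
        parts.append("shared(" + ", ".join(shared) + ")")
    if private:
        parts.append("private(" + ", ".join(private) + ")")
    return " ".join(parts)
```

Calling `omp_parallel_directive(["a", "n"], ["i"])` yields the string `#pragma omp parallel shared(a, n) private(i)`, which a source-to-source translator would place before the transformed loop.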

Claims

1. A method for realizing automatic pipeline parallelism, characterized in that it comprises the steps of: (1) identification of pipeline parallelism; (2) synchronization among threads; (3) thread scheduling with a static step length; wherein:

Thread scheduling with a static step length uses a custom thread scheduling strategy that keeps the workload of each thread balanced;

The concrete operation steps are:

First, determine the dependence type of the loop structure:

First, the source program is transformed into an intermediate representation. The intermediate representation abstracts the source program in a structured form, records the information collected and produced during program analysis and optimization, and provides the program information needed by each stage of analysis, transformation, and optimization; based on the intermediate representation, the parallel type of each loop structure is identified and the loop is labeled as fully parallel, pipeline parallel, or serial, by the following steps:

1) using existing data dependence analysis and dependence tests, a loop that has no data dependence between loop iterations and whose structure is reasonably regular is labeled fully parallel; for a loop with dependences between iterations, the sign of the dependence distance vector determines whether the dependence is forward or backward: if the distance vector is constant and negative, the dependence is a forward dependence; if the distance vector is constant and positive, it is a backward dependence; if the dependence distance vector is not constant, the loop has an irregular dependence and is executed serially;

2) a parallel overhead model is used to evaluate the performance of pipeline parallelism; if there is a performance benefit, pipeline parallelism is adopted, otherwise the pipeline-parallel label on the loop structure is deleted; pipeline parallelism is applied only to the outermost level of a nested loop;

Second, insert synchronization according to the dependence vector:

In the pipeline setup stage, the scope of the spin locks is determined and the locks are initialized according to the number of parallel threads; the spin locks guarantee that the threads enter the pipeline in order; in the pipeline drain stage, spin locks are used to drain the pipeline as required; the spin lock is implemented as a loop that waits on a condition; synchronization among threads is determined by the dependence distance vector and is implemented by calling a system built-in function; redundant synchronization for identical dependence distances is deleted;

Third, insert the custom thread scheduling function before the parallel loop structure:

A static step-length scheduling strategy is adopted: let L and U be the lower and upper bounds of the loop, respectively, and let the number of threads T satisfy T < U-L+1; the block size assigned to each thread is then the ceiling of (U-L+1)/T;

Fourth, translate into a parallel program with OpenMP annotations:

Source-to-source program transformation is adopted: after analysis and optimization, the source program is translated into a source program with OpenMP parallel directives.