JP2000020482A - Loop parallelizing method

Info

Publication number: JP2000020482A
Application number: JP10182368A
Authority: JP
Inventors: Shinichi Ito; 信一伊藤
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1998-06-29
Filing date: 1998-06-29
Publication date: 2000-01-21
Anticipated expiration: 2018-06-29
Also published as: JP3729644B2

Abstract

PROBLEM TO BE SOLVED: To provide a loop parallelization method that can parallelize loops having data dependences across loop iterations, which has conventionally been difficult. SOLUTION: For a nested loop in which data dependences exist across loop iterations, a pipeline parallelization conversion unit 15 generates loops that issue barrier synchronization before and after the nested loop, and generates a statement that issues barrier synchronization immediately after the split loop. The nested loop can thus be executed in parallel on a multiprocessor in a pipelined manner.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a loop parallelization method using barrier synchronization, and more particularly to a loop parallelization method in a compiler that is well suited to generating object code for a multiprocessor equipped with a barrier synchronization mechanism.

[0002]

[Prior Art] A known conventional loop parallelization technique for generating object code that executes efficiently in parallel on a shared-memory multiprocessor using barrier synchronization for synchronization control is described, for example, in "Super Compiler" (Hans Zima and Barbara Chapman, translated by Yoichi Muraoka, Ohmsha), pp. 276-297.

[0003] Conventional loop parallelization targets DOALL loops, that is, loops with no data dependence across loop iterations (hereinafter called a loop-carried dependence). When a dependence exists between parallelized DOALL loops, the conventional technique guarantees correct parallel execution by inserting a statement that performs barrier synchronization between the DOALL loops.

[0004] FIG. 7 illustrates the configuration and execution image of a conventional multiprocessor capable of executing, in parallel, object code parallelized by the loop parallelization method according to the present invention. Parallel execution of the object code is described below with reference to FIG. 7.

[0005] The multiprocessor shown in FIG. 7 functions as a single node and consists of one master processor 61 and a plurality of (for example, eight) slave processors 62. Each processor has its own cache, and all processors share memory. With this configuration, the multiprocessor shown in FIG. 7 adopts a parallel execution scheme called the SID-SMP scheme to speed up the processing of a single program. This scheme is described below.

[0006] In FIG. 7, a program first starts executing on the master processor 61. At the start of program execution, the master processor 61 treats the program it is executing as the parent thread and creates as many child threads as there are slave processors.

[0007] Each created child thread is bound to one slave processor 62 until the program ends. Overhead for thread creation and deletion is therefore incurred only at the start and end of the program. The compiler divides the program in advance into sequential parts and parallel parts. A sequential part is executed only by the master processor 61; while it runs, the other processors execute a spin loop, waiting for work. When execution reaches a parallel part, all slave processors 62 begin executing it together with the master processor 61. When every processor has finished the parallel part, the master processor 61 begins the subsequent sequential part, and the slave processors 62 return to the waiting spin loop.

[0008] This scheme keeps the child threads waiting on fixed processors in order to minimize the overhead of starting and ending parallel execution.

[0009] During parallel execution on the slave processors, it is sometimes necessary to synchronize the processors. Barrier synchronization is generally used for this purpose. In barrier synchronization, every processor that reaches the synchronization point waits there until all processors have reached it; it can be implemented with dedicated hardware instructions.
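The waiting behaviour of barrier synchronization can be illustrated in software. The following is a minimal sketch, assuming Python's `threading.Barrier` as a stand-in for the dedicated hardware instruction; the function name `run_with_barrier` is hypothetical and does not come from the publication.

```python
import threading

def run_with_barrier(num_workers):
    """Each worker records its arrival, waits at the barrier, then counts arrivals."""
    barrier = threading.Barrier(num_workers)
    arrived = []                       # appended to before the barrier
    seen_after = [0] * num_workers     # arrivals each worker observes after the barrier

    def worker(rank):
        arrived.append(rank)           # work done before the synchronization point
        barrier.wait()                 # block until all workers have arrived
        seen_after[rank] = len(arrived)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen_after
```

Because no worker passes the barrier until every worker has reached it, each worker should observe all arrivals once it continues.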

[0010] Automatic parallelizing compilers, meanwhile, target loops for parallelization. Not every loop qualifies: the loop must be one whose loop length can be computed before the loop executes, such as a FORTRAN DO loop. The compiler analyzes these loops to determine whether they can be executed in parallel. For each loop judged parallelizable, it partitions the loop's iterations and generates code in which each processor executes one of the resulting partial loops.

[0011] How the iterations of a parallel loop are partitioned and assigned to processors is known as the multiprocessor scheduling problem. Two scheduling approaches are possible: dynamic scheduling and static scheduling.

[0012] Dynamic scheduling decides at run time which processor receives each piece of work, so it can balance the load across processors nearly evenly, but the run-time overhead of assigning work to processors tends to become too large. In static scheduling, by contrast, the compiler decides the assignment statically, so run-time overhead is low. For DO loops in particular, a simple partition is thought to balance the load nearly evenly in many cases, and the multiprocessor described with reference to FIG. 7 also adopts static partitioning by the compiler as its scheduling scheme.

[0013] FIG. 8 shows an example of a parallelizable DO loop and the code obtained by converting it for parallel execution by static partitioning, which is described below.

[0014] Various methods of statically partitioning a DO loop are known, but block partitioning is basically used. With block partitioning, each processor executes one loop consisting of a run of consecutive iterations one block length long. The block length is obtained by dividing the loop length by the total number of processors.

[0015] The computation of T in the parallelized loop shown in FIG. 8 determines the block length. In this formula, T is the block length, PROC_PN is the total number of processors, and CELI specifies rounding up when the division is not exact. Each processor holds its own identification number (PROC_ID in the example of FIG. 8), so it can compute the lower bound L and upper bound U of the iterations it is responsible for from its processor number and the block length.
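The block-length and bounds computation can be sketched as follows. This is an illustration only: FIG. 8 is not reproduced in this text, so the 0-based numbering of `proc_id` and the exact bound formulas are assumptions, and the function names `block_bounds` and `all_blocks` are hypothetical.

```python
def block_bounds(n, proc_pn, proc_id):
    """Return (T, L, U) for one processor under block partitioning of 1..n.

    Assumes proc_id runs from 0 to proc_pn - 1; the numbering and the
    formulas for L and U are assumptions, not taken from FIG. 8.
    """
    t = -(-n // proc_pn)              # ceiling division: block length T
    lower = proc_id * t + 1           # first iteration owned by this processor
    upper = min(lower + t - 1, n)     # last iteration, clipped to the loop length
    return t, lower, upper

def all_blocks(n, proc_pn):
    """List the (L, U) range of every processor."""
    return [block_bounds(n, proc_pn, pid)[1:] for pid in range(proc_pn)]
```

For a loop of length 10 on 4 processors, the block length is 3, and the last processor receives only the single leftover iteration.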

[0016] Code for parallel execution such as that shown in FIG. 8 is loaded onto every processor, and the same code is executed in parallel. The variables it references, however, fall into two groups: global variables shared by all processors, and local variables allocated in an area private to each processor. Basically, variables that appear in the original program are global, and variables generated by parallelization, such as T, L, and U, are local. The loop control variable, that is, the variable that controls execution of the DO loop (I in the example of FIG. 8), becomes a local variable on parallelization; otherwise, different processors would write to the same location. In the example of FIG. 8, the parallelized loop control variable is written %I to distinguish it from the original loop control variable.

[0017]

[Problems to Be Solved by the Invention] In general, techniques for efficiently extracting parallelism from nested loops are important for improving the execution performance of a multiprocessor. However, when a nested loop contains no DOALL loop, the conventional technique described above, which uses barrier synchronization, has difficulty parallelizing it unless a DOALL loop can be created with a loop transformation such as loop distribution.

[0018] An object of the present invention is to solve this problem of the prior art and to provide a loop parallelization method in a compiler that, even for a nested loop containing no DOALL loop, parallelizes the loop using barrier synchronization whenever parallelism can be extracted, and thereby generates object code that achieves efficient execution performance on a multiprocessor.

[0019]

[Means for Solving the Problems] According to the present invention, this object is achieved by a loop parallelization method in a compiler that translates a high-level language to generate object code for a shared-memory multiprocessor equipped with a barrier synchronization mechanism, wherein, for a nested loop having data dependences across loop iterations, one of the loops of the nested loop is split, and, so that the split loops execute in parallel in a pipelined manner, loops that issue barrier synchronization are generated before and after the nested loop after splitting, and a statement that issues barrier synchronization is generated immediately after the split loop.

[0020] The object is further achieved by blocking the outer loop when the split loop is the inner loop, and blocking the inner loop when the split loop is the outer loop.

[0021]

[Embodiments of the Invention] An embodiment of the loop parallelization method using barrier synchronization according to the present invention is described in detail below with reference to the drawings.

[0022] FIG. 1 shows the configuration of a compiler that performs loop parallelization according to the present invention. FIG. 2 is a flowchart explaining the operation of the compiler shown in FIG. 1. FIG. 3 shows an example of a DO loop written in FORTRAN. FIG. 4 illustrates the DO loop of FIG. 3 after parallel partitioning. FIG. 5 illustrates the DO loop of FIG. 4 after barrier synchronization statements have been inserted. FIG. 6 illustrates the DO loop of FIG. 5 after blocking has been applied. In FIG. 1, 11 is a source program, 12 is a compiler, 13 is a parallelization conversion unit, 14 is a DOALL-type parallelization conversion unit, 15 is a pipeline-type parallelization conversion unit, 16 is an object code generation unit, 17 is an object program, and 18 is a program storage medium.

[0023] The compiler 12, which performs loop parallelization according to the present invention, takes as input a source program 11 written in a sequential high-level language such as FORTRAN, generates object code executable on a multiprocessor, and outputs it as an object program 17. It consists of a parallelization conversion unit 13 and an object code generation unit 16. The parallelization conversion unit 13 decides whether to parallelize and performs the conversion: it recognizes the loops in the program, determines for each loop whether parallelization is possible, and, when it is, converts the loop in the input source program into a loop that can be executed in parallel.

[0024] The parallelization conversion unit 13 consists of a DOALL-type parallelization conversion unit 14, which performs conventional DOALL-type parallelization, and a pipeline-type parallelization conversion unit 15, which performs the pipeline-type parallelization of the present invention. For each loop in the input source program, the parallelization conversion unit 13 first attempts DOALL-type parallelization and then attempts pipeline-type parallelization on loops to which DOALL-type parallelization cannot be applied.

[0025] The compiler program that executes the processing of the compiler 12 is stored on the recording medium 18.

[0026] Next, the processing performed by the compiler 12 is described with reference to the flow shown in FIG. 2.

[0027] (1) Data dependence analysis is performed on each loop nest in the input source program (step 201).

[0028] (2) Based on the data dependence analysis, the compiler checks whether DOALL-type parallelization is applicable; if it is, the DOALL-type parallelization conversion unit 14 performs the parallelization (steps 202 and 203).

[0029] (3) If step 202 determines that DOALL-type parallelization is not applicable, the compiler checks whether pipeline-type parallelization is applicable; if it is, the pipeline-type parallelization conversion unit 15 performs the parallelization (steps 204 and 205).

[0030] (4) If step 204 determines that pipeline-type parallelization is not applicable, no parallelization is performed, and processing returns to step 201 to continue with the analysis of the next loop nest. Loops that are not parallelized are executed sequentially (step 206).
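The decision flow of steps 201 to 206 can be sketched as a small dispatcher. This is a hedged illustration, not the compiler's actual implementation: the function name `choose_parallelization` and the two predicate arguments are hypothetical, and the predicates merely stand in for the data dependence analysis, which is not modelled here.

```python
def choose_parallelization(loop_nests, is_doall, is_pipelinable):
    """Decide, per loop nest, which conversion applies (steps 201-206 of FIG. 2).

    is_doall and is_pipelinable are caller-supplied predicates standing in
    for the dependence analysis, which this sketch does not implement.
    """
    decisions = []
    for nest in loop_nests:                          # step 201: examine each loop nest
        if is_doall(nest):                           # step 202
            decisions.append((nest, "doall"))        # step 203
        elif is_pipelinable(nest):                   # step 204
            decisions.append((nest, "pipeline"))     # step 205
        else:
            decisions.append((nest, "sequential"))   # step 206
    return decisions
```

The ordering matters: pipeline-type parallelization is only considered for a loop nest once DOALL-type parallelization has been ruled out.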

[0031] Next, taking the program shown in FIG. 3 as an example source program, an outline of pipeline-type parallelization is given.

[0032] The example in FIG. 3 is a DO loop written in FORTRAN. With respect to array A, this DO loop has loop-carried dependences on both the outer loop 21 and the inner loop 22. For such a loop, the inner loop is first split and assigned to the processors of the multiprocessor. With the processors numbered P1, P2, ..., Pn, the iterations [1:N] of the inner loop are split and assigned in order to processors P1, P2, ..., Pn.

[0033] Let the inner-loop iterations assigned to processor Pk be [Lk:Uk], where Lk is the lower bound and Uk the upper bound of the iterations. Taking the outer-loop iterations [1:N] into account, processor Pk therefore executes the iterations [1:N, Lk:Uk].

[0034] Let M denote an iteration of the outer loop. Because iterations [M, L(k+1):U(k+1)], which belong to the next processor, depend on iterations [M, Lk:Uk], execution of [M, L(k+1):U(k+1)] cannot begin until execution of [M, Lk:Uk] has completed. Conversely, no other dependences exist between iterations assigned to different processors. In such a case, pipelined parallelization can be applied.

[0035] Pipelined parallelization using barrier synchronization works by having the processors begin executing the loop in ascending order of processor number. First, processor P1 begins executing the loop. When processor P1 finishes the first iteration of the outer loop, barrier synchronization is issued and processor P2 begins executing the loop. Next, when processor P1 has finished the second iteration of the outer loop and processor P2 has finished the first, barrier synchronization is issued and processor P3 begins executing the loop.

[0036] In the same way, the processors begin executing the loop one after another, synchronizing through barrier synchronization, so that the number of processors executing the loop grows by one at a time until all processors are executing in parallel. Throughout, barrier synchronization is issued each time one iteration of the outer loop finishes. Like the start, the end of loop execution necessarily proceeds from the lowest processor number: after processor P1 finishes the loop, the number of processors executing it falls by one at a time until all processors have finished.
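The ramp-up and ramp-down of the pipeline can be tabulated. The sketch below is an illustration under stated assumptions: a "stage" is taken to be the work between two consecutive barrier synchronizations, processors are numbered from 1, and the function name `pipeline_schedule` is hypothetical.

```python
def pipeline_schedule(n_outer, num_procs):
    """Return, per stage, the outer iteration each processor executes (or None).

    A stage is the work between two consecutive barrier synchronizations.
    Processor p (1-based) runs outer iteration i in stage p - 1 + i.
    """
    num_stages = n_outer + num_procs - 1
    schedule = []
    for stage in range(1, num_stages + 1):
        row = []
        for p in range(1, num_procs + 1):
            i = stage - (p - 1)            # outer iteration for processor p
            row.append(i if 1 <= i <= n_outer else None)
        schedule.append(row)
    return schedule
```

With three outer iterations on three processors, the table has five stages: the first two fill the pipeline, the middle one has every processor busy, and the last two drain it. In each row, a processor is always exactly one outer iteration behind its lower-numbered neighbour, which is what the dependence between adjacent processors requires.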

[0037] To realize the pipelined parallelization using barrier synchronization described above, this embodiment generates object code according to the following policy. First, a loop that issues barrier synchronization (processor number - 1) times is generated immediately before the loop. Next, the inner loop is split and assigned to the processors, and a statement that issues barrier synchronization is generated immediately after the split loop. Finally, a loop that issues barrier synchronization (total number of processors - processor number) times is generated immediately after the loop.
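A barrier completes only when every processor issues it, so every processor has to issue the same number of barrier synchronizations in total. As a short check, writing P for the processor number, PN for the total number of processors, and N for the outer loop length (one barrier per outer iteration, as described for the unblocked case), the three counts sum to a value that does not depend on P:

```latex
\underbrace{(P - 1)}_{\text{before the loop}}
+ \underbrace{N}_{\text{one per outer iteration}}
+ \underbrace{(\mathit{PN} - P)}_{\text{after the loop}}
= N + \mathit{PN} - 1
```

For example, with N = 100 and PN = 8, each processor issues 107 barrier synchronizations regardless of its processor number.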

[0038] In the pipelined parallelization described so far, one pipeline stage, that is, the execution between successive barrier synchronizations, corresponds to one iteration of the outer loop. When the outer loop is known or predicted to be sufficiently long, one pipeline stage can instead execute several iterations of the outer loop. This is called blocking; it reduces the number of barrier synchronizations issued and can be expected to shorten execution time.

[0039] As noted above, the DO loop of the FORTRAN program shown in FIG. 3 has loop-carried dependences on both the outer and the inner loop, so DOALL-type parallelization cannot be applied. Pipeline-type parallelization is applied instead. FIG. 4 shows the DO loop obtained by splitting the DO loop of FIG. 3 according to the policy above; it is described below.

[0040] First, the iterations of the inner loop are divided evenly by the number of processors in the multiprocessor and assigned to the processors in order. With B the loop length of the inner loop assigned to each processor and PN the total number of processors, B = [N/PN], where [ ] denotes rounding up to an integer when the division is not exact. With L and U the lower and upper bounds of the inner-loop iterations executed by each split loop, L = B × (P - 1) + 1 and U = MIN(L + B - 1, N), where P is the processor number (P = 1, 2, ..., PN).
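As a worked example of these formulas, take N = 10 and PN = 4 (values chosen here purely for illustration):

```latex
B = \lceil N / \mathit{PN} \rceil = \lceil 10 / 4 \rceil = 3

\begin{array}{c|c|c}
P & L = B \times (P - 1) + 1 & U = \min(L + B - 1,\ N) \\ \hline
1 & 1  & 3  \\
2 & 4  & 6  \\
3 & 7  & 9  \\
4 & 10 & 10
\end{array}
```

The last processor receives only the leftover iteration. With N = 9 instead, processor 4 would get L = 10 and U = 9, an empty range, so it would execute no inner-loop iterations at all.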

[0041] With this conversion, each processor can compute the loop bounds from its own processor number and execute the loop assigned to it. To this end, the DO loop given to each processor contains the formula 31 for the loop length of the split loop, the formula 32 for its lower bound, the formula 33 for its upper bound, and the split inner loop 34.

[0042] In addition, barrier synchronization statements, which issue barrier synchronization, are inserted so that the processors execute the split loop of FIG. 4 in parallel in a pipelined manner. As shown in FIG. 5, these consist of a prologue loop 41 of barrier synchronizations inserted immediately before the loop of FIG. 4, an epilogue loop 43 of barrier synchronizations inserted immediately after it, and a barrier synchronization statement 42 inserted immediately after the split loop. The prologue loop 41 is repeated (P - 1) times, and the epilogue loop 43 is repeated (PN - P) times.

[0043] As a result, each processor of the multiprocessor executes the prologue loop 41 until the processor whose number is one lower has finished one pass of its split DO loop, and then begins its own DO loop processing. The processors thus start in ascending order of processor number, and all of them cooperate in pipelined parallel execution. The DO loop processing on each processor also finishes in ascending order of processor number, but because each processor executes the epilogue loop 43, all processors finish at the same time.
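The structure of prologue loop, split inner loop with a trailing barrier, and epilogue loop can be tried out in software. The following is a minimal sketch under several assumptions: the loop body `a[i][j] = a[i-1][j] + a[i][j-1]` is a hypothetical stand-in, since FIG. 3 is not reproduced in this text; Python threads with `threading.Barrier` stand in for processors and the hardware barrier; and the function names are invented for illustration.

```python
import threading

def sequential(n):
    """Reference: a[i][j] = a[i-1][j] + a[i][j-1] with a border of ones."""
    a = [[1] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a

def pipelined(n, pn):
    """Same loop nest, inner loop split over pn threads with barriers."""
    a = [[1] * (n + 1) for _ in range(n + 1)]
    barrier = threading.Barrier(pn)
    b = -(-n // pn)                              # block length B = ceil(N / PN)

    def worker(p):                               # p is the processor number, 1..pn
        lower = b * (p - 1) + 1
        upper = min(lower + b - 1, n)
        for _ in range(p - 1):                   # prologue loop: (P - 1) barriers
            barrier.wait()
        for i in range(1, n + 1):                # outer loop
            for j in range(lower, upper + 1):    # split inner loop
                a[i][j] = a[i - 1][j] + a[i][j - 1]
            barrier.wait()                       # barrier right after the split loop
        for _ in range(pn - p):                  # epilogue loop: (PN - P) barriers
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(1, pn + 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a
```

Comparing `pipelined(8, 4)` with `sequential(8)` checks that the barrier placement preserves the dependences of this particular loop body. A thread with an empty inner range still issues all of its barriers, which keeps the barrier counts equal across threads.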

[0044] The embodiment above was described as splitting the inner loop and assigning the split DO loops to the processors of the multiprocessor, but the present invention may instead split the outer loop and assign the split DO loops to the processors.

[0045] In the loop assigned to each processor by the embodiment above, the barrier synchronization statement 42 inside the nested loop is issued N times, equal to the outer loop length. Applying blocking to the outer loop reduces the number of barrier synchronizations executed.

[0046] FIG. 6 shows an example of the loop obtained by applying blocking to the outer loop. It is the loop of FIG. 5 with the outer loop blocked, which turns it into a double loop consisting of loop 51 and loop 52. Loop 52 is the blocked loop, of loop length T, and loop 51 is the loop that controls its execution. The barrier synchronization statement is now inserted immediately after the blocked loop, as barrier synchronization statement 53. As a result, barrier synchronization statement 53 is issued [N/T] times with blocking, compared with N times without it.
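The effect of blocking on the barrier count can be checked with a few lines. This sketch only models how the outer iterations are grouped into stages; it does not reproduce FIG. 6, and the function names `blocked_stages` and `barriers_in_loop` are hypothetical.

```python
def blocked_stages(n_outer, t):
    """Split outer iterations 1..n_outer into blocks of at most t iterations.

    Each block is one pipeline stage, followed by one barrier synchronization.
    """
    return [(start, min(start + t - 1, n_outer))
            for start in range(1, n_outer + 1, t)]

def barriers_in_loop(n_outer, t=1):
    """Number of barriers issued inside the loop nest (t=1 means no blocking)."""
    return len(blocked_stages(n_outer, t))
```

For N = 10 and T = 4, the stages cover outer iterations 1 to 4, 5 to 8, and 9 to 10, so three barriers are issued inside the loop nest instead of ten.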

[0047] The description above splits the inner loop and blocks the outer loop, but the present invention can equally split the outer loop and block the inner loop, with the same effect.

[0048]

[Effects of the Invention] As described above, according to the present invention, even a nested loop containing no DOALL loop can be parallelized using barrier synchronization whenever parallelism can be extracted, generating object code that achieves efficient execution performance on a multiprocessor.

[Brief description of the drawings]

FIG. 1 is a diagram showing the configuration of a compiler that performs loop parallelization according to the present invention.

FIG. 2 is a flowchart explaining the operation of the compiler shown in FIG. 1.

FIG. 3 is a diagram showing an example of a DO loop written in FORTRAN.

FIG. 4 is a diagram illustrating the DO loop of FIG. 3 after parallel partitioning.

FIG. 5 is a diagram illustrating the DO loop of FIG. 4 after barrier synchronization statements have been inserted.

FIG. 6 is a diagram illustrating the DO loop of FIG. 5 after blocking has been applied.

FIG. 7 is a diagram illustrating the configuration and execution image of a conventional multiprocessor capable of executing, in parallel, object code parallelized by the loop parallelization method according to the present invention.

FIG. 8 is a diagram showing an example of a parallelizable DO loop and the code obtained by converting it for parallel execution by static partitioning.

[Explanation of symbols]

11: source program, 12: compiler, 13: parallelization conversion unit, 14: DOALL-type parallelization conversion unit, 15: pipeline-type parallelization conversion unit, 16: object code generation unit, 17: object program, 18: program storage medium

Claims

1. A loop parallelization method in a compiler that translates a high-level language to generate object code for a shared-memory multiprocessor having a barrier synchronization mechanism, wherein, for a nested loop having a data dependence across loop iterations, one of the loops of the nested loop is split, and, in order to execute the split loops in parallel in a pipelined manner, loops that issue barrier synchronization are generated before and after the nested loop after splitting, and a statement that issues barrier synchronization is generated immediately after the split loop.

2. The loop parallelization method according to claim 1, wherein the outer loop is blocked when the split loop is an inner loop, and the inner loop is blocked when the split loop is an outer loop.

3. A recording medium storing a program for realizing the loop parallelization method according to claim 1.