JP2001306333A - Loop parallelizing method

Info

Publication number: JP2001306333A
Application number: JP2000122022A
Authority: JP
Inventors: Kiyomi Wada; 清美和田; Hiroshi Ota; 寛太田
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2000-04-24
Filing date: 2000-04-24
Publication date: 2001-11-02

Abstract

PROBLEM TO BE SOLVED: To output a program or object code that shortens the parallel execution time of a loop consisting of statements included in a dependence cycle and statements not included in one. SOLUTION: A loop parallelizing method that takes as input a computer program 14 including loop processing and outputs a parallelized program 16 or object code is provided with a processing unit 123 that divides the loop body, a processing unit 125 that schedules the loop consisting of the statements included in the dependence cycle onto the processors so that it executes in sequential execution order, and a processing unit 126 that schedules, for each processor, the loops consisting of the statements not included in the dependence cycle so that the synchronization wait caused by the dependence cycle is no longer idle time.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a loop parallelizing method that divides a computer program and parallelizes it into loops of a program executable on a parallel computer. More particularly, it relates to a loop parallelizing method that takes as input a computer program having a loop containing both statements included in a dependence cycle and statements not included in one, and outputs a parallelized program or object code, thereby shortening loop execution time on the parallel computer.

[0002]

[Prior Art] For a parallel computer made up of a plurality of processors, a conventional parallelizing compiler transforms a program so that the loop range of each loop appearing in the program is divided and assigned to the processors. In some cases the program must be transformed into one that executes in parallel while synchronizing between processors. Consider, for example, input program 40 shown in FIG. 4. This program repeatedly executes statements S1, S2, and S3 for i = 1 to 9 and then terminates. In input program 40, the value assigned to D(1) when i = 1 is referenced as D(i-1) when i = 2, so the i = 1 and i = 2 iterations cannot be executed independently in parallel. When a loop with such a loop-carried dependence is converted by a conventional parallelizing compiler into a program executable on a parallel computer, the result is output program 50 shown in FIG. 5. The parallelizing compiler changes the loop range of the do loop in output program 50 from 1 to n into plb to pub, where plb and pub are the lower and upper bounds of the loop range that the compiler assigns to each processor. The iterations of the do loop are executed in parallel, but not fully independently: they are executed with appropriate synchronization.
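The loop-carried dependence described above can be illustrated with a minimal sequential sketch in Python. The text quotes only S1 (D(i)=D(i-1)+1) and S3 (A(i)=C(i)/2+C(i-1)*3); the body of S2, the array B, and the boundary values D(0) and C(0) are assumptions made purely for illustration and are not taken from FIG. 4.

```python
def run_input_program(n=9):
    """Sequential reference for a loop shaped like input program 40 (1-based i)."""
    B = [float(i) for i in range(n + 1)]  # hypothetical input array (assumption)
    C = [0.0] * (n + 1)                   # C[0] is an assumed boundary value
    D = [0.0] * (n + 1)                   # D[0] is an assumed boundary value
    A = [0.0] * (n + 1)
    for i in range(1, n + 1):
        D[i] = D[i - 1] + 1               # S1: reads D(i-1) written by the previous iteration
        C[i] = B[i] * 2                   # S2: hypothetical body, not given in the text
        A[i] = C[i] / 2 + C[i - 1] * 3    # S3: reads C(i) and C(i-1)
    return A, C, D
```

Because S1 at iteration i reads the value S1 wrote at iteration i-1, the iterations cannot simply be handed to different processors without synchronization.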

[0003] In this example, each processor waits for the signal from the processor executing iteration i-1 (wait_signal) before executing S1, and a processor that has finished executing S1 issues a signal indicating so (send_signal). FIG. 14 shows how output program 50 executes in parallel. As FIG. 14 shows, execution must proceed while synchronizing (shown by arrows) in order to use values defined in the immediately preceding loop iteration. FIG. 14 omits unnecessary synchronization within a single processor and shows only synchronization between different processors. Here, for the statements D(i)=D(i-1)+1 and A(i)=C(i)/2+C(i-1)*3, a signal must be sent to another processor at the transitions from i=3 to i=4 and from i=6 to i=7. The parallelization technique described above is set out in Hans Zima and Barbara Chapman, "Supercompilers for Parallel and Vector Computers", ACM Press, 1990.
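A minimal sketch of this conventional DOACROSS execution, using Python threads in place of processors, may help. The names wait_signal and send_signal come from the text; modelling them with `threading.Event` objects, the body of S2, the array B, and the requirement that n be a multiple of the processor count are all assumptions of this sketch, not the contents of FIG. 5.

```python
import threading

def run_doacross(n=9, num_pe=3):
    """Conventional DOACROSS: every iteration synchronizes with iteration i-1."""
    chunk = n // num_pe                         # assumes n is a multiple of num_pe
    B = [float(i) for i in range(n + 1)]        # hypothetical input array
    A = [0.0] * (n + 1)
    C = [0.0] * (n + 1)
    D = [0.0] * (n + 1)
    s1_done = [threading.Event() for _ in range(n + 1)]
    s2_done = [threading.Event() for _ in range(n + 1)]
    s1_done[0].set()                            # iteration 0 does not exist: treat as done
    s2_done[0].set()

    def worker(plb, pub):
        for i in range(plb, pub + 1):
            s1_done[i - 1].wait()               # wait_signal for S1 of iteration i-1
            D[i] = D[i - 1] + 1                 # S1
            s1_done[i].set()                    # send_signal after S1
            C[i] = B[i] * 2                     # S2 (hypothetical body)
            s2_done[i].set()                    # send_signal after S2
            s2_done[i - 1].wait()               # wait_signal before S3
            A[i] = C[i] / 2 + C[i - 1] * 3      # S3

    threads = [
        threading.Thread(target=worker, args=(pe * chunk + 1, (pe + 1) * chunk))
        for pe in range(num_pe)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return A, C, D
```

In this arrangement the thread that owns i=4 cannot start S1 until the thread that owns i=3 has reached its last iteration, which is the waiting that the following paragraphs set out to remove.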

[0004]

[Problems to Be Solved by the Invention] With the conventional technique described above, when a program containing loop processing is input and the parallelized program is executed on a plurality of processors, a loop with a loop-carried dependence yields output program 50 shown in FIG. 5. When the iterations i = 1 to 3, 4 to 6, and 7 to 9 are assigned to three processors PE1, PE2, and PE3, then, as FIG. 14 shows, PE2 and the subsequent processors remain in a synchronization wait state, that is, wasted time, until they receive the signals shown by the arrows. Because the processors are idle during this synchronization wait, execution takes a long time.
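The idle time described above can be estimated with a small timing model. The sketch below is not taken from FIG. 14: it assumes that every statement instance costs one time unit, that signalling is free, and that each processor runs S1, S2, S3 per iteration with a wait before S1 and before S3.

```python
def conventional_timeline(n=9, num_pe=3):
    """Return ({pe: finish time}, {pe: idle time}) under a unit-cost model."""
    chunk = n // num_pe                 # assumes n is a multiple of num_pe
    s1_end = {0: 0}                     # finish time of S1 per iteration
    s2_end = {0: 0}                     # finish time of S2 per iteration
    finish, idle = {}, {}
    # Dependences only run from lower i to higher i, so visiting the
    # processors in order of their ranges resolves every wait.
    for pe in range(num_pe):
        t, waited = 0, 0
        for i in range(pe * chunk + 1, (pe + 1) * chunk + 1):
            start = max(t, s1_end[i - 1])   # wait for S1 of iteration i-1
            waited += start - t
            t = start + 1                   # S1
            s1_end[i] = t
            t += 1                          # S2
            s2_end[i] = t
            start = max(t, s2_end[i - 1])   # wait for S2 of iteration i-1
            waited += start - t
            t = start + 1                   # S3
        finish[pe + 1] = t
        idle[pe + 1] = waited
    return finish, idle
```

Under these assumptions, with n = 9 and three processors, PE1 never waits, while PE2 and PE3 sit idle for 7 and 14 time units respectively before their first S1 can start.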

[0005] An object of the present invention is therefore to solve these conventional problems and to provide a loop parallelizing method capable of outputting a program or object code that shortens loop execution time on a parallel computer.

[0006]

[Means for Solving the Problems] To achieve the above object, the loop parallelizing method of the present invention divides the loop body of a loop containing both statements included in a dependence cycle and statements not included in one, and has each processor execute the resulting individual loops in a different order. Specifically, processing proceeds through the following seven steps. In the first step, the data dependences of the loop are analyzed. In the second step, the loops to be parallelized are determined. In the third step, synchronization statements are inserted for statements in the parallelized loop that have a loop-carried dependence. In the fourth step, the loop body of each parallelized loop that contains statements with a loop-carried dependence is divided. In the fifth step, the loop range of the parallelized loop is divided and assigned to the processors. In the sixth step, among the loops obtained by dividing the loop body, the loop containing the statements included in the dependence cycle is scheduled onto the processors so that it executes in sequential execution order. In the seventh step, for each processor, the loops containing statements not included in the dependence cycle are scheduled, with dependences preserved, so that the synchronization wait is no longer idle time. The loop parallelizing method of the present invention generates output program 60 as shown in FIG. 6. As FIG. 6 shows, loop 1L, consisting of statement S1 included in the dependence cycle, is scheduled onto the processors so as to execute in sequential execution order, while loop 2L, consisting of statement S2, and loop 3L, consisting of statement S3, neither of which is included in the dependence cycle, are scheduled for each processor so that the synchronization wait caused by the dependence cycle is no longer idle time.

[0007] FIG. 15 shows how output program 60, produced by the loop parallelizing method of the present invention, executes in parallel. As FIG. 15 shows, the statements included in the dependence cycle are scheduled onto the processors so that they execute in sequential execution order, and, for each processor, the individual loops obtained by dividing the loop body are scheduled so that the synchronization wait caused by the dependence cycle is no longer idle time. According to the loop parallelizing method of the present invention, for a loop consisting of statements included in a dependence cycle and statements not included in one, the synchronization wait caused by the dependence cycle is no longer idle, so the execution time of the output program or object code can be shortened.

[0008]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 3 is a configuration diagram of a shared-memory parallel computer according to one embodiment of the present invention. As FIG. 3 shows, the shared-memory parallel computer consists of processors 321 to 32n, a shared memory 31, and a bus 33 connecting them. At each time T = 1 to 9 shown in FIG. 15, each of the processors 321 to 32n accesses the shared memory 31, fetches the data for the statement it is to compute, and performs the computation. A signal is sent from the processor that computes S1 for i=3 to the processor that computes S1 for i=4, and from the processor that computes S1 for i=6 to the processor that computes S1 for i=7. Likewise, a signal is sent from the processor that computes S2 for i=3 to the processor that computes S3 for i=4, and from the processor that computes S2 for i=6 to the processor that computes S3 for i=7.

[0009] FIG. 1 is a configuration diagram of a parallelizing compiler according to one embodiment of the present invention. As FIG. 1 shows, the parallelizing compiler 10 consists of a syntax analysis unit 11, a loop parallelizing unit 12, and a code generation unit 13. The parallelizing compiler 10 is stored on a storage medium such as a magnetic disk device and is loaded into main storage for execution. The syntax analysis unit 11 of the parallelizing compiler 10 reads the input program 14 from the storage medium, generates the intermediate language 15, and stores it on the storage medium. Although the storage media holding the input program 14, the intermediate language 15, and the output program 16 are drawn separately, they may of course be one and the same. The intermediate language 15 is the representation of the program inside the parallelizing compiler 10; its form does not differ from that used in an ordinary compiler, so it is not described in detail here. The loop parallelizing unit 12 is a processing unit that uses the result of the syntax analysis unit 11 to detect loops that can be parallelized, changes the loop structure, and assigns the loop processing to the processors. It consists of a loop dependence analysis unit 120, a parallelized loop determination unit 121, a synchronization generation unit 122, a loop body division unit 123, a loop range division unit 124, an inter-processor scheduling unit 125, and an intra-processor scheduling unit 126.

[0010] FIG. 2 is a flowchart showing the processing procedure of the loop parallelizing unit in FIG. 1. As FIG. 2 shows, the procedure consists of steps 20 to 26, which correspond respectively to the processing operations of the loop dependence analysis unit 120, the parallelized loop determination unit 121, the synchronization generation unit 122, the loop body division unit 123, the loop range division unit 124, the inter-processor scheduling unit 125, and the intra-processor scheduling unit 126 in FIG. 1. Step 20 is executed by the loop dependence analysis unit 120 and generates a data dependence graph for the loop. Step 21 is executed by the parallelized loop determination unit 121 and determines the DOALL loops, which can be executed in parallel without synchronization, and the DOACROSS loops, which are executed in parallel with synchronization. Step 22 is executed by the synchronization generation unit 122 and, for each DOACROSS loop, inserts synchronization statements for the statements that have a loop-carried dependence. Step 23 is executed by the loop body division unit 123 and divides the loop body of each DOACROSS loop. For the input program of FIG. 4, for example, there are three statements S1, S2, and S3, so the loop is divided into three loops, one per statement. Step 24 is executed by the loop range division unit 124, which divides the loop range and assigns it to the processors. For the input program of FIG. 4, the range is divided into i=1 to 3, i=4 to 6, and i=7 to 9 and assigned to processors PE1, PE2, and PE3.
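The loop range division of step 24 can be sketched as a small helper that returns one (plb, pub) pair per processor. The function name and the way a remainder is spread over the first processors are assumptions of this sketch; the text only gives the evenly divisible case of nine iterations on three processors.

```python
def split_loop_range(n, num_pe):
    """Divide the 1-based range 1..n into contiguous (plb, pub) blocks, one per processor."""
    base, extra = divmod(n, num_pe)
    ranges = []
    lo = 1
    for pe in range(num_pe):
        # Assumption: the first `extra` processors take one extra iteration.
        size = base + (1 if pe < extra else 0)
        ranges.append((lo, lo + size - 1))
        lo += size
    return ranges
```

For the example in the text, `split_loop_range(9, 3)` yields the ranges 1 to 3, 4 to 6, and 7 to 9 for PE1, PE2, and PE3.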

[0011] Step 25 is executed by the inter-processor scheduling unit 125. Among the loops produced by the loop body division in step 23, it schedules the loop containing the statements included in the dependence cycle onto the processors so that it executes in sequential execution order. For the input program of FIG. 4, loop 1L, which consists of the first statement, is scheduled in sequential execution order across the three processors, as shown in FIG. 12. Step 26 is executed by the intra-processor scheduling unit 126. For each processor, among the loops produced by the loop body division in step 23, it schedules the loops containing statements not included in the dependence cycle, with dependences preserved, so that the synchronization wait is no longer idle time. For the input program of FIG. 4, loop 2L for the second statement and loop 3L for the third statement are scheduled on the three processors so that the wait time is not idle, as shown in FIG. 13. The code generation unit 13 reads the intermediate language 15 from the storage medium and generates the output program 16, which is in turn stored on the storage medium. This processing does not differ from that of an ordinary compiler, so it is not described in detail here.

[0012] FIG. 4 shows the source image of an input program according to one embodiment of the present invention. The source image in FIG. 4 represents an example of input program 40, in the intermediate language 15, that is input to the loop parallelizing unit 12 of the parallelizing compiler 10. FIG. 6 shows the source image of an output program according to one embodiment of the present invention. The source image in FIG. 6 represents an example of output program 60, corresponding to input program 40, that the loop parallelizing unit 12 outputs to the intermediate language 15. An example of loop parallelization by the parallelizing compiler 10 is described below using input program 40 and output program 60. First, the loop dependence analysis unit 120 generates the data dependence graph (see FIG. 8) for loop L of input program 40.

[0013] FIG. 8 shows the data dependence graph for loop L of input program 40. In FIG. 8, the nodes of the graph represent the statements S1, S2, and S3 in the loop, and the edges represent dependences between statements. δ1 means that, at loop level 1, there is a loop-carried dependence from the source of the edge to its sink, and δ∞ means that there is a loop-independent dependence from the source to the sink. The technique for generating a data dependence graph is described in Hans Zima and Barbara Chapman, "Supercompilers for Parallel and Vector Computers", ACM Press, 1990, and is well known, so it is not described further here. A path in the dependence graph that is closed is called a cycle. As FIG. 8 shows, the statement included in a dependence cycle is S1, and the statements not included in a dependence cycle are S2 and S3. The parallelized loop determination unit 121 determines loop L of input program 40 to be a DOACROSS loop. This is because loop L of input program 40 has a loop-carried dependence, so when its iterations are executed in parallel, different iterations must synchronize with one another. The synchronization generation unit 122 generates synchronization statements in the loop of input program 40 immediately before and after statement S1, which has a loop-carried dependence, and also immediately after S2 and immediately before S3.
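The classification of statements by membership in a dependence cycle can be sketched as a reachability test on the dependence graph. The edge list below is inferred from the statements quoted in the text (S1 reads D(i-1); S3 reads C(i) and C(i-1), which this sketch assumes are written by S2); it is not a transcription of FIG. 8, and the labels `delta_1` and `delta_inf` are illustrative names for δ1 and δ∞.

```python
def statements_in_cycles(edges):
    """Return the set of statements that lie on some cycle of the dependence graph."""
    succ = {}
    nodes = set()
    for src, dst, _kind in edges:
        succ.setdefault(src, set()).add(dst)
        nodes.update((src, dst))
    in_cycle = set()
    for start in nodes:
        # Depth-first search from the successors of `start`;
        # reaching `start` again means it lies on a closed path.
        stack = list(succ.get(start, ()))
        seen = set()
        while stack:
            node = stack.pop()
            if node == start:
                in_cycle.add(start)
                break
            if node not in seen:
                seen.add(node)
                stack.extend(succ.get(node, ()))
    return in_cycle

# Assumed edges: (source, sink, dependence kind)
EDGES = [
    ("S1", "S1", "delta_1"),    # D(i) = D(i-1) + 1
    ("S2", "S3", "delta_inf"),  # S3 reads C(i)
    ("S2", "S3", "delta_1"),    # S3 reads C(i-1)
]
```

With these edges only S1 lies on a cycle, which matches the statement in the text that S2 and S3 are not included in a dependence cycle.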

[0014] FIG. 9 shows the program that results from inserting synchronization statements into input program 40. As program 90 in FIG. 9 shows, a statement that receives the signal from the immediately preceding loop iteration is inserted immediately before statement S1, and a statement that sends a signal is inserted immediately after statement S1. Likewise, a statement that sends a signal is inserted immediately after statement S2, and a statement that receives the signal from the immediately preceding loop iteration is inserted immediately before statement S3. The loop body division unit 123 divides the loop body of the DOACROSS loop and generates loop 1L consisting of S1, loop 2L consisting of S2, and loop 3L consisting of S3.
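The loop body division can be checked on a small example: running S1, S2, and S3 in three separate loops gives the same arrays as the original single loop, because every dependence still flows from an earlier loop or earlier iteration to a later one. The sketch below is sequential and omits the synchronization statements; the body of S2 and the array B are assumptions, as the text does not give them.

```python
def fused_loop(n, B):
    """Original form: S1, S2, S3 in a single loop body."""
    A, C, D = [0.0] * (n + 1), [0.0] * (n + 1), [0.0] * (n + 1)
    for i in range(1, n + 1):
        D[i] = D[i - 1] + 1               # S1
        C[i] = B[i] * 2                   # S2 (hypothetical body)
        A[i] = C[i] / 2 + C[i - 1] * 3    # S3
    return A, C, D

def distributed_loops(n, B):
    """After loop body division: one loop per statement."""
    A, C, D = [0.0] * (n + 1), [0.0] * (n + 1), [0.0] * (n + 1)
    for i in range(1, n + 1):             # loop 1L
        D[i] = D[i - 1] + 1
    for i in range(1, n + 1):             # loop 2L
        C[i] = B[i] * 2
    for i in range(1, n + 1):             # loop 3L
        A[i] = C[i] / 2 + C[i - 1] * 3
    return A, C, D
```

Once the body is divided, loops 2L and 3L no longer carry the S1 dependence, which is what lets them be moved relative to loop 1L on each processor.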

[0015] FIG. 10 shows the program that results from dividing the loop body of program 90, into which the synchronization statements were inserted. As program 100 in FIG. 10 shows, the loop body division converts DOACROSS loop L into DOACROSS loop 1L, DOALL loop 2L, and DOALL loop 3L. The loop range division unit 124 changes the loop range of the parallelized loops 1L, 2L, and 3L from 1 to n into plb to pub, where plb and pub are the lower and upper bounds of the loop range that the parallelizing compiler assigns to each processor. In addition, the synchronization statements are moved outside the loops assigned to each processor. FIG. 11 shows the program that results from dividing the loop range of program 100, whose loop body was divided. As program 110 in FIG. 11 shows, the loop range of the parallelized loops 1L, 2L, and 3L is changed from 1 to n into plb to pub, and the synchronization statements are moved outside the range-divided loops so that synchronization takes place between processors. The inter-processor scheduling unit 125 schedules loop 1L, consisting of statement S1 included in the dependence cycle of program 110, so that it executes in sequential execution order.

[0016] FIG. 12 shows the result of inter-processor scheduling. As FIG. 12 shows, with processor number on the horizontal axis and execution order on the vertical axis, loop 1L, consisting of statement S1 included in the dependence cycle, is scheduled so that PE1 executes it first, PE2 second, and so on. The intra-processor scheduling unit 126 schedules loop 2L, consisting of statement S2, and loop 3L, consisting of statement S3, neither of which is included in the dependence cycle of program 110, for each processor, with dependences preserved, so that the synchronization wait is no longer idle time. FIG. 13 shows the result of intra-processor scheduling. As FIG. 13 shows, PE2 and PE3 are scheduled to execute, during the synchronization wait, loops consisting of statements that can be executed ahead. PE2 therefore executes loop 2L, consisting of statement S2, ahead of loop 1L, consisting of statement S1, and PE3 executes loop 2L, consisting of statement S2, and loop 3L, consisting of statement S3, ahead of loop 1L. As a result of the above processing, output program 60 shown in FIG. 6 is generated.
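The effect of this schedule can be estimated with a small timing model. The sketch below is an illustration, not a reproduction of FIG. 13 or FIG. 15: it assumes one time unit per statement instance (so three units per loop chunk), free signalling, synchronization only at chunk boundaries, and the dependences inferred from the text (loop 1L waits for loop 1L on the previous processor; loop 3L waits for loop 2L on the same and the previous processor).

```python
def scheduled_timeline(orders, chunk=3):
    """orders[pe] lists loop names in execution order; returns (finish times, idle times)."""
    num_pe = len(orders)
    end = {}                                   # (loop, pe) -> finish time of that chunk
    pending = [list(order) for order in orders]
    clock = [0] * num_pe
    idle = [0] * num_pe
    progressed = True
    while any(pending) and progressed:
        progressed = False
        for pe in range(num_pe):
            if not pending[pe]:
                continue
            loop = pending[pe][0]
            needs = []
            if loop == "1L" and pe > 0:
                needs.append(("1L", pe - 1))   # D(plb-1) from the previous processor
            if loop == "3L":
                needs.append(("2L", pe))       # C(i) from the same processor
                if pe > 0:
                    needs.append(("2L", pe - 1))  # C(plb-1) from the previous processor
            if any(dep not in end for dep in needs):
                continue                       # producer not finished yet, try again later
            ready = max([end[dep] for dep in needs], default=0)
            start = max(clock[pe], ready)
            idle[pe] += start - clock[pe]
            clock[pe] = start + chunk
            end[(loop, pe)] = clock[pe]
            pending[pe].pop(0)
            progressed = True
    if any(pending):
        raise ValueError("schedule violates a dependence and would deadlock")
    return clock, idle

# Per-processor orders as described in the text (PE1, PE2, PE3).
PROPOSED = [["1L", "2L", "3L"], ["2L", "1L", "3L"], ["2L", "3L", "1L"]]
# Same order on every processor, for comparison.
UNIFORM = [["1L", "2L", "3L"]] * 3
```

Under these assumptions the proposed orders finish on all three processors at time 9 with no idle time, which agrees with the times T = 1 to 9 mentioned for FIG. 15, whereas the uniform order leaves PE2 and PE3 waiting.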

[0017] FIG. 7 shows the source image of an output program according to another embodiment of the present invention. Besides FIG. 6, another example of the source image of the output program is output program 70 shown in FIG. 7. Output program 70 in FIG. 7 has the same meaning as output program 60 in FIG. 6, but its code size is smaller and, as a single program multiple data (SPMD) model, it is more general. That is, the processing performed is exactly the same as in FIG. 6, but it is written differently: in the subroutine, only loop 1L, consisting of statement S1, is executed in sequential order, while loop 2L, consisting of statement S2, and loop 3L, consisting of statement S3, are executed by processor PE1 in its own order, by processor PE2 in its own order, and likewise by processor PE3 in its own order. When the number of processors is small, the code size differs little between FIG. 6 and FIG. 7, but the difference becomes considerable as the number of processors grows. This concludes the description of the loop parallelization example using input program 40 and output program 60.
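An SPMD-style version can be sketched with Python threads: every thread runs the same function and derives its loop order from its own processor index. FIG. 7 is not reproduced in the text, so the rule used here to pick the order (loop 1L is placed at position `pe` among the loops), the `threading.Event` signalling, the body of S2, and the array B are all assumptions of this sketch.

```python
import threading

def run_spmd(n=9, num_pe=3):
    """Each thread runs the same body; the loop order depends only on its index pe."""
    chunk = n // num_pe                        # assumes n is a multiple of num_pe
    B = [float(i) for i in range(n + 1)]       # hypothetical input array
    A, C, D = [0.0] * (n + 1), [0.0] * (n + 1), [0.0] * (n + 1)
    done_1L = [threading.Event() for _ in range(num_pe)]
    done_2L = [threading.Event() for _ in range(num_pe)]

    def loop_1L(pe, plb, pub):
        if pe > 0:
            done_1L[pe - 1].wait()             # wait once, outside the loop
        for i in range(plb, pub + 1):
            D[i] = D[i - 1] + 1                # S1
        done_1L[pe].set()                      # signal once, outside the loop

    def loop_2L(pe, plb, pub):
        for i in range(plb, pub + 1):
            C[i] = B[i] * 2                    # S2 (hypothetical body)
        done_2L[pe].set()

    def loop_3L(pe, plb, pub):
        if pe > 0:
            done_2L[pe - 1].wait()             # needs C(plb-1) from the previous processor
        for i in range(plb, pub + 1):
            A[i] = C[i] / 2 + C[i - 1] * 3     # S3

    def body(pe):
        plb, pub = pe * chunk + 1, (pe + 1) * chunk
        loops = [loop_2L, loop_3L]
        # PE1 runs 1L first, PE2 second, PE3 (and any later PE) last.
        loops.insert(min(pe, len(loops)), loop_1L)
        for loop in loops:
            loop(pe, plb, pub)

    threads = [threading.Thread(target=body, args=(pe,)) for pe in range(num_pe)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return A, C, D
```

The single `body` function does not grow with the number of processors, which reflects the code-size advantage the text attributes to the SPMD form.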

[0018] Without this parallelization, the synchronization wait caused by the dependence cycle is idle time, so execution takes a long time. With loop parallelization as in the present invention, statements that can be executed ahead are executed during the synchronization wait caused by the dependence cycle, so the wait is no longer idle and program execution is sped up. The present invention is not limited to the embodiment described above and can of course be modified in various ways without departing from its gist. For example, signal-based synchronization may be replaced by barrier synchronization, and the invention may be applied to a distributed-memory parallel computer.

[0019]

[Effects of the Invention] As described above, according to the present invention, for a loop consisting of statements included in a dependence cycle and statements not included in one, the synchronization wait caused by the dependence cycle is no longer idle time, so the execution time of the output program or object code can be shortened.

[Brief description of the drawings]

FIG. 1 is a configuration diagram of a parallelizing compiler according to one embodiment of the present invention.

FIG. 2 is a flowchart showing the processing procedure of the loop parallelizing unit in FIG. 1.

FIG. 3 is a diagram showing a configuration example of a shared-memory parallel computer to which the present invention is applied.

FIG. 4 is a diagram showing an example of the source image of an input program.

FIG. 5 is a diagram showing an example of the source image of an output program according to the prior art.

FIG. 6 is a diagram showing an example of the source image of an output program according to the present invention.

FIG. 7 is a diagram showing another example of the source image of an output program according to the present invention.

FIG. 8 is a diagram showing the data dependence graph for FIG. 4.

FIG. 9 is a diagram showing an example of the source image of the program obtained by inserting synchronization statements into FIG. 4.

FIG. 10 is a diagram showing an example of the source image of the program obtained by dividing the loop body of FIG. 9.

FIG. 11 is a diagram showing an example of the source image of the program obtained by dividing the loop range of FIG. 10.

FIG. 12 is a diagram showing the inter-processor schedule for FIG. 11.

FIG. 13 is a diagram showing the intra-processor schedule for FIG. 12.

FIG. 14 is a diagram showing an example of parallel execution according to the prior art.

FIG. 15 is a diagram showing an example of parallel execution according to one embodiment of the present invention.

[Explanation of symbols]

10: parallelizing compiler, 11: syntax analysis unit, 12: loop parallelizing unit, 13: code generation unit, 14: input program, 15: intermediate language, 16: output program, 31: shared memory, 33: bus, 321 to 32n: processors, 40: input program, 50: output program according to the prior art, 60, 70: output programs according to the present invention, 80: data dependence graph, 90: program after insertion of synchronization statements, 100: program after loop body division, 110: program after loop range division, 120: loop dependence analysis unit, 121: parallelized loop determination unit, 122: synchronization generation unit, 123: loop body division unit, 124: loop range division unit, 125: inter-processor scheduling unit, 126: intra-processor scheduling unit.

Claims

[Claims]

1. A loop parallelizing method that takes as input a computer program including statements that perform loop processing and outputs a parallelized program or object code to be executed on a plurality of processors, characterized in that, when the loop contains statements included in a dependence cycle and statements not included in a dependence cycle, each processor executes the plurality of statements in the same loop, comprising both kinds of statements, in an order that, taken over the processors as a whole, differs from that of another processor.

2. A loop parallelizing method that takes as input a computer program including statements that perform loop processing and outputs a parallelized program or object code to be executed on a plurality of processors, characterized in that, when the loop contains statements included in a dependence cycle and statements not included in a dependence cycle, the loop body is first divided; the loop range of the loop containing the statements included in the dependence cycle is then divided; that loop is then assigned to the processors so that it is executed in sequential execution order; and then, for each processor, the loops are reordered so that a loop containing other statements that can be executed ahead is executed during the wait for inter-processor synchronization of the loop containing the statements included in the dependence cycle.

3. A loop parallelizing method that takes as input a computer program including statements that perform loop processing and outputs a parallelized program or object code to be executed on a plurality of processors, characterized by comprising: a step of analyzing the data dependences of the loop; a step of determining, from the result of the analysis, the loops to be parallelized; a step of inserting synchronization statements for statements in the parallelized loop that have a loop-carried dependence; a step of dividing into several parts the loop body of the parallelized loop that contains the statements with a loop-carried dependence; a step of dividing the loop range of the divided parallelized loops and assigning the loops of each divided range to the processors; a step of scheduling the loop containing the statements included in the dependence cycle onto the processors so that it is executed in sequential execution order; and a step of scheduling, for each processor, the loops containing statements not included in the dependence cycle, with dependences preserved, so that the synchronization wait caused by the dependence cycle is no longer idle time.