JPH09319722A - Parallel compiling method

Info

Publication number: JPH09319722A
Application number: JP12806596A
Authority: JP
Inventors: Toshio Suganuma; 俊夫菅沼; Hideaki Komatsu; 秀昭小松
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 1996-05-23
Filing date: 1996-05-23
Publication date: 1997-12-12

Abstract

PROBLEM TO BE SOLVED: To make a compiled object program run fast by correctly extracting reductions, a loop pattern that appears frequently in source programs, from a source program. SOLUTION: A reduction is detected in a loop (step 21) and converted into a loop that can be executed efficiently in parallel (step 22). Finally, reduction communication is generated on the basis of the converted loop and optimized (step 23). The detection of a reduction consists of three sub-steps. First, a model is built of the mask expression that masks each assignment statement in the loop; that is, an expression tree describing the execution condition of each assignment statement is generated. Then, whether a reduction is present is determined from the expression tree of the conditional expressions and from the expression trees of the assignment statements, merged according to their data dependence relations.

Description

Detailed Description of the Invention

[0001]

[Field of the Invention] The present invention relates to a compiler for parallel computers, and more particularly to the detection and processing of reductions in a source program.

[0002]

[Description of the Related Art] In recent years, parallel computers having a plurality of processors have been coming into practical use. Such a computer achieves high processing capability by operating its processors simultaneously. As shown in FIG. 1, a parallel computer has a plurality of processor elements, each consisting of a processor, a local memory, and a data transfer device, and the processor elements are connected via a bus. To execute a source program written in a high-level language such as C or FORTRAN efficiently on a parallel computer, a parallelizing compiler that translates the source program into an object program for the parallel computer is required. The key roles of the parallelizing compiler are to accurately detect the parts of the source program that can be executed in parallel and to distribute their execution efficiently across the processor elements.

[0003] Reductions are among the constructs that appear frequently in numerical programs and the like. A reduction is an operation that computes a single result from array data using operators of a single kind for which the commutative and associative laws hold. For example, finding the maximum value, the minimum value, or an index value among a large number of array elements is a typical reduction. Operators for which the commutative and associative laws hold include "+", "*", "max", and "min". Because the operators must be of the same kind, an operation that mixes different operators is not a reduction; a reduction is built from a single operator only, for example "+".
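As a minimal illustration of this definition (a Python sketch that is not part of the patent text; the helper name `reduce_array` and the sample data are assumptions), a reduction folds an array into one value with a single commutative, associative operator, so the order in which the elements are combined does not affect the result:

```python
from functools import reduce
import operator


def reduce_array(op, values):
    """Fold a non-empty sequence into one value with a single binary operator."""
    return reduce(op, values)


data = [3, -7, 5, 2]  # hypothetical sample data

total = reduce_array(operator.add, data)     # "+" reduction
product = reduce_array(operator.mul, data)   # "*" reduction
largest = reduce_array(max, data)            # "max" reduction
smallest = reduce_array(min, data)           # "min" reduction

# Commutativity and associativity mean the element order does not matter.
same_total = reduce_array(operator.add, list(reversed(data)))
```

An operator such as "-" would not qualify, because reordering the operands changes the result.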

[0004] When executed on a single processor, a reduction is a loop that is computed sequentially. On a parallel computer, however, it can be executed in parallel. When a reduction has been detected, each processor element first performs a local computation on the array data stored in its own local memory. These local results are then transferred to one processor element, which combines them to obtain the final result. Finally, to keep the data held by the processor elements identical, the final result is transferred back to every processor element. Distributing the reduction across the processor elements in this way takes far less execution time than computing it sequentially. It is therefore very important that the parallelizing compiler correctly detect reduction constructs in the source program, so that it can generate an object program that the parallel computer executes at high speed.
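The three phases described above can be sketched as follows. This is a sequential Python simulation under stated assumptions, not the patent's run-time mechanism: each inner list stands for the local memory of one processor element, every local array is assumed non-empty, and the name `parallel_reduce` is hypothetical.

```python
from functools import reduce
import operator


def parallel_reduce(op, local_arrays):
    """Simulate the three phases on a list of per-processor local arrays."""
    # Phase 1: every processor element reduces the data in its own local memory.
    local_results = [reduce(op, chunk) for chunk in local_arrays]
    # Phase 2: the local results are sent to one processor element and combined.
    final_result = reduce(op, local_results)
    # Phase 3: the final result is sent back so every element holds the same value.
    return [final_result] * len(local_arrays)


# Three simulated processor elements, each holding part of the array.
local_arrays = [[1, 2], [3, 4], [5]]
sums = parallel_reduce(operator.add, local_arrays)
maxima = parallel_reduce(max, local_arrays)
```

On a real machine phase 1 runs concurrently on all processor elements, which is where the saving in execution time comes from.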

[0005] Conventionally, reductions have been detected by idiom recognition. Idiom recognition is pattern matching: it checks whether a piece of the source program matches a pattern assumed in advance and detects a reduction only when it does. The pattern is expressed in terms of a specific assignment statement, a loop, and the conditional statements inside that loop. When idiom recognition identifies a loop in the source program as, for example, a reduction that finds the maximum value of array data, the parallelizing compiler converts the loop into a call to a run-time routine that implements the parallel procedure for finding a maximum. At run time, the reduction is then executed in parallel by this routine.

[0006] In the conventional idiom recognition method, only code that exactly matches a pattern that the developers of the parallelizing compiler assumed to be a common, frequently used form is detected as a reduction. If the form of a reduction differs even slightly from the assumed pattern, it is not detected. In real programs, however, reductions are often written with conditional expressions intricately intertwined inside the loop, and they come in a great many variations. The conventional method therefore fails to recognize a reduction written in a complex form outside the assumed patterns, even though it is in essence a reduction.

[0007]

[Problems to Be Solved by the Invention] As described above, a parallelizing compiler that uses the conventional reduction detection method cannot detect reductions effectively at compile time. As a result, an operation that is inherently capable of parallel execution may not be compiled as such and may instead be generated as an object program that runs sequentially. In such cases it is difficult for the parallel computer to execute the operation efficiently.

[0008] An object of the present invention is therefore to provide a parallelizing compiler capable of generating an object program that can be executed efficiently in parallel.

[0009] Another object of the present invention is to correctly detect reduction constructs in a source program.

[0010] A further object of the present invention is to effectively shorten the execution time of the reductions detected in this way.

[0011]

[Means for Solving the Problems] In view of these problems, a first invention provides a parallelizing compilation method that translates a source program into an object program for a parallel computer having a plurality of processors, the method comprising the steps of: analyzing the structure of the conditional expressions that mask the assignment statements in a loop of the source program; analyzing the structure of the assignment statements; merging the structures of the assignment statements on the basis of the data dependence relations of variables; and detecting a reduction on the basis of the structure of the conditional expressions and the merged structure of the assignment statements, with reference to the data dependence relations of the variables and the operators in the assignment statements.

[0012] The step of detecting a reduction is divided into cases according to the form of the right-hand side of the assignment statement. When the right-hand side is an expression, a reduction is determined by referring to the data dependence relation of the recurrence variable among the variables and by checking that the operators in the assignment statement are operators of a single kind for which the commutative and associative laws hold. When the right-hand side is a single variable, a reduction is determined on the basis of the structure of the conditional statement.

[0013] To analyze the structure of the conditional expressions, an expression tree is generated in which each conditional expression corresponds to a node and each assignment statement corresponds to a leaf. On the basis of this model of the conditional expressions, the compiler analyzes whether the structure of the loop as a whole can constitute a reduction.

[0014] To analyze the structure of the assignment statements, an expression tree whose nodes correspond to the elements of an assignment statement (for example, variables and operators) is generated for each leaf of the expression tree representing the structure of the conditional expressions in the loop. On the basis of this model of the assignment statements, the compiler analyzes whether the structure of the assignment statements can constitute a reduction.

[0015] A second invention provides a parallelizing compilation method that translates a source program into an object program for a parallel computer having a plurality of processor elements, the method comprising the steps of: detecting a reduction controlled by a reduction variable; converting the reduction variable into a variable private to each processor so that local intermediate results of the reduction are computed in parallel on each processor; and generating reduction communication to combine the local results computed in parallel on the processors.

[0016] In the second invention, the step of generating reduction communication may be configured so that, when a plurality of reductions exist in a single loop, the local results for the plurality of reductions computed by each processor are vectorized and communicated together in order to combine them.
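The idea of vectorizing the communication can be sketched as follows. This is a Python illustration under stated assumptions, not the patent's communication code: the function name `combine_vectors` and the sample values are hypothetical, and the "communication" is simulated by passing lists.

```python
import operator


def combine_vectors(ops, local_vectors):
    """Combine per-processor result vectors element by element.

    ops[k] is the operator of the k-th reduction; local_vectors[p][k] is the
    local result of the k-th reduction on processor p.
    """
    combined = list(local_vectors[0])
    for vector in local_vectors[1:]:
        combined = [op(a, b) for op, a, b in zip(ops, combined, vector)]
    return combined


# Two reductions in one loop: a sum and a minimum. Each processor sends a
# single two-element vector instead of two separate messages.
ops = [operator.add, min]
local_vectors = [[10, 4], [5, -2], [1, 7]]  # hypothetical local results
combined = combine_vectors(ops, local_vectors)
```

Packing the local results into one vector means the number of messages depends on the number of processors, not on the number of reductions in the loop.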

[0017] A third invention provides a storage medium storing a compiler program that uses the parallelizing compilation method configured as described above.

[0018] A fourth invention provides a parallel computer that executes an object program compiled by the parallelizing compilation method configured as described above.

[0019]

[Operation] Because a reduction has structural characteristics in the assignment statement itself, that structure must be analyzed. It is also necessary, however, to analyze what data dependence relations the variables of the assignment statement have with other assignment statements. In the first invention, whether a reduction is present is determined by capturing both the relations among the conditional expressions in the loop and the relations among the assignment statements. Because the structures of the assignment statements are merged on the basis of data dependence relations, a structural model that does not depend on how the assignment statements are written can be generated.

[0020] In the second invention, noting that a reduction is inherently capable of parallel execution, the reduction variable is replaced with a private variable local to each processor.

[0021]

[Embodiments of the Invention] A preferred embodiment of the present invention is described below. The parallelizing compiler of this embodiment targets a system in which each processor element has its own local memory, as shown in FIG. 1. The purpose of the compiler is to generate an object program that can be executed efficiently on such a system.

[0022] FIG. 2 shows the basic operation flow of the parallelizing compiler of this embodiment, from the detection of a reduction to the optimization of reduction communication. First, a reduction is detected in a loop (step 21). Next, the detected reduction is converted into a loop that can be executed efficiently in parallel (step 22). Finally, reduction communication based on this conversion is generated and optimized (step 23).

[0023] In the following description, the sample program below is used where appropriate to explain the process from the detection of a reduction to the optimization of communication. This program finds the element of the array rx that has the smallest absolute value and stores it in rxm.

[Equation 1]

          do 200 j = 1, m
          do 200 i = 1, n
            if (mod(i,2) == 1) then
              sum = sum + rm(i,j)                                          ! (a)
              if ((abs(rx(i,j)) < abs(rxm)) .and. (mod(j,2) == 1)) then
                rxm = rx(i,j)                                              ! (b)
                irxm = i                                                   ! (c)
                sum = sum + rx(i,j)                                        ! (d)
              endif
            endif
      200 continue

[0024] Detection of reductions (step 21)

As shown in FIG. 3, the detection of a reduction is further divided into three sub-steps. First, a model is built of the mask expression that masks each assignment statement in the loop. That is, for each assignment statement, an expression tree describing the conditions under which it is executed is generated (step 31).

[0025] Here, a mask expression is a condition that must hold for an assignment statement to be executed, such as the condition of an if statement. Each node of this expression tree represents one term of a logical conjunction. While the expression tree is being built, a new node is created each time a control statement such as "if", "elseif", or "else" is traversed.

[0026] FIG. 4 shows the relation between the assignment statements in a loop body and the expression tree of conditional expressions generated from them. Each node of the expression tree corresponds to a conditional expression serving as a mask expression (for example, condition C1 and its negation ^C1), and each leaf corresponds to assignment statements (S1, S2, and so on). Traversing the expression tree from the root (C1 or ^C1) to a leaf, the conjunction of the conditional expressions at the nodes along the way is the mask expression that governs execution of the assignment statements at that leaf. For example, the mask expression for the leaf {S2, S3} is as follows.

[Equation 2] C1 and ^C2 and C3
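The construction of mask expressions can be sketched in Python as follows. This is a minimal illustration, not the patent's data structure: the class `CondNode`, the function `mask_expressions`, and the shape of the example tree are assumptions chosen so that the leaf {S2, S3} yields the mask shown above.

```python
from dataclasses import dataclass, field


@dataclass
class CondNode:
    cond: str                                      # conditional expression at this node
    stmts: list = field(default_factory=list)      # leaf: assignment statements guarded here
    children: list = field(default_factory=list)   # nested conditional nodes


def mask_expressions(node, path=()):
    """Map each leaf (tuple of statements) to the conjunction of conditions above it."""
    path = path + (node.cond,)
    masks = {}
    if node.stmts:
        masks[tuple(node.stmts)] = " and ".join(path)
    for child in node.children:
        masks.update(mask_expressions(child, path))
    return masks


# Hypothetical tree: S1 under C1 and C2; S2 and S3 under C1, ^C2, and C3.
tree = CondNode("C1", children=[
    CondNode("C2", stmts=["S1"]),
    CondNode("^C2", children=[CondNode("C3", stmts=["S2", "S3"])]),
])
masks = mask_expressions(tree)
```

Here `masks[("S2", "S3")]` is the string "C1 and ^C2 and C3", matching Equation 2.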

[0027] Generating the expression tree of conditional expressions for the sample program above according to these rules gives the tree shown in FIG. 5.

[0028] Next, the expression trees of the assignment statements in the loop are generated (step 32). For each leaf of the mask-expression tree generated in step 31, an expression tree of the assignment statements corresponding to that leaf is generated. Each node of an assignment-statement expression tree is associated with one element of the assignment statement.

[0029] Here, the elements of an assignment statement are the variables, operators, and so on that make it up. For example, in the assignment statement s = s * a(i), the elements are "s", "s", "*", and "a(i)". As an example, the expression tree of the assignment statements of the program shown in FIG. 6(a) is as shown in FIG. 6(b). With the conditional expression C that masks the assignment statements S₁ and S₂ as the root of the tree, the elements of the assignment statements are assigned to the individual nodes.
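An assignment-statement expression tree of this kind can be sketched in Python as follows. This is an illustration only: the class `ExprNode`, the function `preorder`, and the choice of "=" as the root of a single statement are assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class ExprNode:
    label: str            # a variable, an operator, or "="
    children: tuple = ()  # operand subtrees, left to right


def preorder(node):
    """List the node labels of an expression tree in pre-order."""
    labels = [node.label]
    for child in node.children:
        labels.extend(preorder(child))
    return labels


# s = s * a(i): the root "=" has the left-hand side and the right-hand side as children.
tree = ExprNode("=", (
    ExprNode("s"),
    ExprNode("*", (ExprNode("s"), ExprNode("a(i)"))),
))
labels = preorder(tree)
```

The pre-order listing contains "=" followed by the four elements named in the text: "s", "*", "s", and "a(i)".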

[0030] Creating the expression trees of the assignment statements for the sample program above according to these rules gives the trees shown in FIG. 7.

[0031] The expression trees of the assignment statements obtained in step 32 are merged on the basis of the data dependence relations of the variables (step 33). In the example of FIG. 6, for the two assignment statements S₁ and S₂, the variable s has true (def-to-use) data dependences, shown by the dotted arrows δ₀ and δ₁ in FIG. 6(b). The data dependence graph therefore forms a strongly connected graph, as shown in FIG. 6(c).

[0032] Such a data dependence relation is called a strongly connected component, and it is the unit in which reductions are detected. If the assignment statements that make up a strongly connected component share the same node of the conditional-expression tree, their expression trees are merged so as to eliminate the loop-independent data dependence (the arrow δ₀ in FIG. 5(b)). In this way the expression trees of FIG. 6(b) are merged to obtain the expression tree of FIG. 6(c). Note that no such merging is performed between assignment statements whose mask expressions differ. This operation is carried out for all assignment statements.

[0033] FIG. 8 shows the expression trees, with their data dependence relations, of the assignment statements of the sample program generated under these rules. Assignment statements (b) and (c) are omitted from the figure. As the figure shows, the data dependence relations among the expression trees (see FIG. 7) for assignment statements (a), (b), (c), and (d) of the sample program are examined. Assignment statements (a) and (d) form a strongly connected component through the data dependences between them, but because they do not share a conditional expression, their expression trees are not merged.
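The search for strongly connected components and the merge condition can be sketched in Python as follows. This is a simplified illustration, not the patent's algorithm: the dependence edges and the mask names C1 and C2 are assumptions that model statements (a) to (d) of the sample program, and the component search uses plain mutual reachability rather than a more efficient method.

```python
def reachable(graph, start):
    """Return the set of nodes reachable from start along one or more edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for succ in graph.get(node, ()):
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return seen


def strongly_connected_components(graph):
    """Group nodes that can reach each other (simple, not the fastest method)."""
    nodes = set(graph) | {s for succs in graph.values() for s in succs}
    reach = {n: reachable(graph, n) for n in nodes}
    components, assigned = [], set()
    for n in sorted(nodes):
        if n not in assigned:
            comp = {n} | {m for m in reach[n] if n in reach[m]}
            components.append(comp)
            assigned |= comp
    return components


def mergeable(component, mask_of):
    """Trees are merged only if every statement has the same mask expression."""
    return len({mask_of[s] for s in component}) == 1


# Assumed true dependences on `sum`: (a) feeds (d), and (d) feeds (a) in a later iteration.
deps = {"a": ["d"], "d": ["a"], "b": [], "c": []}
# Assumed masks: C1 is the outer condition, C2 the inner one.
mask_of = {"a": "C1", "b": "C1 and C2", "c": "C1 and C2", "d": "C1 and C2"}
components = strongly_connected_components(deps)
candidate = next(c for c in components if "a" in c)
```

With these assumed inputs, (a) and (d) fall into one component, and `mergeable(candidate, mask_of)` is false because their masks differ, which mirrors the outcome described for FIG. 8.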

[0034] The important point here is that a reduction, by its nature, has data dependences among its constituent assignment statements that make them a strongly connected component (see FIG. 6(c)). Exploiting this property, the algorithm looks for strongly connected components and merges the expression trees of their assignment statements. Because the structures of the assignment statements are merged on the basis of data dependence relations, reductions can be judged against a structural model that does not depend on how the assignment statements are written.

[0035] The fact that assignment statements form a strongly connected component does not by itself mean that they constitute a reduction; the loop as a whole must also be analyzed. Finding the strongly connected components in this step, however, yields the candidates for reduction detection.

[0036] Whether a reduction is present is determined from the expression tree of the conditional expressions and the expression trees of the assignment statements merged on the basis of the data dependence relations (step 34). That is, it is determined whether the code identified as a reduction candidate in the preceding steps actually constitutes a reduction. The important point is that this determination refers to the data dependence relations of the variables and to the operators written in the assignment statements.

[0037] This determination falls into the following two cases according to the form of the right-hand side in the data dependence graph, that is, according to whether the right-hand side in the merged expression tree of the assignment statements is an expression or a single variable reference.

[0038] Case A. The right-hand side is an expression rather than a single variable reference

This is the case in which, for example, the right side of the root (=) is an expression, as in FIG. 6(d). Here a reduction is recognized when all of the following four requirements are satisfied.

[0039] (1) A recurrence variable of the graph must be present on the right-hand side of the graph. A recurrence variable is a referenced variable on the right-hand side that has a forward true data dependence from the variable on the left-hand side of the graph. In the example of FIG. 6, it is the referenced variable s at the head of the arrow δ₁.

[0040] (2) In the graph, when traversing from the node of the recurrence variable (the variable s at the head of the arrow δ₁) to the root (=), all the operators at the intermediate nodes must be operators for which the commutative and associative laws hold and must all be of the same kind. Operators for which the commutative and associative laws hold include "+", "*", "max", and "min". Because the operators must be of the same kind, no operator of a different kind may appear at an intermediate node. In FIG. 6(d), the intermediate nodes contain the different operators "*" and "+", so this requirement is not satisfied. The code of FIG. 6(a) is therefore not a reduction.

[0041] (3) When the graph G₀ corresponding to a leaf of the conditional-expression tree is part of a strongly connected component, every other graph G₁ belonging to that strongly connected component must also be closed under the same kind of operator as that found in requirement (2). That is, when graphs G₀ and G₁ belong to the same strongly connected component but were not merged in step 33 because they do not share a node of the conditional-expression tree, the same operator must appear at the intermediate nodes of both graphs. When G₁ exists, a plurality of assignment statements guarded by different conditional expressions together constitute the reduction.

[0042] (4) The node of the recurrence variable must have no data dependence relation with any node outside the strongly connected component of the graph.
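Requirements (1) and (2) of Case A can be sketched in Python as follows; requirements (3) and (4) need dependence information and are not modeled. This is an illustration under stated assumptions: right-hand sides are written as nested tuples `(operator, operand, ...)`, the function names are hypothetical, and the second example expression is invented to show a mix of "*" and "+" rather than taken from FIG. 6.

```python
REDUCTION_OPS = {"+", "*", "max", "min"}  # commutative and associative operators


def operators_to(expr, var):
    """Return the operators on the path from the top of expr down to var, or None."""
    if expr == var:
        return []
    if isinstance(expr, tuple):
        op, *args = expr
        for arg in args:
            below = operators_to(arg, var)
            if below is not None:
                return [op] + below
    return None


def closed_under_one_operator(rhs, recurrence_var):
    """Return the single reduction operator on the path, or None if the check fails."""
    ops = operators_to(rhs, recurrence_var)
    if not ops:  # variable absent from rhs, or rhs is the bare variable
        return None
    kinds = set(ops)
    if len(kinds) == 1 and kinds <= REDUCTION_OPS:
        return ops[0]
    return None


# sum = sum + rx(i,j): only "+" lies between the recurrence variable and the root.
plus_case = closed_under_one_operator(("+", "sum", "rx(i,j)"), "sum")
# Hypothetical s = (s * a(i)) + b(i): "*" and "+" are mixed on the path.
mixed_case = closed_under_one_operator(("+", ("*", "s", "a(i)"), "b(i)"), "s")
```

The first call returns "+", and the second returns `None` because two different operators lie on the path.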

[0043] Case B. The right-hand side of the graph is a single variable reference that is not loop-invariant

In this case, the nodes of the conditional-expression tree corresponding to the graph are followed up to the root, and the logical expression C at each node and the shape of the graph are checked. Specifically, a reduction is recognized when all of the following six requirements are satisfied. These requirements serve to detect the MAXVAL, MINVAL, MAXLOC, and MINLOC reductions written with conditional statements, and each follows from the meaning of those operations. In other words, for an assignment statement of the form z = E, the check looks for a conditional statement whose logical expression has the following form.

[Equation 3] f(z) relop f(E), where f is an elemental function and relop is a comparison operator such as <, ≤, >, or ≥.

[0044] (1) The left-hand-side variable node of the graph has no data dependence relation with any external node.

[0045] (2) The logical expression C of the node is not a logical expression joined by "or".

[0046] (3) The logical expression C is expressed with an inequality sign.

[0047] (4) Each side of the inequality in C is either a single variable reference or such a variable with an elemental operation applied to it. When an elemental operation is applied, the same operation is applied to both sides. Here, an elemental operation is one that returns a scalar result for a scalar argument; applied to an array, it is applied to each array element. Examples include "abs", "sqrt", "sin", and "cos".

【００４８】（５）論理式Ｃの不等号のいずれかの辺の
変数（インデックス変数も含む）が、グラフの右辺変数
と同一である。変数が一致しない場合には、論理式Ｃの
不等号における変数のデータ依存関係を遡ることによっ
て、グラフの右辺変数と同一になる代入文の右辺が存在
する。(5) The variable (including the index variable) on either side of the inequality sign of the logical expression C is the same as the right side variable of the graph. If the variables do not match, the right side of the assignment statement that is the same as the right side variable of the graph exists by tracing back the data dependence of the variable in the inequality sign of the logical expression C.

[0049] (6) The variable (including an index variable) on the other side of the inequality in the logical expression C is identical to the left-hand-side variable of the graph, or tracing back the data dependence reaches an assignment statement whose right-hand side is identical to the left-hand-side variable of the graph. In this case, the direction of the inequality and whether the variable was an index variable determine whether the reduction is "maxval", "minval", "maxloc", or "minloc". Alternatively, if in condition (5) an index variable was identical to the right-hand-side variable of the graph, there is another graph that shares the same conditional expression and has been determined to be "maxval" or "minval" at node C. In that case, the graph is determined to be "maxloc" or "minloc" in accordance with that determination.
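
Conditions (3) to (6) can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions, not the compiler's actual implementation: the names Guard, classify, and classify_location are hypothetical, the data-dependence tracing allowed by conditions (5) and (6) is omitted, and conditions (1) and (2) are assumed to have been checked beforehand.

```python
from dataclasses import dataclass
from typing import Optional

# Swapping the two sides of an inequality reverses its direction.
FLIP = {"<": ">", "<=": ">=", ">": "<", ">=": "<="}


@dataclass(frozen=True)
class Guard:
    """Logical expression C of the form f(left) relop f(right) (hypothetical type)."""
    left: str
    relop: str
    right: str
    func: Optional[str] = None  # one elemental function applied to both sides, e.g. "abs"


def classify(lhs: str, rhs: str, guard: Guard) -> Optional[str]:
    """Classify the guarded statement `lhs = rhs` as "minval" or "maxval", else None."""
    if guard.relop not in FLIP:
        return None  # condition (3): C must be an inequality
    if {guard.left, guard.right} != {lhs, rhs}:
        return None  # conditions (5), (6): one side is rhs, the other is lhs
    # Normalise C to "f(rhs) op f(lhs)": the new value replaces the running one.
    op = guard.relop if guard.left == rhs else FLIP[guard.relop]
    return "minval" if op in ("<", "<=") else "maxval"


def classify_location(rhs: str, index_vars: tuple, sibling_kind: Optional[str]) -> Optional[str]:
    """An index variable stored under the same guard as a minval/maxval graph."""
    if rhs in index_vars and sibling_kind in ("minval", "maxval"):
        return sibling_kind.replace("val", "loc")
    return None
```

For the statement rxm = rx(i,j) guarded by abs(rx(i,j)) < abs(rxm), classify returns "minval", and classify_location for the index variable i stored under the same guard returns "minloc".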

[0050] Next, the reduction candidates in the sample program are examined to see whether they actually constitute reductions. That is, one of the two cases above is applied to each expression tree. For the expression tree of assignment statement (a), the right-hand side is not a single variable reference but an expression, so Case A applies. As shown in FIG. 8, tracing from the recurrence variable sum on the right-hand side of expression tree (a) up to its root (=), the only operator at the intermediate nodes is "+". The path is therefore closed under an operator for which the commutative and associative laws hold. In addition, there is an expression tree (d) that does not share a conditional-expression node with expression tree (a) but belongs to the same strongly connected component, so expression tree (d) must also be examined. It is likewise closed under the same operator. Expression trees (a) and (d) are therefore recognized as together constituting a sum reduction.
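
The Case A check described above, walking from the recurrence variable up to the root and inspecting the operators on the way, can be sketched as follows. This is a minimal illustration under stated assumptions: the tuple-based tree encoding, the operator set, and the function names are hypothetical, and the sketch follows only the first occurrence of the recurrence variable.

```python
# Operators assumed here to satisfy the commutative and associative laws.
COMMUTATIVE_ASSOCIATIVE = {"+", "*", "max", "min"}


def operators_on_path(tree, target):
    """Operators between the root of `tree` and the leaf `target`, or None if absent.

    A tree is either a leaf (a string) or a tuple (op, left, right).
    """
    if isinstance(tree, str):
        return [] if tree == target else None
    op, left, right = tree
    for child in (left, right):
        path = operators_on_path(child, target)
        if path is not None:
            return [op] + path
    return None


def is_case_a_reduction(lhs, rhs_tree):
    """True if every operator from the recurrence variable to the root is the
    same commutative and associative operator."""
    path = operators_on_path(rhs_tree, lhs)
    if not path:  # recurrence variable absent, or right-hand side is a bare variable
        return False
    return len(set(path)) == 1 and path[0] in COMMUTATIVE_ASSOCIATIVE
```

For sum = sum + rm(i,j) the path contains only "+", so the statement qualifies; a path that mixes "*" and "+", or that passes through "-", does not.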

[0051] For the expression tree of assignment statement (b), the right-hand side is a single variable reference, so Case B applies. The nodes of the conditional-expression tree are traversed back to the root through C3, C2, and C1. For the assignment statement rxm = rx(i,j), C2 is the expression abs(rxm) ≥ abs(rx(i,j)), which satisfies the requirements above. Statement (b) is therefore recognized as a reduction that finds the value with the smallest absolute value in the array rx. In assignment statement (c), the right-hand-side variable is an index variable of the array rx in conditional expression C2, and a reduction that finds the minimum absolute value exists. Statement (c) is therefore recognized as the reduction that finds the index at which that minimum occurs.

[0052] Through the above steps, the loop in the sample program is found to contain the following reductions: assignment statement (b) finds the minimum absolute value of the array rx, assignment statement (c) finds the loop index at which it occurs, and assignment statements (a) and (d) compute the sum over the arrays rm and rx.

[0053] Conversion of the reduction into a parallelizable loop (Step 22)

Next, the reduction loop is converted into a parallelizable loop so that it can be executed at high speed by a plurality of processors. The reduction construct detected in step 21 is generally expressed as an imperfect loop containing at least one assignment statement, and the reduction does not necessarily form a reduction loop by itself. To execute the reduction efficiently on a plurality of processors while allowing for such general forms, a conversion procedure for the loop containing the reduction must be determined. This requires a communication analysis for each array operand that is a target of the reduction.

[0054] FIG. 9 is an operation flow diagram for the loop conversion of a reduction. First, communication analysis of the reduction is performed (step 91). Here, the communication that arises between processors is analyzed based on the data dependences in the loop and on the specified method of partitioning the array data.

[0055] Based on this communication analysis (step 91), it is determined whether communication is necessary and the array operand can be prefetched (step 92).

[0056] If the determination in step 92 is "Yes", a new loop is generated (step 93) for all reductions that target this array and for reductions that have no target array but use the array as a conditional-expression operand of the run-time mask. At the nesting level of the loop where prefetching is possible, this new loop contains, as assignment statements, the reductions on these operands. By making all reduction variables private variables, that is, by privatizing them, the loop performs only processor-local computation and can therefore be executed in parallel. Because the reduction semantics have been detected, only the local results obtained by the individual processors need to be communicated, instead of sending and receiving the individual elements of the array data for prefetching. The loop can be regarded as more efficient in this sense.

[0057] Here, a private variable is a variable whose definition and references are closed within an iteration of the loop.

[Equation 4]

[0058] In the above expression, the variable x is a private variable: it is defined in each iteration of the loop and referenced only within that same iteration. When a source program is parallelized, the iterations are divided and assigned to different processors. A private variable can be treated as a variable local to each processor, so its computation causes no interprocessor communication. Temporarily treating the variables used in a loop as per-processor private variables in this way is important because it guarantees that the parallel computation across iterations is performed correctly and it reduces communication overhead.

[0059] If the determination in step 92 is "No", that is, if communication is not necessary or the array operand cannot be prefetched, the reduction variables of the original loop are privatized (step 94). The reduction operations then do not impair the parallelism of the loop. Consequently, if the loop still cannot be parallelized, the cause lies in the data dependences of other assignment statements in the loop; this occurs only when the loop itself has no parallelism. If the loop can be parallelized, a communication call for the global computation of the reduction is inserted outside the nesting level of that loop.
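
The effect of privatizing reduction variables and placing the global computation outside the loop can be sketched in Python. This is a sequential simulation under stated assumptions, not the generated object code: the function names, the block partitioning of the iteration range, and the one-dimensional input are all hypothetical simplifications.

```python
def local_pass(values, lo, hi):
    """One processor's share of the loop, using private reduction variables."""
    _total = 0             # private sum, initialised to the identity of "+"
    _best = float("inf")   # private minimum, initialised to a largest value
    for i in range(lo, hi):
        _total += values[i]
        if abs(values[i]) < abs(_best):
            _best = values[i]
    return _total, _best


def combine(partials):
    """Global step placed outside the loop: merge the per-processor results."""
    total = sum(t for t, _ in partials)
    best = min((b for _, b in partials), key=abs)
    return total, best


def parallel_reduce(values, nproc):
    """Split the iteration range into nproc blocks and reduce each one locally."""
    n = len(values)
    bounds = [(p * n // nproc, (p + 1) * n // nproc) for p in range(nproc)]
    return combine([local_pass(values, lo, hi) for lo, hi in bounds])
```

Because each local pass touches only its own private variables, the passes are independent of one another; only the small tuples returned by local_pass have to be exchanged in the combining step.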

[0060] In the sample program, no array variable appears on a left-hand side in the loop, so analyzing the communication for the array operands rx and rm, which are the targets of the reductions, shows that no communication is needed. Therefore, without generating a new loop, all the reduction variables rxm, irxm, and sum in the original loop are privatized, which yields a parallelizable loop. In this case, the computation partitioning of the loop is based on the reduction target array rx or rm.

[0061] Note in particular that privatizing the reduction variables, and the loop parallelization that follows, become possible only because the construct has been detected as a reduction. With this privatization, the sample program is converted as follows. Every variable with a leading underscore (for example, _rxm) is a privatized reduction variable. The first three lines set the initial values of the privatized reduction variables. The loop is already in parallelized form, with its iteration range partitioned.

[0062]

[Equation 5]
      _rxm = largest_val
      _irxm = 0
      _sum = 0
      do 200 j = lb1, ub1
      do 200 i = lb2, ub2
        if (mod(i,2) = 1) then
          _sum = _sum + rm(i,j)                                          ... (a)
          if ((abs(rx(i,j)) < abs(_rxm)) .and. (mod(j,2) = 1)) then
            _rxm = rx(i,j)                                               ... (b)
            _irxm = i                                                    ... (c)
            _sum = _sum + rx(i,j)                                        ... (d)
          endif
        endif
  200 continue

[0063] Generation and optimization of reduction communication (Step 23)

Communication is generated for the parallelized reduction loop. It is very common in real source programs for a single loop to contain a plurality of reduction operations. In such a case, it is inefficient to perform interprocessor communication independently for each reduction. It is preferable to vectorize the local intermediate results that each processor has obtained for the plurality of reductions and to send and receive them in a single communication event, because this reduces the number of system-wide synchronization points at run time.

[0064] Reduction communication is performed in three steps: (1) determining the participating processor group from the partitioning information of the local computation loop of the reduction (local iteration set: LIS); (2) communicating among the processors that are members of the processor group and aggregating the results on one processor; and (3) transferring the final result calculated on that processor to the members of the processor group.

[0065] These three steps can be written as the following program, where gid denotes the processor group.

[Equation 6]
  gid = lis2gid(LIS)
  call reduce (gid, &reduce_function, reduction_variable)
  call broadcast (gid, reduction_variable)
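
The three steps can be simulated sequentially in Python to make the data flow concrete. The routine names lis2gid, reduce, and broadcast come from the text, but everything else here is an assumption made for illustration: the LIS is modelled as a dictionary from processor number to its iteration range, the first group member acts as the collecting processor, and reduce_to_root and sum_over_group are hypothetical helper names.

```python
import operator
from functools import reduce as fold


def lis2gid(lis):
    """Step (1): processors with a non-empty local iteration set form the group."""
    return sorted(p for p, iters in lis.items() if len(iters) > 0)


def reduce_to_root(gid, op, local):
    """Step (2): fold the members' local values onto the first member of the group.

    The group is assumed to be non-empty.
    """
    root = gid[0]
    return root, fold(op, (local[p] for p in gid))


def broadcast(gid, value, store):
    """Step (3): hand the final value back to every member of the group."""
    for p in gid:
        store[p] = value


def sum_over_group(lis, local):
    """Run the three steps for a sum reduction and return what each member holds."""
    gid = lis2gid(lis)
    _, total = reduce_to_root(gid, operator.add, local)
    store = {}
    broadcast(gid, total, store)
    return store
```

A processor whose iteration set is empty never enters the group, so it neither contributes a value nor receives the result.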

[0066] In general, when a plurality of reduction communications are merged, the processor partitioning differs from reduction to reduction. It therefore suffices to compute the union of the respective processor groups gid1 and gid2 and to perform the communication over the processor group of this union.

[Equation 7]
  gid = gid1 ∪ gid2

[0067] In addition to the pairs of local and global variables for the reductions, the processor group and the reduction operator of each reduction are also vectorized. Aggregating this vector invokes a multi-reduction function. That is, within a processor group, a different specified reduction operation is executed on each of the vectorized variables, which reduces the number of communications and makes execution efficient.

[0068] FIG. 10 is an example program listing of a multi-reduction function. The listing shows only the case of integer variables, but other variable types can be handled just as easily. Here, in1 is the local input data and in2 is the global input data. As the listing shows, the reduction operation on each vectorized element first performs a membership check against the specified processor group, and only processors that pass the check perform the operation. A processor that fails the check outputs the input data unchanged without performing the operation. Raising the data utilization within the packets of reduction communication events and reducing the number of communication events in this way improves the execution speed of the parallel computer.
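
The membership check described for the multi-reduction function can be sketched in Python. This is not the listing of FIG. 10, which is not reproduced in the text; it is a hypothetical reconstruction of the idea. The function names are invented, and it is an assumption that a non-member forwards the incoming (global) value in2 unchanged.

```python
import operator


def multi_reduce_step(me, groups, ops, in1, in2):
    """Combine processor `me`'s local vector in1 with the incoming vector in2.

    Element k is reduced with ops[k] only if `me` belongs to groups[k];
    otherwise the incoming value is forwarded unchanged (an assumption).
    """
    out = []
    for group, op, local, incoming in zip(groups, ops, in1, in2):
        out.append(op(local, incoming) if me in group else incoming)
    return out


def multi_reduce(gid0, groups, ops, identities, local):
    """Pass one vector through every processor of the union group gid0."""
    acc = list(identities)
    for p in sorted(gid0):
        acc = multi_reduce_step(p, groups, ops, local[p], acc)
    return acc


# Two reductions carried in one vector: a sum over gid1 and a maximum over gid2.
gid1, gid2 = {0, 1, 2}, {1, 2, 3}
local = {0: (1, 100), 1: (2, 5), 2: (3, 7), 3: (50, 6)}
result = multi_reduce(gid1 | gid2, [gid1, gid2], [operator.add, max],
                      [0, float("-inf")], local)
```

In this example processor 0 is outside gid2, so its value 100 does not enter the maximum, and processor 3 is outside gid1, so its value 50 does not enter the sum.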

[0069] The effect of merging a plurality of reduction communications in the same loop can be explained as follows. Let gid1 and gid2 be the processor groups for two reduction communications and let gid0 be their union. If the two reductions are not merged and the reduction communications are performed separately, the calls are as follows.

[0070]

[Equation 8]
  call reduce (gid1, SUM, s1, ...)
  call broadcast (gid1, s1, ...)
  call reduce (gid2, MAXVAL, s2, ...)
  call broadcast (gid2, s2, ...)

[0071] When the reductions are merged, on the other hand, the reduction communication is as follows.

[Equation 9]
  call reduce (gid0, (gid1, gid2), (SUM, MAXVAL), (s1, s2), ...)
  call broadcast (gid0, (gid1, gid2), (s1, s2), ...)

[0072] In the merged communication, the set of processor groups, the set of reduction operations, and the set of reduction variables are passed as arguments, and communication takes place within the union processor group gid0. Each processor performs a membership check against gid1 and gid2, which are passed as arguments, and decides whether to take part in each reduction and communication. For n processors, the "reduce" and "broadcast" in the calls above can communicate in log(n) steps, and can be regarded as a height reduction of a tree structure whose leaves are the processors.
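
The log(n) behaviour comes from combining values pairwise, level by level, as in a balanced tree. The Python sketch below counts the rounds needed; it is an illustration only, the name tree_reduce is hypothetical, and it assumes a non-empty input in which each round's pairs could be combined concurrently.

```python
def tree_reduce(values, op):
    """Combine a non-empty sequence pairwise; return (result, number_of_rounds)."""
    vals = list(values)
    rounds = 0
    while len(vals) > 1:
        # Each pair below could be combined by a different processor at the same time.
        nxt = [op(vals[k], vals[k + 1]) for k in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            nxt.append(vals[-1])  # an unpaired value moves up unchanged
        vals = nxt
        rounds += 1
    return vals[0], rounds
```

For 8 values the loop finishes in 3 rounds, and for 5 values it also finishes in 3 rounds, which matches the base-2 logarithm of the processor count rounded up.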

[0073] Let p, n, and m be the numbers of processors in gid0, gid1, and gid2, respectively. Since p ≤ n + m, the following holds when p, m, n > 1.

[Equation 10]
  log(p) ≤ log(n+m) ≤ log(nm) = log(n) + log(m)

[0074] Therefore, comparing the communication times, the following holds.

[Equation 11]
  reduce(gid0) ≤ reduce(gid1) + reduce(gid2)
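
The middle step of Equation 10, log(n+m) ≤ log(nm), relies on n + m ≤ nm, which is why the processor counts must exceed 1. A short derivation, written here as a worked check rather than as part of the original text:

```latex
\begin{align*}
nm - (n + m) &= (n-1)(m-1) - 1 \;\ge\; 0
  \quad \text{for integers } n, m \ge 2, \\
\log p \;\le\; \log(n+m) &\le \log(nm) \;=\; \log n + \log m .
\end{align*}
```

For example, with n = 4 and m = 8 the union has at most p = 12 processors, and log2(12) ≈ 3.58 is no greater than log2(4) + log2(8) = 5.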

[0075] When gid1 and gid2 are disjoint, the merged reduction communication requires at least one fewer broadcast to all processors. When gid1 and gid2 are the same processor group, the communication time is halved. In general, gid1 and gid2 partially share processors, and their relationship can be pictured as in FIG. 11. FIG. 11 shows the height reduction of a tree structure whose leaves are the processors when the reduction operations share processors.

[0076] If communication for the loop of the sample program were generated without optimization, a communication would be generated for each of the three reductions, giving three communication synchronization points. However, by taking the union of the processor groups obtained from the computation partitioning of each reduction and communicating over this enlarged processor group, the communications can be merged into one. In this example, the reductions are gathered in one loop and there is only one kind of computation partitioning, so no union needs to be taken. The processor group is determined from this computation partitioning at run time, and the information needed to execute the plurality of reductions is passed to the run-time routine as vectors. The communication is thereby optimized so that the processors belonging to this processor group handle the plurality of reductions in a single communication.

[0077] The following program shows the final form of the sample program. The last three lines are calls to run-time routines that, respectively, determine the processor group (gid) from the computation partitioning at run time, execute the reductions passed as three kinds of vectors among the processors belonging to this group, and broadcast the collected reduction results to all processors.

[0078]

[Equation 12]
      _rxm = largest_val
      _irxm = 0
      _sum = 0
      do 200 j = lb1, ub1
      do 200 i = lb2, ub2
        if (mod(i,2) = 1) then
          _sum = _sum + rm(i,j)                                          ... (a)
          if ((abs(rx(i,j)) < abs(_rxm)) .and. (mod(j,2) = 1)) then
            _rxm = rx(i,j)                                               ... (b)
            _irxm = i                                                    ... (c)
            _sum = _sum + rx(i,j)                                        ... (d)
          endif
        endif
  200 continue
      gid = lis2gid (LIS)
      call reduce (gid, 3, (abs-minval, abs-minloc, sum), (rxm, irxm, sum), (_rxm, _irxm, _sum))
      call broadcast (gid, (rxm, irxm, sum))

[0079] In this embodiment, even reductions with complicated forms, which conventional methods could not detect because many conditional expressions appear in the reduction loop, can be detected accurately as reductions. Privatizing the reduction variables either transforms the loop so that it can be executed in parallel or, where that is not possible, ensures that the presence of the reduction is not what prevents loop parallelization. Furthermore, when a loop contains a plurality of reductions, vectorizing the communication calls reduces the number of communication events and enables efficient communication. The method of this embodiment is very effective for numerical computation programs, particularly algorithms that obtain approximate solutions by iterative methods.

[0080] Finally, the parallelizing compiling method of this embodiment may be stored in a storage medium as a compiler program.

[0081]

[Effect of the Invention] A parallelizing compiler using the present invention can accurately extract reductions, a loop pattern that appears frequently in source programs, from a source program and generate an object program that can be executed efficiently in parallel. A parallel computer can therefore execute the object program compiled in this way at high speed.

[Brief Description of the Drawings]

FIG. 1 is a configuration diagram of a parallel computer.

FIG. 2 is a basic operation flow diagram from the detection of a reduction to the optimization of reduction communication.

FIG. 3 is an operation flow diagram for the detection of a reduction.

FIG. 4 is a diagram showing the relationship between the assignment statements of a loop body and the expression tree of conditional expressions generated from them.

FIG. 5 is an expression tree for the conditional expressions of the sample program.

FIG. 6 is an example of an expression tree of an assignment statement.

FIG. 7 is an expression tree for the assignment statements of the sample program.

FIG. 8 is an expression tree showing the data dependences of the sample program.

FIG. 9 is an operation flow diagram for the loop conversion of a reduction.

FIG. 10 is an example program listing of a multi-reduction function.

FIG. 11 is a diagram showing the height reduction of a tree structure whose leaves are the processors when the reduction operations share processors.

Continuation of front page

(72) Inventor: Hideaki Komatsu, c/o Tokyo Research Laboratory, IBM Japan, Ltd., 1623-14 Shimotsuruma, Yamato-shi, Kanagawa

Claims

1. A parallelizing compiling method for translating a source program to generate an object program for a parallel computer having a plurality of processors, the method comprising the steps of: analyzing the structure of a conditional expression that masks an assignment statement existing in a loop in the source program; analyzing the structure of the assignment statement existing in the loop; merging the structures of the assignment statements based on the data dependence of variables; and detecting a reduction based on the structure of the conditional expression and the merged structure of the assignment statements, with reference to the data dependence of the variables and the operators in the assignment statements.

2. The parallelizing compiling method according to claim 1, further comprising the step of converting the reduction into a loop in which a plurality of the processors can execute in parallel.

3. The parallelizing compiling method according to claim 1 or 2, wherein, in the step of detecting the reduction, when the right-hand side of the assignment statement is an expression, the reduction is determined by referring to the data dependence of a recurrence variable among the variables and to whether the operators in the assignment statement satisfy the commutative and associative laws and are of the same type.

4. The parallelizing compiling method according to claim 2, wherein, in the step of detecting the reduction, when the right-hand side of the assignment statement is a single variable, the reduction is determined based on the structure of the conditional expression.

5. The parallelizing compiling method according to claim 1, 2, 3 or 4, wherein the step of analyzing the structure of the conditional expression generates an expression tree in which each of the conditional expressions is associated with a node and each of the assignment statements is associated with a leaf.

6. The parallelizing compiling method according to claim 5, wherein the step of analyzing the structure of the assignment statement generates, for each leaf of the expression tree generated in the step of analyzing the structure of the conditional expression, an expression tree in which each element forming the assignment statement is associated with a node.

7. The parallelizing compiling method according to claim 1, further comprising the step of converting a reduction variable for controlling the detected reduction into a private variable for each processor.

8. A parallelizing compiling method for translating a source program to generate an object program for a parallel computer having a plurality of processors, the method comprising the steps of: detecting a reduction controlled by a reduction variable; converting the reduction variable into a private variable for each processor so that each processor performs the calculation in parallel to obtain a local result; and generating reduction communication to aggregate the local results calculated in parallel by the processors.

9. The parallelizing compiling method according to claim 8, wherein, when a plurality of the reductions are present in a single loop, the step of generating the reduction communication vectorizes and communicates the local results of the plurality of reductions calculated by each processor in order to aggregate them.

10. A medium storing a compile program using the parallelizing compiling method according to any one of claims 1 to 9.

11. A parallel computer that executes an object program compiled by the parallelizing compilation method according to any one of claims 1 to 9.