JPH02132526A

JPH02132526A - Process leveling method and automatic paralleling and compiling method for parallel computer

Info

Publication number: JPH02132526A
Application number: JP28566088A
Authority: JP
Inventors: Shigeo Ihara; 茂男井原; Giichi Tanaka; 義一田中
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1988-11-14
Filing date: 1988-11-14
Publication date: 1990-05-22

Abstract

PURPOSE: To shorten the overall computing time of all processor elements (PEs) by parallelizing a sequentially executed program and automatically eliminating the uneven PE loads that arise at run time or from uneven static allocation. CONSTITUTION: The inner-loop index computations are divided among the PEs, and the computing time of each PE is assumed to show some correlation with the outer-loop index K. Under these conditions, the elapsed-time information obtained by running the inner loop divided across the PEs is saved for the preceding value of K. The allocation of inner-loop indices to the PEs for the next index K is then adjusted on the basis of the saved information. This eliminates the dynamic unevenness in per-PE elapsed time caused by the success or failure of an IF statement for index (K + 1), and the elapsed time over all PEs is shortened.

Description

[Detailed Description of the Invention]

[Field of Industrial Application]

The present invention relates to parallel computer systems, and in particular to a method of generating an object program suitable for parallel execution from a source program written in a sequentially executed high-level language.

[Conventional technology]

Conventionally, in parallel processing systems such as multiprocessors, the user interface has required the user to write explicit instructions in a sequential source program for the means of parallelization, task activation, synchronization, and so on. ACM-0-8979 1174-1-12185-0107, "A Data Flow Approach to Multitasking on CRAY X-MP Computers", describes multitasking in which four vector processors operate in parallel, and how the user specifies it. According to that work, the system provides a library for controlling task activation and synchronization, and the user writes calls to it in the FORTRAN program. In addition, the user must tell the compiler how to parallelize each loop by means of comment-style control statements.

On the other hand, in a parallel processing system such as the one shown in FIG. 5, transmission processing and reception processing are performed independently. This type of system is described in detail in Japanese Patent Application No. 61-182361. The system is briefly described below.

(1) The system consists of a host computer 121 and a parallel processing unit 122. The parallel processing unit 122 in turn consists of a plurality of processors 123 and a network 124 capable of transferring data between arbitrary processors.

(2) Each processor 123 consists of a local memory 125 that holds programs and data, an instruction processing unit 126 that sequentially reads instructions from the local memory and executes them, a transmitting unit 127, and a receiving unit 128.

(3) Data transmission is performed by executing a send instruction (Send). When a Send instruction is decoded, the destination processor number, the data identifier, and the data are taken from the registers specified by its operands and set in register 132 of the transmitting unit 127. The three pieces of information in register 132 are sent to the network 124 as a message. A message on the network is taken into the receive buffer 129 in the receiving unit of the processor indicated by the destination processor number in the message, as a pair consisting of a data identifier 130 and data 131.

(4) Data reception is performed by executing a receive instruction (Receive). When a Receive instruction is decoded, the identifier to search for is taken from the register specified by its operand and sent to the receiving unit 128. The receiving unit 128 searches the receive buffer 129 for a data identifier that matches the search identifier. If there is none, it waits until a matching data identifier arrives; if there is one, it reports this to the instruction processing unit, which then takes in the corresponding data.
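The Send and Receive behavior in items (3) and (4) can be sketched in software as follows. This is only an illustrative model, not the hardware of FIG. 5: the class and function names (`PE`, `Network`, `receive`) are hypothetical, the receive buffer is modeled as a queue of (identifier, data) pairs, and it is assumed that a matched entry is removed from the buffer once it is taken in. Where the real receiving unit would wait for a matching identifier, the sketch simply returns `None`.

```python
from collections import deque


class PE:
    """Toy processor element holding a receive buffer of (identifier, data) pairs."""

    def __init__(self, number):
        self.number = number
        self.receive_buffer = deque()


class Network:
    """Toy network that delivers messages by destination processor number."""

    def __init__(self, pes):
        self.pes = {pe.number: pe for pe in pes}

    def send(self, dest, identifier, data):
        # A message carries three items: destination number, identifier, data.
        # The identifier and data are stored as a pair in the destination buffer.
        self.pes[dest].receive_buffer.append((identifier, data))


def receive(pe, identifier):
    """Return the data whose identifier matches, or None if it has not arrived."""
    for entry in list(pe.receive_buffer):
        if entry[0] == identifier:
            pe.receive_buffer.remove(entry)  # assumption: entry is consumed
            return entry[1]
    return None  # the hardware described above would wait here instead
```

Matching on the data identifier rather than on arrival order is what lets transmission and reception proceed independently: a PE can ask for a specific item regardless of what else has already arrived.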

Consider running, on a computer consisting of a host computer that controls the whole machine and processor elements (PEs) that can each compute or send and receive data independently, programs of the kind widely used in industry for problems such as electron transport in devices, neutron scattering, or the finite element method.

In such computations, even if programs, data, and tasks are initially assigned so that the amounts of computation and communication are equal across the PEs, the work may become uneven across PEs during execution because it includes processing that is determined only at run time. The resulting imbalance between PEs lengthens the time the PEs spend waiting for one another to finish, and thus lengthens the elapsed time of the whole computation.

Meanwhile, for static task allocation in particular, a method such as that of J. Ramanujan et al., "Task Allocation by Simulated Annealing", Third International Conference on Supercomputing 88 (ICS 88), appears to be effective. However, the technique used there has not been inserted into the object code of a compiler that automatically parallelizes a sequential compiled language for a parallel computer, nor has it been used at run time to eliminate uneven PE loads arising dynamically during execution. Moreover, from the standpoint of automatic parallelization, there has been no effective means of eliminating dynamic run-time unevenness in PE loads.

[Problem to be solved by the invention]

The conventional techniques described above give no consideration, for hardware capable of parallel processing, to executing parallel computations with a means of eliminating uneven PE loads that result from inadequate static allocation or that arise at run time, nor to incorporating such a means into automatic parallelization. As a result, resources could not be used effectively in a way that exploits the characteristics of the hardware. As for automatic parallelization, the sequential programs that users already had could not be run in parallel as they were; they had to be recoded for parallel processing and then debugged. The parallelization directives had to be changed every time the hardware characteristics changed, and the programs would not run on other systems, so the general applicability of user programs was impaired.

The object of the present invention is to provide, at compile time or load time, a means of correcting what the static task allocation described above cannot take into account and the uneven PE loads that arise at run time. A further object is to reduce the burden on the user described above, so that existing sequential programs are automatically parallelized without modification, and so that even when new code is written, efficient object code can be generated without the user having to consider the hardware characteristics or the means of eliminating the dynamic load imbalance that arises at run time.

[Means to solve the problem]

The above object is achieved as follows. The target is a parallel computer composed of a plurality of numbered processors: either a distributed-memory parallel computer with a host computer that controls the whole machine, processor elements that compute independently, and a data transfer mechanism between processor elements, or a shared-memory computer. In the compiling and loading process that generates, from a sequential source program written in a high-level language, object code for parallel execution on the parallel processors, the sequential source program is converted into a parallel program. After data and programs have been allocated to each PE in the flow of executing the parallel program, a means is provided for measuring the elapsed time each PE takes to execute its allocated processing, which may vary because of the original allocation or because of effects that arise dynamically at run time. Data and programs are then reallocated to the PEs so that this time becomes equal across the PEs, and the reallocation is performed dynamically by the host computer or the PEs while the allocated processing is being executed. The object is achieved by inserting program code that performs this task leveling method into the object code produced by the compiler for the source program.

[Effect]

According to the above method, a sequential program is parallelized, and uneven PE loads that arise at run time or from uneven static allocation are eliminated automatically and dynamically at run time. Users therefore no longer need to rewrite their accumulated program assets for parallel processors. The above object is thereby achieved.

[Example]

An embodiment of the method of the present invention, applied to a FORTRAN compiler for a parallel processor having a plurality of processor elements and communication paths for transferring data between the processors, is described below with reference to the drawings.

FIG. 1 shows an outline of the present invention. FIG. 1 takes as an example a system consisting of distributed-memory PEs and a host computer, shown in FIG. 2, as the target of the invention, and assumes a case in which the host computer plays the central role. However, in a computer system consisting of distributed-memory PEs and a host, the computation of part 1 in FIG. 1 can also easily be shared among the PEs, and the invention is also readily applicable to parallel computers in which the host computer and the PEs share memory.

FIG. 1 depicts the particular case of three PEs, but the invention applies to any number of PEs.

The present invention concerns a double loop such as that of FIG. 7 (deeper nesting is also possible) whose operation count arises at run time and is not uniform across the PEs. When, for such a loop, the amount of inner-loop computation on each PE can be regarded as a slowly varying function of the outer-loop index, the unevenness in computation across the PEs is reduced as follows. The elapsed time on each PE that results from the allocation of the inner loop to the PEs is obtained for the previous outer-loop index. Then, for the next outer-loop index, data are exchanged between the host computer and the PEs, or between PEs, and the tasks handled by each PE are adjusted so that the elapsed times of the inner-loop computation on the PEs become equal. This shortens the total processing time of the double loop.

First, the principle of the invention is outlined using a concrete example. The outer K loop 30 requires the inner I loop to be computed many times. If the inner I-loop indices are distributed evenly among the PEs, the program on each PE becomes as shown in FIG. 8. However, because of the IF statements 41, 41', and 41'', the execution time on each PE is determined dynamically by whether the IF condition holds, so the elapsed computation times vary from PE to PE. If VAL(I, K), used in the IF statement, is a random quantity that does not depend on K, there is no correlation between VAL(I, 1) for the first K index and VAL(I, 2) for the second, and the variation in elapsed time across the PEs is random. If, however, VAL(I, K) is a function of K that changes slowly as K changes, there is some correlation between VAL(I, K) for one K index and VAL(I, K+n) (n = 1, 2, 3, ...) for subsequent K indices, and the variation in elapsed time across the PEs is correlated as well.

In this case, the elapsed time τ_m (m is the PE number) taken when the inner I-loop computation is divided among the PEs is a function of the outer-loop index K, and varies with K as shown, for example, in FIGS. 6(a), (b), and (c). FIG. 6(a) shows the case in which τ_m is nearly constant with respect to the outer-loop index; FIG. 6(b) shows the case in which τ_m fluctuates randomly with K around a constant value; and FIG. 6(c) shows the case in which τ_m increases or decreases slowly with K.

Thus, when the inner-loop index computations are divided among the PEs and the elapsed time on each PE shows some correlation with K, the elapsed-time information obtained by running the inner loop divided across the PEs for a previous value of K is saved. For the next outer-loop index K, the allocation of inner indices to the PEs is adjusted on the basis of the saved information. This eliminates the dynamic unevenness in per-PE elapsed time caused by the success or failure of the IF statement for index K+1, and shortens the elapsed time over all PEs. This is the principle of the present invention.
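The principle can be illustrated with a small sketch that turns the elapsed times measured for one value of K into an index allocation for the next. It is not the procedure of FIGS. 3 and 4. The function name `rebalance` is hypothetical, and the sketch assumes that the cost per inner-loop index within each PE's share is roughly uniform and carries over to K+1, which is a simplification of the slow-variation condition stated above. All counts and times are assumed to be positive.

```python
def rebalance(counts, times):
    """Suggest how many inner-loop indices each PE should handle at K+1.

    counts[m]: number of inner-loop indices PE m handled for the previous K.
    times[m]:  elapsed time PE m measured for that work (must be > 0).
    """
    total = sum(counts)
    # Indices processed per unit time on each PE during the previous K.
    rates = [c / t for c, t in zip(counts, times)]
    rate_sum = sum(rates)
    # Shares proportional to the observed rates equalize the predicted times.
    new_counts = [int(total * r / rate_sum) for r in rates]
    # Hand out the indices lost to rounding, fastest PEs first.
    leftover = total - sum(new_counts)
    order = sorted(range(len(rates)), key=lambda m: rates[m], reverse=True)
    for m in order[:leftover]:
        new_counts[m] += 1
    return new_counts
```

For example, three PEs that each handled 100 indices and took 1, 2, and 4 time units would be given shares in the ratio 4 : 2 : 1 for the next K, so the PE that was slowest receives the fewest indices.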

Next, the relationship between the number of times the elapsed time of each PE is measured for leveling and the number of outer-loop iterations is described.

When τ_m is nearly constant with respect to the outer-loop index K, as in FIG. 6(a), it is more efficient to measure the per-PE elapsed time of the inner-loop computation only once, at the start, than to collect the same information repeatedly. When τ_m fluctuates randomly with K around some value, as in FIG. 6(b), processing while taking a cumulative average of each PE's elapsed time over the K indices smooths out the fluctuations and makes it easier to find the target value for adjusting the allocation of inner indices to the PEs. Even in this case, however, it is more efficient to examine only the first few steps of the K loop, set the target value, and then either stop measuring the per-PE elapsed time for later K indices or measure it only once every several iterations, since this avoids collecting the same information repeatedly. If the user knows these properties in advance, unnecessary computation can therefore be avoided by giving directives in the manner of existing compiler options. In a case such as FIG. 6(c), where finer-grained leveling is needed, the total processing time is reduced further by collecting the per-PE inner-loop processing time for every outer-loop index and updating the optimization target each time. The discussion below therefore assumes a case such as FIG. 6(c), but it is readily applicable to the general case as well.
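The cumulative averaging mentioned for the case of FIG. 6(b) can be sketched as an incremental mean that is updated once per outer-loop index. The class name `RunningAverage` and its interface are hypothetical and only illustrate the idea.

```python
class RunningAverage:
    """Cumulative average of per-PE elapsed times over successive K indices."""

    def __init__(self, n_pe):
        self.steps = 0
        self.mean = [0.0] * n_pe

    def update(self, times):
        """Fold in the elapsed times measured for one K; return the new means."""
        self.steps += 1
        # Incremental form of the mean: no history of past K values is stored.
        self.mean = [m + (t - m) / self.steps for m, t in zip(self.mean, times)]
        return self.mean
```

Under this sketch, measuring only every few iterations simply means calling `update` less often, while the case of FIG. 6(c) would use the latest measurement directly instead of the average.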

Furthermore, when the processing needed to equalize the PEs' workloads requires several iterations, the user can also specify the ratio between the number of such iterations and the number of elapsed-time measurements.

The number of outer-loop iterations, the number of times each PE's elapsed time is measured, and the number of iterations used to decide the task division for equalization can be specified by the user. Alternatively, the system can easily be made to predict and choose them on the basis of a small number of K values.

Next, the respective roles of the host computer and the PEs in this processing are described.

Here it is assumed that at K=1 of the outer DO loop the elapsed times of the PEs vary, as shown in FIG. 1. For speed and effective use of resources, the loads on the PEs should be made uniform, and the leveling should be carried out while avoiding concentration of data transmission and reception on particular PEs. The host computer 1 therefore performs the processing that determines dynamically which PEs should redistribute load, and by how much. To this end, each PE i measures its execution time τ_i for K=1 before the next outer-loop iteration K=2 begins and sends the result to the host computer. From the τ_i values of the PEs, the host computer determines the processing for each PE, for example by the dynamic distribution methods shown in FIGS. 3 and 4, and instructs each PE which data to send to which PE, or which data to receive from which PE. As the K loop proceeds through K=2, 3, ..., the inner I-loop computation can thus be made progressively more even across the PEs. Compared with running with the K=1 allocation unchanged and no leveling, the elapsed processing time is then shorter, for example, by Δ1 at K=3 and by Δ2 at K=4 in FIG. 1.

FIG. 2 shows the overall configuration of a compiler to which the invention is applied. The syntax analysis process 7 in FIG. 2 takes the FORTRAN source program 6 as input and converts it into the intermediate language 10. The intermediate process 8 takes this intermediate language 6 as input and transforms it, performing optimization and parallelization as far as possible. The automatic parallelization part of the intermediate process 8 is assumed to follow, for example, Iwasawa and two others, ... Chuken Patent No. 318705174, "Parallelizing compilation method".

In the present invention, after such processing, process 11 corrects insufficiently parallelized parts and analyzes the unevenness in PE loads that arises dynamically. It obtains the average τ, Nk of each PE and inserts into the intermediate code processing in which the PEs send and receive data for index I among themselves so that τ_m - τ_n = 0 for the next value of the outer-loop count. Process 18 determines how much data should be sent from which PE to which PE.

The processing that levels the computation is now described in detail. The part related to process 11 in FIG. 2 is explained assuming that the program of FIG. 7 is given as the input source program 6 of FIG. 2. It is assumed that process 8 in FIG. 2 has distributed parallelized code such as that of FIG. 8 to the PEs, dividing the inner-loop index I evenly among them.

Process 12 in FIG. 2 sets initial values and other necessary values for the scheme that dynamically distributes the inner-loop computation to the PEs. Process 13 performs initial processing such as dividing neighboring processors into groups, as described later, and is executed on the host computer.

Process 15 is performed on each PE: it obtains the amount of computation C_i on each PE (i is the PE number), for example from the number of executed instruction steps. Process 16 obtains the communication load from the amount of communication e_ij between the i-th and j-th PEs and the communication distance d_ij defined between PE-i and PE-j (for example, the Hamming distance). Process 17 obtains, on each PE, the elapsed time of the computation corresponding to that PE's inner I loop from the expected values of the computation amount C_i and the communication load Cc.
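The quantities in processes 15 to 17 can be illustrated with a short sketch. The text does not give the formulas, so the following are assumptions made only for illustration: the communication load of PE i is taken as the sum over j of e_ij times d_ij, d_ij is the Hamming distance between the binary PE numbers, and the elapsed time is a weighted sum of computation and communication with hypothetical conversion factors `t_op` and `t_comm`.

```python
def hamming(i, j):
    """Hamming distance between the binary representations of two PE numbers."""
    return bin(i ^ j).count("1")


def comm_load(i, traffic):
    """Assumed communication load of PE i: sum over j of e_ij * d_ij.

    traffic[i][j] is the amount of data exchanged between PE i and PE j.
    """
    return sum(e * hamming(i, j) for j, e in enumerate(traffic[i]))


def expected_time(i, compute, traffic, t_op=1.0, t_comm=1.0):
    """Assumed elapsed time of PE i: computation plus weighted communication."""
    return t_op * compute[i] + t_comm * comm_load(i, traffic)
```

With PE numbers interpreted as hypercube node addresses, the Hamming distance equals the number of links a message must cross, which is one reason it is a natural choice of distance; other topologies would need a different `hamming` replacement.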

Processes 16 and 17 are executed on each PE, while process 18 is described here, as an example, as being performed on the host computer. This part, too, can however be executed in parallel on the PEs.

Process 18 instructs each PE to send the host computer the numerical value of the elapsed time τ_i (i = 1, ..., Npe) of its computation corresponding to the inner I loop. On the basis of these values, the host computer calculates the allocation to the PEs of the computation corresponding to the inner I loop.

That is, it decides which I-loop indices' computation is to be reassigned from which PE to which PE. Process 18 is described in detail later.

Process 19 carries out the distribution of PE tasks decided in process 18. On instruction from the host computer, data are transferred between PEs, and the task assignment is changed so that each PE performs the processing that matches its data, for example by adjusting the initial value N1 and the final value N2 of the loop index in process 45 of FIG. 9.
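As an illustration of adjusting N1 and N2, the following sketch converts a per-PE count of inner-loop indices into contiguous index ranges. The function name `index_ranges` is hypothetical, and the sketch assumes that each PE is given one contiguous block of the I index, which the text does not require.

```python
def index_ranges(counts, first=1):
    """Turn per-PE index counts into contiguous (N1, N2) loop bounds.

    counts[m] is the number of inner-loop indices assigned to PE m.
    The bounds are inclusive, matching a FORTRAN-style DO I = N1, N2 loop.
    """
    ranges = []
    n1 = first
    for c in counts:
        n2 = n1 + c - 1
        ranges.append((n1, n2))
        n1 = n2 + 1
    return ranges
```

A PE whose share changes then needs only new values of N1 and N2, plus the array data for the indices that moved to it.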

The programs to be executed on the host computer and on the PEs as a result of process 11 in FIG. 2 are shown in FIG. 3 or FIG. 4, and in FIG. 9, respectively. With the present invention, for the program of FIG. 7, the compiler thus outputs as object code the code having the functions shown in FIG. 3 or FIG. 4 and in FIG. 9.

First, the grouping of PEs in process 13 of FIG. 2 is described, with reference to FIG. 10. Suppose there are eight PEs, 60 to 67 (the number of PEs in a parallel computer is usually even). PEs are chosen two at a time to form groups, and groups are in turn combined to form new groups; this operation is repeated.

For example, 60 and 61 are treated as one group, 70; 62 and 63 as 71; 64 and 65 as 72; and 66 and 67 as 73. These are the groups of level α = 1. Next, from the level-1 groups, 70 and 71 are combined into a new group 80, and 72 and 73 into 81; these are the groups of level α = 2. This sequence of operations stops when only two groups remain at some level; in the case of FIG. 10 it stops at α = 2. The apparent elapsed time of the inner I-loop computation for each group after grouping, which is needed later in process 18, is defined as follows. The elapsed time of a group at level α is obtained from the inner I-loop computation amounts of its two constituent units p and q at level α-1. Since the amount of processing per unit time is known to the system, these amounts are converted into elapsed time.
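The pairwise grouping can be sketched as follows. The function names are hypothetical, the sketch assumes that the number of PEs is a power of two so that every level pairs up exactly, and the group elapsed time is assumed to be the summed computation amount of the members divided by the known processing rate. The text says only that the group time is obtained from the computation amounts of the two constituent units, so the summation is an illustrative choice.

```python
def build_levels(pes):
    """Pair PEs, then pair the resulting groups, until two groups remain.

    Returns a list of levels; levels[0] is level 1. Each group is a list of
    the PE numbers it contains. Assumes len(pes) is a power of two.
    """
    levels = []
    current = [[pe] for pe in pes]
    while len(current) > 2:
        current = [current[k] + current[k + 1] for k in range(0, len(current), 2)]
        levels.append(current)
    return levels


def group_time(group, ops, ops_per_unit_time):
    """Assumed apparent elapsed time of a group: summed operations / rate."""
    return sum(ops[pe] for pe in group) / ops_per_unit_time
```

For the eight PEs 60 to 67 this yields four groups at level 1 and two groups at level 2, matching the stopping condition described above.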

In this way, from the elapsed time of the inner I-loop computation on each PE, the elapsed time of the inner I-loop computation can be obtained for every group at every level. This grouping is newly introduced in the present invention to keep the system from falling into a state that is not optimal but appears to the procedure to be an optimal solution (a quasi-equilibrium state in thermodynamic terms), which can arise when the host computer performs the optimization described later.

Process 18 is now described in detail.

In the first method, as shown in processes 181 to 184 of FIG. 3, the difference τ_n - τ_m in the elapsed time required for the inner I-loop computation is obtained between PE-n and PE-m, over all PE numbers. Process 181 computes the deviation Δτ of each PE's elapsed time from the average processing time, and the processing from process 182 onward is performed when Δτ is at or above a certain value. Whether the processing from 182 onward is performed only once per value of K or repeatedly is decided by a user directive. Here α is the group level number and Npe is the total number of PEs.

When the elapsed-time difference between the two PEs is negative, it is tentatively decided to send from PE-n to PE-m the array data whose argument is the I indices corresponding to the amount of computation that matches the difference (process 186). The elapsed times of the PEs are then updated accordingly (process 187). These operations are carried out for all PEs.

Processes 182 to 188 are then repeated until the elapsed-time difference for every pair of PEs is almost zero.
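The deterministic pairwise procedure can be sketched in Python as below. This is an illustration under stated assumptions, not the patent's exact procedure: each inner-loop index has a known positive cost, one index is moved at a time, and the index chosen is the one whose cost is closest to half the gap between the two PEs. The names `level_pairwise`, `assign`, and `cost` are hypothetical.

```python
def level_pairwise(assign, cost, tol=1e-9, max_sweeps=100):
    """Plan index moves between PE pairs until no move narrows any gap.

    assign: dict PE -> list of inner-loop indices (modified in place).
    cost:   dict index -> elapsed time of that index (assumed known, > 0).
    Returns the planned moves as (index, from_pe, to_pe) tuples.
    """
    def tau(pe):
        return sum(cost[i] for i in assign[pe])

    moves = []
    pes = sorted(assign)
    for _ in range(max_sweeps):
        changed = False
        for n in pes:
            for m in pes:
                gap = tau(n) - tau(m)
                if n == m or gap <= tol:
                    continue
                # Moving index i changes the gap to gap - 2*cost[i], so it
                # narrows the gap only when 0 < cost[i] < gap.
                cands = [i for i in assign[n] if 0 < cost[i] < gap]
                if not cands:
                    continue
                best = min(cands, key=lambda i: abs(cost[i] - gap / 2))
                assign[n].remove(best)
                assign[m].append(best)
                moves.append((best, n, m))
                changed = True
        if not changed:
            break
    return moves


assign = {1: [1, 2, 3, 4, 5, 6], 2: [7, 8]}
cost = {i: 1.0 for i in range(1, 9)}
plan = level_pairwise(assign, cost)
```

The returned list plays the role of the tentative decisions: nothing is transferred while it is built, and the host can turn it into send/receive commands afterwards.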

As a result, process 189 can determine which array data, corresponding to which indices, should be sent from which PE to which PE to redistribute the tasks so that the elapsed times on the PEs become equal. In process 190, the host computer issues to each PE the send/receive commands for the data transfer and the accompanying commands to change the task.

The above is a scheme that deterministically equalizes the elapsed time of each PE. It can be applied in the same way when the PEs are grouped.

This scheme is particularly effective when the number of PEs is small and the elapsed processing times of the PEs in the inner I loop are to be equalized immediately for each K (K=1, K=2, and so on in the outer loop of FIG. 1), as in the K=4 case.

However, as the number of PEs grows, the number of PE combinations that send and receive data or tasks increases, and so does the processing load on the host computer. Moreover, if the load is heavily skewed toward a particular PE, for example at K=1, the processing for data transfer or task change also concentrates on that PE, and the elapsed time needed for that processing may itself become too long. A method is therefore described next that, although probabilistic, spreads the computational and communication load concentrated on a particular PE across many PEs.

This method resembles the Metropolis Monte Carlo method, in which particles are assumed to interact, the set of particles is regarded as the whole system, the system is brought close to thermal equilibrium at a given temperature, the temperature is then lowered, and the arrangement of the particles is changed and finally determined so that the free energy of the system becomes lowest.

In 181' of FIG. 4, the same processing as 181 of FIG. 3 is performed. 182' and 183' of FIG. 4 correspond to slowly cooling a system of many particles down to a temperature T_min, and α can be likened to a parameter that controls the cooling rate. T is the parameter being controlled. T_min, in turn, is the parameter that determines how far the elapsed computation times of the PEs are to be made uniform, in the same way that one decides how much deviation of a particle from its stable position is tolerated. Here the elapsed time required for the inner-loop index computation assigned to each PE is taken as the cost function C_i (i: PE number), and the quantity to be leveled is obtained from it by a sum over i = 1 to Npe.

Processes 184' to 194' group the PEs and examine their elapsed times at both fine and coarse granularity. In terms of the many-particle analogy, this avoids the state in which the arrangement of the particles apparently no longer changes. That is, these processes keep the system from falling into a state (a quasi-equilibrium state) in which, when only the fine-grained view is used, the tasks appear to the procedure to be leveled although they are not.

Processes 186', 187', and 188' use random numbers to choose which array data, for which index I, is sent from which PE group to which PE group. In process 186', a random number between 0 and 1 selects one PE or PE group at the current group level; suppose PE-m is selected. In process 187', a random number likewise selects one of the inner I indices held by the selected PE or PE group, together with the array data associated with it; call the selected index I'. In process 188', as in process 186', a random number selects the PE or PE group to which the data chosen in 187' is to be sent; suppose PE-n is selected. A PE within a PE group can be chosen arbitrarily at random. Processes 189' to 191' examine whether sending the data determined by the selected index from PE-m to PE-n increases or decreases Δτ.

In process 192', when $\Delta\tau < 0$, that is, when the cost function decreases, the array for index I' is moved from PE-m to PE-n and the task is changed accordingly. The task change is described later. When $\Delta\tau \ge 0$, that is, when the cost function increases, the following procedure is used so that the probability of moving the array for index I' from PE-m to PE-n is $e^{-\Delta\tau/(k_B T)}$, where $k_B$ is a constant. A random number r between 0 and 1 is compared with $e^{-\Delta\tau/(k_B T)}$, and if $r < e^{-\Delta\tau/(k_B T)}$, the array for index I' is moved from PE-m to PE-n and the accompanying task change is made. In process 194, the host computer instructs the PEs to perform the same processing as 190 in FIG. 3.
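The random selection and acceptance rule can be sketched in Python as follows. The sketch makes several assumptions that go beyond the text: individual PEs are selected rather than PE groups, the cost whose change gives Δτ is taken to be the sum of absolute deviations of the per-PE elapsed times from their mean, and the temperature is multiplied by α after a fixed number of trial moves. The name `anneal` and all default parameter values are hypothetical.

```python
import math
import random


def anneal(assign, cost, T=1.0, T_min=1e-3, alpha=0.9, kB=1.0,
           steps_per_T=200, rng=None):
    """Level per-PE loads with Metropolis acceptance and geometric cooling."""
    rng = rng or random.Random(0)
    pes = sorted(assign)

    def spread():
        # Assumed cost: sum of |tau_i - mean| over all PEs.
        taus = [sum(cost[i] for i in assign[p]) for p in pes]
        mean = sum(taus) / len(taus)
        return sum(abs(t - mean) for t in taus)

    while T > T_min:
        for _ in range(steps_per_T):
            m = rng.choice(pes)              # process 186': source PE
            if not assign[m]:
                continue
            idx = rng.choice(assign[m])      # process 187': index I'
            n = rng.choice(pes)              # process 188': destination PE
            if n == m:
                continue
            before = spread()
            assign[m].remove(idx)
            assign[n].append(idx)
            delta = spread() - before        # processes 189' to 191'
            # Accept if delta < 0, else with probability exp(-delta/(kB*T)).
            if delta >= 0 and rng.random() >= math.exp(-delta / (kB * T)):
                assign[n].remove(idx)        # rejected: undo the move
                assign[m].append(idx)
        T *= alpha                           # cooling step
    return spread()


# Sixteen unit-cost indices, all initially on PE 0 of four PEs.
assign = {p: [] for p in range(4)}
assign[0] = list(range(16))
cost = {i: 1.0 for i in range(16)}
final_spread = anneal(assign, cost)
```

Accepting some cost-increasing moves at high T is what lets the search leave arrangements that only look balanced; as T approaches T_min, such moves are almost never accepted and the allocation settles.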

If the above operations are performed, then after several iterations of the outer loop, as in FIG. 1, the elapsed processing time of the inner loop becomes nearly the same on all PEs; this is guaranteed by the correspondence with the physical law. The above is the series of processes by which the host computer changes the allocation to the PEs.

The object code that carries out process 18 of the present invention, that is, the procedure of FIG. 3 or FIG. 4, is sent to the host computer so that it is executed there. (Process 18 can, however, also be shared among the PEs.) FIG. 9 shows what object code is generated for the Q-th PE when the processing of the sequential program of FIG. 5 is shared among the PEs according to the present invention. The task change is described here. Process 45 performs the processing corresponding to 40' to 44' in FIG. 8; here it is written in the form SUBROUTINE A(N1, N2) to make explicit that the task is shared among the PEs.

Here N1 and N2 are the initial and final values of the DO statement.

For example, suppose first that SUBROUTINE A(1, 10) is executed on PE=1 in FIG. 6. Process 16 computes the elapsed time τ1, and process 51 sends the value of τ1 to the host computer. Process 52 waits for the host computer to finish process 18. Suppose process 18 then decides that PE-1 should execute the parts for I indices 1, 2, 3 and 5, 6, 7, 8, 9, 10, 11, and that the part for I=4 should be computed elsewhere. In that case, process 53 sends the data for I=4, V(4) and VAL(4,K), to another PE and receives the data for I=11, V(11) and VAL(11,K), from PE-2. The applicable system attaches a data identifier and issues a Send or Receive command. Process 19 then adjusts the values of N1 and N2 so that SUBROUTINE A(1, 3) and SUBROUTINE A(5, 11) are computed, and reassigns the task by changing what was set as CALL A(1, 10) in process 45 to CALL A(1, 3) and CALL A(5, 11). The task is described here in the form of a CALL statement and a SUBROUTINE, but the same processing is easily applied to other forms.
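The adjustment of N1 and N2 in this example can be reproduced with a short Python sketch. It assumes that the set of indices owned by a PE is collapsed into contiguous ranges, each of which becomes one call; the helper name `to_ranges` is hypothetical, while the index values and the `CALL A(N1,N2)` form come from the example.

```python
def to_ranges(indices):
    """Collapse a set of loop indices into contiguous (N1, N2) ranges."""
    ranges = []
    for i in sorted(indices):
        if ranges and i == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], i)   # extend the current range
        else:
            ranges.append((i, i))             # start a new range
    return ranges


owned = set(range(1, 11))   # PE-1 initially runs CALL A(1,10)
owned.discard(4)            # I=4 is sent to another PE
owned.add(11)               # I=11 is received from PE-2
calls = [f"CALL A({n1},{n2})" for n1, n2 in to_ranges(owned)]
```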

〔Effect of the invention〕

According to the present invention, a conventional sequential user program can be executed on a parallel processing system, without recoding, while the load on the PEs is leveled dynamically. In addition, because hardware resources are used effectively, object code with a short elapsed time and high effective efficiency can be generated automatically.

[Brief explanation of drawings]

FIG. 1 is a schematic diagram of the present invention. FIG. 2 is an overall diagram of the compiler and an outline of the present invention. FIGS. 3 and 4 are outline diagrams of one embodiment of the processing performed by the host or the PEs. FIG. 5 shows the parallel computer system used to explain the embodiment. FIG. 6 shows the relationship between the outer loop and the inner-loop processing time. FIG. 7 shows the source program used to explain the embodiment. FIG. 8 shows the parallelized code. FIG. 9 is an outline diagram of the processing performed on each PE. FIG. 10 shows the PE grouping procedure.

1: host computer; 11: dynamic leveling processing unit; 12: inner-loop PE distribution initialization unit; 13: PE grouping processing unit; 15, 16, 17: units that obtain the elapsed time of each PE from the computation and communication volume of the inner loop for the outer loop; 18: re-determination of the distribution of the inner loop to each PE; 19: readjustment of the inner-loop tasks by data communication and the like.

Claims

[Claims]

1. A processing leveling method for a parallel computer, for a so-called distributed-memory parallel computer that consists of a host computer controlling the entire machine and a plurality of numbered processor elements (PEs) that perform operations independently, and that uses an inter-processor data transfer scheme, wherein, in the compile-and-load process that generates object code for parallel execution on the parallel processors from a sequential source program written in a high-level language, the sequential source program is converted into a parallel-execution program, data and programs are allocated to each PE according to the processing flow of the parallel-execution program, the elapsed time that each PE takes to execute its assigned processing, whether arising from the original allocation or arising dynamically during execution, is then measured, and the host computer or the PEs dynamically reallocate data and programs to each PE, while the assigned processing is being executed, so that the variation in that time among PEs is evened out over all PEs.

2. An automatic parallelizing compilation method, wherein, at compile time, object code implementing the leveling method of claim 1 is inserted into the object code of the source program as object code for the host computer and the processors, and is output as the object code produced by the compiler.

3. A processing leveling method for a parallel computer, for a so-called shared-memory parallel computer that consists of a host computer controlling the entire machine and a plurality of numbered processor elements (PEs) that perform operations independently, wherein, in the compile-and-load process that generates object code for parallel execution on the parallel processors from a sequential source program written in a high-level language, the sequential source program is converted into a parallel-execution program, data and programs are allocated to each PE according to the processing flow of the parallel-execution program, the elapsed time that each PE takes to execute its assigned processing, whether arising from the original allocation or arising dynamically during execution, is then measured, and the host computer or the PEs dynamically reallocate data and programs to each PE, while the assigned processing is being executed, so that the variation in that time among PEs is evened out over all PEs.

4. An automatic parallelizing compilation method, wherein, at compile time, object code implementing the leveling method of claim 3 is inserted into the object code of the source program as object code for the host computer and the processors, and is output as the object code produced by the compiler.

5. A processing leveling method for a parallel computer, wherein, for a source program containing two or more loops or two or more iterative operations, the elapsed time on each PE for processing the inner DO loop or iterative operation divided among the PEs is obtained, and the unevenness in elapsed time arising on each PE is eliminated and leveled, for the elapsed times of all PEs, by one of the following: (1) the PEs are grouped two at a time, the grouped PEs are again grouped two at a time, and groups at several levels are created in this way; for each level, the elapsed times of the two PE groups belonging to the same group are compared, and how the data is divided within groups of that level is decided; (2) without grouping, pairs are formed among all PEs, their elapsed times are compared, and data is transferred between them until the elapsed time is equal for all PEs; or (3) two PEs are selected probabilistically, the difference in their elapsed times is taken, the probability of transferring between the PEs the amount of processing corresponding to the difference is determined stochastically, and the elapsed processing time on each PE is made uniform in a probabilistic sense.

6. A processing leveling method for a parallel computer, wherein, for a source program containing two or more loops or two or more iterative operations, once for each iteration of the outer loop or outer iterative operation, the elapsed time on each PE for processing the inner DO loop or iterative operation divided among the PEs is obtained, and the unevenness in elapsed time arising on each PE is eliminated and leveled by one of the following: (1) the PEs are grouped two at a time, the grouped PEs are again grouped two at a time, and this grouping is repeated to create groups at several levels; for each level, the elapsed times of the two PE groups belonging to the same group are compared, how the data is shared within groups of that level is decided, and it is instructed which PE is to send data to which PE and change its program; (2) without grouping, pairs are formed among all PEs, their elapsed times are compared exhaustively, and data is transferred between them until the elapsed time is equal for all PEs; or (3) two PEs are selected probabilistically, the difference in their elapsed times is taken, the probability of transferring between the PEs the data corresponding to the difference is determined stochastically, and the elapsed processing time on each PE is made uniform in a probabilistic sense.

7. A processing leveling method for a parallel computer, wherein, for a source program containing two or more loops or two or more iterative operations, using each iteration of the outer loop or outer iterative operation in step with the iterative part of the method described below, the elapsed time on each PE for processing the inner DO loop or iterative operation divided among the PEs is obtained, and the unevenness in elapsed time arising on each PE is eliminated and leveled by one of the following: (1) the PEs are grouped two at a time, the grouped PEs are again grouped two at a time, and this grouping is repeated to create groups at several levels; for each level, the elapsed times of the two PE groups belonging to the same group are compared, how the data is shared within groups of that level is decided, and it is instructed which PE is to send data to which PE and change its program; (2) without grouping, pairs are formed among all PEs, their elapsed times are compared exhaustively, and data is transferred between them until the elapsed time is equal for all PEs; or (3) two PEs are selected probabilistically, the difference in their elapsed times is taken, the probability of transferring between the PEs the data corresponding to the difference is determined stochastically so that it follows a certain distribution, and the elapsed processing time on each PE is brought close to uniform in a probabilistic sense.