JPH1091591A

JPH1091591A - Parallel processing method for hierarchical multiprocessor

Info

Publication number: JPH1091591A
Application number: JP8261325A
Authority: JP
Inventors: Yoshiaki Hiroya; 良彰廣谷
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1996-09-10
Filing date: 1996-09-10
Publication date: 1998-04-10
Anticipated expiration: 2016-09-10
Also published as: JP2962241B2

Abstract

PROBLEM TO BE SOLVED: To provide a method for efficiently processing in parallel, on a hierarchical multiprocessor, an intra-procedure loop composed of an inner loop and an outer loop. SOLUTION: The start value of the outer-loop variable is set in an inter-cluster communication register GCR0, and a plurality of clusters 1 are activated. The representative processor 11 of each cluster 1 references and updates the inter-cluster communication register GCR0 with a Fetch & Add instruction to acquire the value of the outer-loop variable that its own cluster is to handle. Each time it acquires such a value, it sets that value in an intra-cluster shared area IN and sets the start value of the inner-loop variable in an intra-cluster communication register LCR0. The representative processor 11 and the other processors 12-1 to 12-n of each cluster 1 reference and update the intra-cluster communication register LCR0 with the Fetch & Add instruction to acquire the value of the inner-loop variable that each is to handle, and execute, as one task, the iteration determined by that value and by the outer-loop value set in the intra-cluster shared area IN.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a method of processing an intra-procedure loop in a program in parallel on a plurality of processors, and in particular to a method of processing an intra-procedure loop composed of an inner loop and an outer loop in parallel on the processors that constitute a hierarchical multiprocessor.

[0002]

[Prior Art] One conventional method of processing an intra-procedure loop composed of an inner loop and an outer loop, such as the double DO loops that frequently appear in scientific computing programs, in parallel on the processors of a hierarchical multiprocessor made up of a plurality of clusters each containing a plurality of processors is disclosed in, for example, Japanese Patent Application Laid-Open No. 1-152571. In that method, the inner loop is expanded so that the nest becomes a single-loop iteration, the iterations of this single loop are divided equally by the total number of processors in all clusters, and each part is executed as one task on one processor (hereinafter referred to as the first conventional method).

[0003] On the other hand, for an ordinary tightly coupled multiprocessor in which a plurality of processors are connected one-dimensionally, Japanese Patent Application Laid-Open No. 3-218556 describes the following two methods of processing an intra-procedure loop consisting of a single loop in parallel on the processors.

[0004] In the first of these, the start value and the end value of the loop variable are set in a single inter-processor shared register. Each processor exclusively references and updates the inter-processor shared register; when the start value in the register is less than or equal to the end value, the processor increases the start value in the register by the per-iteration increment and executes the loop body with the start value it read (hereinafter referred to as the second conventional method).
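The second conventional method can be illustrated with a minimal Python sketch. It assumes that a lock-protected object stands in for the inter-processor shared register and that threads stand in for processors; the names `SharedRegister`, `claim`, and `run_self_scheduled` are hypothetical and do not come from the cited publication.

```python
import threading


class SharedRegister:
    """Hypothetical stand-in for the inter-processor shared register."""

    def __init__(self, start, end):
        self.start = start
        self.end = end
        self._lock = threading.Lock()

    def claim(self, step=1):
        """Exclusively read the start value and advance it; None when exhausted."""
        with self._lock:
            if self.start > self.end:
                return None
            value = self.start
            self.start += step
            return value


def run_self_scheduled(n_iterations, n_workers):
    reg = SharedRegister(1, n_iterations)
    done = [[] for _ in range(n_workers)]

    def worker(wid):
        while True:
            i = reg.claim()
            if i is None:
                return
            done[wid].append(i)  # the loop body for iteration i would run here

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


executed = run_self_scheduled(20, 4)
all_iterations = sorted(i for per_worker in executed for i in per_worker)
```

Every iteration is claimed exactly once, but which worker executes which iteration depends on timing, which is what lets a fast processor absorb the work of a slow one.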

[0005] In the other, the range of the loop variable is divided into sub-ranges, and the start value and the end value of each sub-range are set in the inter-processor shared register associated with each processor. Each processor repeatedly executes the loop body until it reaches the iteration count indicated by the end value stored in its own inter-processor shared register. When it has finished, it references the inter-processor shared registers of other processors and, if unprocessed iterations remain, transfers part of the range of the loop variable for those unprocessed iterations to its own inter-processor shared register and continues loop processing (hereinafter referred to as the third conventional method). As for which other processors' shared registers a processor references once it has executed up to the end value stored in its own register, there are two variants: referencing all of the remaining inter-processor shared registers, and referencing one or more predetermined inter-processor shared registers.
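The third conventional method can be sketched as a small, single-threaded Python simulation. Several points are assumptions made only for illustration: an idle processor scans all other registers in order (the first variant), it takes the upper half of the first non-empty range it finds (the source only says "part of the range"), and the per-round `speeds` model faster and slower processors. The function names are hypothetical.

```python
def split_ranges(first, last, n_procs):
    """Divide [first, last] into n_procs contiguous [start, end] ranges."""
    total = last - first + 1
    base, extra = divmod(total, n_procs)
    ranges, lo = [], first
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)
        ranges.append([lo, lo + size - 1])  # a range is empty when start > end
        lo += size
    return ranges


def steal(ranges, thief):
    """Move the upper half of another processor's remaining range to `thief`."""
    for victim, (lo, hi) in enumerate(ranges):
        if victim != thief and lo <= hi:
            mid = (lo + hi + 1) // 2      # victim keeps [lo, mid - 1]
            ranges[victim][1] = mid - 1
            ranges[thief] = [mid, hi]     # thief takes [mid, hi]
            return True
    return False


def simulate(first, last, speeds):
    """Run rounds; in each round processor p executes up to speeds[p] iterations."""
    ranges = split_ranges(first, last, len(speeds))
    executed = []
    while any(lo <= hi for lo, hi in ranges):
        for p, speed in enumerate(speeds):
            for _ in range(speed):
                if ranges[p][0] > ranges[p][1] and not steal(ranges, p):
                    break                 # nothing left anywhere for this processor
                executed.append(ranges[p][0])
                ranges[p][0] += 1
    return executed


order = simulate(1, 12, speeds=[3, 1, 1])
```

The scan inside `steal` is the cost that grows with the number of processors: an idle processor may have to inspect many registers before it finds one with work left.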

[0006]

[Problems to Be Solved by the Invention] When an intra-procedure loop composed of an inner loop and an outer loop is processed in parallel on the processors of a hierarchical multiprocessor, the approach of the first conventional method, which converts the nest into a single loop, divides it equally, and assigns each part to a processor, lacks generality because not every double loop can be converted into a single loop. Even when the conversion is possible, it incurs overhead. In addition, if processing on any one processor is delayed, no other processor can take over its share, so completion of the whole intra-procedure loop is delayed.

[0007] To solve these problems, one might apply the second conventional method to the first. That is, the double loop is converted into a single loop, the start value and the end value of its loop variable are set in one inter-processor shared register, and each processor in all clusters exclusively references and updates that register; when the start value in the register is less than or equal to the end value, the processor increases the start value by the per-iteration increment and executes the loop body with the start value it read. With this method, all processors take on work dynamically by referencing and updating one inter-processor shared register, so a delay on any one processor has little effect on the whole. However, every processor accesses the single inter-processor shared register each time it executes one iteration. Contention for the register, and data communication over the interconnection network that connects the clusters, therefore occur as many times as the loop is executed, which increases the execution time of the procedure.
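To make the cost of this combined scheme concrete, the following short Python calculation counts accesses to the single shared register. It is only a rough estimate under stated assumptions: each of the M × N iterations costs one successful access, and each processor makes one additional access that finds the range exhausted. The function name and the example sizes are hypothetical.

```python
def flat_register_accesses(M, N, total_procs):
    """Accesses to the one shared register when the M x N nest runs as a single loop."""
    successful = M * N          # one exclusive access per executed iteration
    exhausted = total_procs     # assumed: one final failing access per processor
    return successful + exhausted


accesses = flat_register_accesses(M=100, N=100, total_procs=32)
```

For a 100 × 100 nest on 32 processors this gives 10,032 accesses, all of which contend for the same register and, for processors outside the register's own cluster, cross the interconnection network.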

[0008] As another solution, one might apply the third conventional method to the first. That is, the range of the loop variable of the single loop obtained from the double loop is divided into sub-ranges, and the start value and the end value of each sub-range are set in the inter-processor shared register associated with each processor. Each processor repeatedly executes the loop body until it reaches the iteration count indicated by the end value stored in its own register; when it has finished, it references the registers of other processors and, if unprocessed iterations remain, transfers part of the range of the loop variable for those iterations to its own register and continues loop processing. With this method, a processor references other processors' shared registers only after finishing the iterations assigned to it, so contention for the inter-processor shared registers is less likely. However, although each processor is arranged to take over part of another processor's unprocessed work once it has finished its own share, completion of the whole intra-procedure loop still tends to be delayed when any one processor is delayed. One reason is that, in the variant where a processor references all of the remaining inter-processor shared registers after executing up to its own end value, it takes time to find a register with unprocessed work remaining when the number of processors is very large. The other is that, in the variant where a processor references one or more predetermined inter-processor shared registers, if the referencing processors are themselves delayed, the work cannot be taken over even when other processors have already finished.

[0009] An object of the present invention is therefore to provide a method by which an intra-procedure loop composed of an inner loop and an outer loop can be processed in parallel efficiently on a hierarchical multiprocessor.

[0010]

[Means for Solving the Problems] The present invention is a parallel processing method in which the processing of an intra-procedure loop composed of an inner loop and an outer loop is divided into a plurality of tasks, and the individual tasks are executed on the individual processors of a hierarchical multiprocessor in which a plurality of clusters, each composed of a plurality of processors, are connected by an interconnection network. The method is characterized as follows. The start value of the outer-loop variable is set in an inter-cluster communication register that can be referenced and updated from all clusters. One specific processor among the plurality of processors in each cluster exclusively references and updates the inter-cluster communication register to acquire the outer-loop variable value to be executed by its own cluster, sets it in an intra-cluster shared area that can be referenced by all processors in its own cluster, and sets the start value of the inner-loop variable in an intra-cluster communication register that can be referenced and updated by all processors in its own cluster. Each processor in each cluster exclusively references and updates the intra-cluster communication register of its own cluster to acquire the inner-loop variable value to be executed, and executes, as one task, the portion of the intra-procedure loop determined by the acquired inner-loop variable value and the outer-loop variable value set in the intra-cluster shared area.

[0011] Specifically, the method includes the following steps:
(a) setting the start value of the outer-loop variable in an inter-cluster communication register that can be referenced and updated from all clusters;
(b) one specific processor among the plurality of processors in each cluster exclusively acquiring the value of the inter-cluster communication register and increasing the stored value by the increment of the outer-loop variable;
(c) when the value acquired in step (b) is not greater than the end value of the outer-loop variable, the specific processor setting the acquired value in a first intra-cluster shared area that can be referenced by all processors in its own cluster, and setting the start value of the inner-loop variable in an intra-cluster communication register that can be referenced and updated by all processors in its own cluster;
(d) when the value acquired in step (b) is greater than the end value of the outer-loop variable, the specific processor setting an end-of-processing indication in a second intra-cluster shared area that can be referenced by all processors in its own cluster, and synchronizing among all clusters;
(e) after completion of step (c) or (d), synchronizing all processors within each cluster;
(f) after the synchronization of step (e) is established, each processor in each cluster referencing the second intra-cluster shared area of its own cluster;
(g) when the second intra-cluster shared area referenced in step (f) indicates end of processing, each processor in each cluster ending the processing of the intra-procedure loop;
(h) when the second intra-cluster shared area referenced in step (f) does not indicate end of processing, each processor in each cluster exclusively acquiring the value of the intra-cluster communication register of its own cluster and increasing that value by the increment of the inner-loop variable;
(i) when the value acquired in step (h) is not greater than the end value of the inner-loop variable, the processor that acquired the value executing, as one task, the portion of the intra-procedure loop determined by the acquired inner-loop variable value and the outer-loop variable value set in the first intra-cluster shared area; and
(j) when the value acquired in step (h) is greater than the end value of the inner-loop variable, synchronizing all processors within each cluster and returning to step (b).
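Steps (a) through (j) can be sketched as a thread-based Python simulation. This is an illustration under several assumptions, not the patented implementation: threads stand in for processors, a lock-protected `CommRegister` stands in for a communication register with an exclusive fetch-and-add, `threading.Barrier` replaces the register-based synchronization, a processor repeats step (h) after finishing a task in step (i), and both loops run from 1 with increment 1. All class and function names are hypothetical.

```python
import threading


class CommRegister:
    """Hypothetical model of a communication register with exclusive update."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def store(self, value):
        with self._lock:
            self._value = value

    def fetch_and_add(self, inc):
        with self._lock:
            old = self._value
            self._value += inc
            return old


class Cluster:
    def __init__(self, n_procs):
        self.lcr = CommRegister()                  # inner-loop variable register
        self.barrier = threading.Barrier(n_procs)  # intra-cluster synchronization
        self.outer_j = None                        # first intra-cluster shared area
        self.finish = False                        # second intra-cluster shared area


def run(M, N, n_clusters, procs_per_cluster):
    gcr = CommRegister(1)                                # step (a)
    global_barrier = threading.Barrier(n_clusters)       # inter-cluster synchronization
    clusters = [Cluster(procs_per_cluster) for _ in range(n_clusters)]
    tasks, tasks_lock = [], threading.Lock()

    def processor(cid, is_representative):
        c = clusters[cid]
        while True:
            if is_representative:
                j = gcr.fetch_and_add(1)                 # step (b)
                if j <= M:                               # step (c)
                    c.outer_j = j
                    c.lcr.store(1)
                else:                                    # step (d)
                    c.finish = True
                    global_barrier.wait()
            c.barrier.wait()                             # step (e)
            if c.finish:                                 # steps (f) and (g)
                return
            while True:
                i = c.lcr.fetch_and_add(1)               # step (h)
                if i > N:
                    break
                with tasks_lock:                         # step (i): task T(i, j)
                    tasks.append((i, c.outer_j, cid))
            c.barrier.wait()                             # step (j), then back to (b)

    threads = [threading.Thread(target=processor, args=(cid, p == 0))
               for cid in range(n_clusters) for p in range(procs_per_cluster)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return tasks


tasks = run(M=5, N=4, n_clusters=2, procs_per_cluster=3)
```

In this sketch the global register is touched only once per outer-loop value (plus one final failing access per cluster), while the per-iteration traffic stays on the cluster-local register. The barrier in step (j) keeps the representative from resetting the inner-loop register while other processors in the cluster are still claiming values from it.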

[0012] In addition, in the present invention, synchronization among all clusters is performed, with a specific processor of each cluster acting as its representative, by using a synchronization inter-cluster communication register that can be referenced and updated from all clusters; and synchronization of all processors within each cluster is performed by using a synchronization intra-cluster communication register that can be referenced and updated by all processors in that cluster.

[0013]

[Embodiments of the Invention] An example embodiment of the present invention will now be described in detail with reference to the drawings.

[0014] FIG. 1 is a block diagram showing an example configuration of a hierarchical multiprocessor to which the present invention is applied. The hierarchical multiprocessor of this example consists of a plurality of clusters 1 and an interconnection network 2 that connects them so that they can communicate with one another. One or more inter-cluster communication registers 3, which can be referenced and updated from all clusters 1 through the interconnection network 2, are connected to the interconnection network 2.

[0015] Each cluster 1 includes a plurality of processors 11 and 12-1 to 12-n, a shared memory 13 shared by them, and data transfer control means 15 that controls data transfer to and from other clusters through the interconnection network 2. During parallel processing of an intra-procedure loop, one of the processors in each cluster 1 operates as the representative processor. In FIG. 1, the representative processor is given reference numeral 11 and the other processors are given reference numerals 12-1 to 12-n to distinguish them. The shared memory 13 contains one or more intra-cluster communication registers 14 that can be referenced and updated by all processors in the cluster.

[0016] The inter-cluster communication registers 3 are used for synchronization and for passing frequently accessed data between clusters 1; other data is passed between clusters 1 via the shared memory 13, the data transfer control means 15, and the interconnection network 2. The intra-cluster communication registers 14 are used for synchronization and for passing frequently accessed data between the processors within each cluster 1; other data is passed between processors via the shared memory 13.

[0017] FIG. 2 shows the general form of an intra-procedure loop to be processed in parallel and how it is divided for parallel processing. In the figure, J is the loop variable (iteration) of the outer loop; here its start value is 1, its end value is M, and its increment is 1 (omitted in the figure). I is the loop variable of the inner loop; here its start value is 1, its end value is N, and its increment is 1 (omitted in the figure). T_{i,j} (Tij in the figure) is a generalization of the executable statements contained in the double DO loop. In the present invention, the outer loop is processed in parallel across clusters and the inner loop is processed in parallel within a cluster. That is, each of T_{1,1}, T_{2,1}, T_{3,1}, ..., T_{N,1}, T_{1,2}, T_{2,2}, T_{3,2}, ..., T_{N,2}, ..., T_{1,M}, T_{2,M}, T_{3,M}, ..., T_{N,M} is executed as one task, and tasks with the same value of the outer-loop variable J (for example, the set T_{1,1}, T_{2,1}, T_{3,1}, ..., T_{N,1}) are processed within the same cluster.
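The decomposition can be written out in a few lines of Python. The sketch assumes the loop bounds given above (both loops from 1, increment 1) and represents each task T_{i,j} as a tuple `(i, j)`; the function name `decompose` is hypothetical.

```python
def decompose(M, N):
    """Group tasks T(i, j) by outer-loop value j; each group is handled by one cluster."""
    return {j: [(i, j) for i in range(1, N + 1)] for j in range(1, M + 1)}


groups = decompose(M=3, N=4)
# groups[1] holds T(1,1), T(2,1), T(3,1), T(4,1): the tasks that share J = 1.
```

Each dictionary entry is the unit of work distributed between clusters, and the tuples inside an entry are the units distributed between the processors of one cluster.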

[0018] FIG. 3 shows an example of the operating environment when the processing of an intra-procedure loop composed of an inner loop and an outer loop, such as that shown in FIG. 2, is divided into a plurality of tasks and processed in parallel on the individual processors of the hierarchical multiprocessor of FIG. 1.

[0019] In FIG. 3, the same reference numerals as in FIG. 1 denote the same parts. GCR0 and GCR1 are inter-cluster communication registers, LCR0 and LCR1 are intra-cluster communication registers, and IN and finish are intra-cluster shared areas allocated in the shared memory 13.

[0020] The inter-cluster communication register GCR0 and the intra-cluster communication register LCR0 both hold loop variables. GCR0 holds the loop variable J of the outer loop of the intra-procedure loop, and LCR0 holds the loop variable I of the inner loop. GCR0 is initialized to 1, the start value of the outer-loop variable J, and is thereafter exclusively referenced and updated by the representative processor 11 of each cluster 1. LCR0 is initialized by the representative processor 11 of each cluster 1 to 1, the start value of the inner-loop variable I, and is thereafter exclusively referenced and updated by all processors 11 and 12-1 to 12-n in the same cluster.

[0021] The inter-cluster communication register GCR1 and the intra-cluster communication register LCR1 both hold synchronization variables. GCR1 holds the variable used to synchronize all clusters at once (inter-cluster barrier synchronization), and LCR1 holds the variable used to synchronize all processors within a cluster at once (intra-cluster barrier synchronization). GCR1 is initialized to 0 and, when the clusters synchronize, is exclusively referenced and updated by the representative processor 11 of each cluster 1. LCR1 is initialized to 0 and, when all processors in a cluster synchronize, is exclusively referenced and updated by all processors in that cluster.

[0022] The intra-cluster shared area IN holds the value of the outer-loop variable that the representative processor 11 of each cluster 1 acquired from the inter-cluster communication register GCR0. The area IN is referenced and updated by the representative processor 11 and referenced by the other processors 12-1 to 12-n.

[0023] The intra-cluster shared area finish holds a value indicating whether parallel processing of the intra-procedure loop in each cluster 1 should end. It is initialized to 0 and is set to 1 when the outer-loop variable value that the representative processor 11 of the cluster acquired from the inter-cluster communication register GCR0 exceeds the end value M. Each processor in each cluster 1 ends processing when the value of finish is 1.
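The registers and shared areas described for FIG. 3 can be summarized as a small Python data model. This is a sketch for orientation only: plain integers stand in for hardware registers, `None` marks values that the representative processor sets later, and the class names `ClusterState` and `Environment` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ClusterState:
    LCR0: Optional[int] = None  # inner-loop variable I; set by the representative processor
    LCR1: int = 0               # intra-cluster barrier synchronization variable
    IN: Optional[int] = None    # outer-loop value J currently assigned to this cluster
    finish: int = 0             # becomes 1 when no outer-loop value is left for the cluster


@dataclass
class Environment:
    n_clusters: int
    GCR0: int = 1               # outer-loop variable J, initialized to its start value 1
    GCR1: int = 0               # inter-cluster barrier synchronization variable
    clusters: List[ClusterState] = field(default_factory=list)

    def __post_init__(self):
        # One independent set of intra-cluster registers and shared areas per cluster.
        self.clusters = [ClusterState() for _ in range(self.n_clusters)]


env = Environment(n_clusters=4)
```

The split mirrors the access pattern in the text: `GCR0` and `GCR1` are touched only by representative processors, while the fields of each `ClusterState` are visible only inside one cluster.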

[0024] FIGS. 4, 5, and 6 are flowcharts showing an example of the processing executed by each processor of the hierarchical multiprocessor when an intra-procedure loop is processed in parallel. FIG. 4 shows the flow of task scheduling, that is, selecting the task to be executed by the processor itself, in other words the portion of the intra-procedure loop that the processor itself is to execute. FIG. 5 shows the flow of intra-cluster barrier synchronization, and FIG. 6 shows the flow of inter-cluster barrier synchronization. These programs are generated when the source program containing the intra-procedure loop is compiled, and the start values, end values, and increments of the inner and outer loops, the number of clusters, the number of processors per cluster, and so on are embedded in them.

[0025] The operation of this embodiment is described below with reference to the drawings.

[0026] When an intra-procedure loop such as that shown in FIG. 2 is processed in parallel on the hierarchical multiprocessor of FIG. 1, an execution environment such as that shown in FIG. 3 is set up: the start value 1 of the outer-loop variable J is set in the inter-cluster communication register GCR0, 0 is set in each of the synchronization registers LCR1 (intra-cluster) and GCR1 (inter-cluster), and 0 is set in the intra-cluster shared area finish (S1 in FIG. 4). The program that executes the processing from step S2 of FIG. 4 onward, the programs of FIGS. 5 and 6, and the data needed to execute the intra-procedure loop are then loaded into the shared memory 13 of each cluster 1, and each processor of each cluster 1 is started.

[0027] When started, each processor begins execution at step S2 of FIG. 4 and first determines whether it is the representative processor. In the case of FIG. 3, processor 11 in each cluster 1 is designated as the representative processor, and the other processors 12-1 to 12-n are not. Processor 11 therefore goes on to execute the processing from step S3 onward, while the remaining processors 12-1 to 12-n skip steps S3 to S8 and proceed to step S9.

[0028] On proceeding to step S3, the representative processor 11 of each cluster 1 exclusively references and updates the inter-cluster communication register GCR0 with a Fetch & Add instruction. The Fetch & Add instruction is the operation shown below.

[0029]
    F&A(CR, inc):
        temp ← CR;
        CR ← CR + inc;
        Return temp;
Here, CR is the communication register to be accessed and inc is its increment value. That is, the Fetch & Add instruction returns the value of the communication register and at the same time increments the register by the specified value (inc).

[0030] In this case, the communication register in the Fetch & Add instruction of step S3 is GCR0, and inc is set to 1, the increment of the outer-loop variable. Since the initial value of GCR0 is 1, the start value of the outer-loop variable J, and the Fetch & Add instruction is executed exclusively, the representative processor 11 that executes step S3 first among the representative processors of the clusters 1 obtains 1 as the return value, at which point the value of GCR0 becomes 2. The representative processor 11 that executes step S3 next therefore obtains 2 as the return value, at which point the value of GCR0 becomes 3. In this way, each representative processor 11 acquires the value of the outer-loop variable J that its own cluster is to process.

[0031] The representative processor 11 that has acquired a value of the loop variable of the outer loop next confirms that the value is not larger than the final value M of the outer loop (NO in S4), sets the acquired value in the intra-cluster shared area IN (S5), sets 1, the start value of the loop variable I of the inner loop, in the intra-cluster communication register LCR0 (S6), and proceeds to step S9. If the acquired value of the loop variable J of the outer loop is larger than the final value M (YES in S4), there is no outer-loop iteration left for its own cluster to execute, so the representative processor sets 1 in the intra-cluster shared area finish (S7), performs inter-cluster barrier synchronization (S8), and proceeds to step S9. The inter-cluster barrier synchronization is described later.
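As a rough sketch of steps S3 to S7, the branch taken by a representative processor might look like the following. The names `CommunicationRegister`, `Cluster`, and `representative_step` are hypothetical, a lock models the exclusive Fetch & Add, and the inter-cluster barrier of step S8 is deliberately left out.

```python
import threading
from dataclasses import dataclass, field


class CommunicationRegister:
    """Hypothetical software stand-in for a communication register."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_and_add(self, inc):
        with self._lock:  # the whole read-modify-write is exclusive
            temp = self._value
            self._value += inc
            return temp

    def store(self, value):
        with self._lock:
            self._value = value

    def read(self):
        with self._lock:
            return self._value


@dataclass
class Cluster:
    IN: int = 0       # intra-cluster shared area IN
    finish: int = 0   # intra-cluster shared area finish
    lcr0: CommunicationRegister = field(default_factory=CommunicationRegister)


def representative_step(gcr0, cluster, M):
    """Steps S3 to S7 for one representative; returns True if an outer value was claimed."""
    j = gcr0.fetch_and_add(1)   # S3: claim the next value of J
    if j > M:                   # S4: YES, nothing left for this cluster
        cluster.finish = 1      # S7 (the S8 inter-cluster barrier is omitted here)
        return False
    cluster.IN = j              # S5: publish J to the cluster
    cluster.lcr0.store(1)       # S6: reset the inner-loop register to start value 1
    return True
```

With `M = 2` and two clusters, the first two calls claim J = 1 and J = 2, and a third call sets `finish` to 1 for the calling cluster.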

[0032] In the intra-cluster barrier synchronization of step S9, each processor executes the processing shown in FIG. 5. First, it acquires the lock on the intra-cluster communication register LCR1, which is used for intra-cluster barrier synchronization (S21), and checks whether its value is 0 (S22). If the value of LCR1 is 0, the processor sets the number of processors in its own cluster in LCR1 (S23), subtracts 1 from that value (S24), and unlocks LCR1 (S25). If the value of LCR1 is not 0 (NO in step S22), the processor subtracts 1 from the value (S24) and unlocks LCR1 (S25). The processor then waits until the value of LCR1 becomes 0 (S26); when it does, the processor ends the intra-cluster barrier synchronization and proceeds to the processing that follows.

[0033] The intra-cluster communication register LCR1 is initialized to 0 in step S1 of FIG. 4, and steps S22 to S24 of FIG. 5 are executed exclusively among all processors in the cluster. Consequently, the processor that acquires the lock first finds that the value of LCR1 is 0, sets the number of processors in the cluster, subtracts 1, and proceeds to step S26. The processor that acquires the lock next finds that the value of LCR1 is not 0, subtracts 1, and proceeds to step S26. When the last processor of the cluster acquires the lock and subtracts 1, the value of LCR1 becomes 0, all processors of the cluster leave step S26 together, and the intra-cluster barrier synchronization ends.
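The counter-based barrier of steps S21 to S26 can be illustrated with a minimal sketch, assuming Python threads play the role of processors. The class name `CounterBarrier` and the helper `run_barrier_demo` are hypothetical; the spin wait yields with `time.sleep(0)` so that other threads can run.

```python
import threading
import time


class CounterBarrier:
    """Single-episode barrier following steps S21 to S26 (hypothetical sketch)."""

    def __init__(self, num_processors):
        self.num_processors = num_processors
        self.lcr1 = 0                      # initialized to 0, as in step S1
        self.lock = threading.Lock()

    def wait(self):
        with self.lock:                    # S21 lock, S25 unlock on exit
            if self.lcr1 == 0:             # S22
                self.lcr1 = self.num_processors   # S23: first arrival sets the count
            self.lcr1 -= 1                 # S24
        while True:                        # S26: spin until the counter reaches 0
            with self.lock:
                if self.lcr1 == 0:
                    return
            time.sleep(0)                  # yield to other threads while spinning


def run_barrier_demo(num_processors=4):
    barrier = CounterBarrier(num_processors)
    events = []

    def processor():
        events.append("arrive")
        barrier.wait()
        events.append("leave")

    threads = [threading.Thread(target=processor) for _ in range(num_processors)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events
```

The sketch exercises one barrier episode only. Reusing the same counter for back-to-back barriers would need extra care in software, because a thread that re-enters before the others have observed 0 would reset the counter.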

[0034] The wait in step S26 uses a spin lock, that is, repeated testing until the condition holds, but it can also be implemented with a suspend lock, that is, suspending execution until the condition holds.
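The suspend-lock alternative could look roughly like the sketch below, which swaps the spin wait for a `threading.Condition` so that waiting threads sleep until the counter reaches 0. The class `SuspendingBarrier` and the helper `run_suspend_demo` are hypothetical names, and the sketch covers a single barrier episode.

```python
import threading


class SuspendingBarrier:
    """Counter barrier whose wait suspends instead of spinning (hypothetical sketch)."""

    def __init__(self, num_processors):
        self.num_processors = num_processors
        self.count = 0                          # plays the role of LCR1
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            if self.count == 0:
                self.count = self.num_processors   # first arrival sets the count
            self.count -= 1
            if self.count == 0:
                self.cond.notify_all()          # last arrival wakes everyone
            else:
                # Suspend until the counter reaches 0; the lock is released while waiting.
                self.cond.wait_for(lambda: self.count == 0)


def run_suspend_demo(num_processors=4):
    barrier = SuspendingBarrier(num_processors)
    events = []

    def processor():
        events.append("arrive")
        barrier.wait()
        events.append("leave")

    threads = [threading.Thread(target=processor) for _ in range(num_processors)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events
```

Suspending trades wake-up latency for not occupying a processor while waiting, which is the usual reason to prefer it when waits are long.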

[0035] Each processor that has completed the intra-cluster barrier synchronization of step S9 in FIG. 4 refers to the value of the intra-cluster shared variable finish (S10). If the value is 1, the processor ends its processing; otherwise it proceeds to step S11 and the subsequent steps.

[0036] Each processor of each cluster 1 that has proceeded to step S11 exclusively references and updates the intra-cluster communication register LCR0 with a Fetch & Add instruction. That is, it obtains the current value of LCR0 and adds 1, the increment of the loop variable I of the inner loop, to the original value. LCR0 is initialized in step S6 to 1, the start value of the loop variable I of the inner loop, and the Fetch & Add instruction of step S11 is executed exclusively among all processors in the cluster. Therefore, the processor that executes step S11 first obtains 1 from LCR0, at which point the value of LCR0 becomes 2. The processor that executes step S11 next obtains 2 from LCR0, at which point the value of LCR0 becomes 3. In this way, each processor acquires the value of the loop variable I of the inner loop that it is to process.

[0037] Each processor that has acquired a value of the loop variable of the inner loop next confirms that the value is not larger than the final value N of the inner loop (NO in S12) and proceeds to step S13 to obtain the value of the intra-cluster shared area IN. Then, taking the value of IN as the value j of the loop variable J of the outer loop and the value obtained from the intra-cluster communication register LCR0 in step S11 as the value i of the loop variable I of the inner loop, it executes the portion T_{i,j} of the intra-procedure loop as one task (S14). After the task completes, the processor returns to step S11 and repeats the above processing. If the value obtained from LCR0 exceeds the final value N of the loop variable I of the inner loop in step S12, there is no inner-loop iteration left for the processor to execute, so it proceeds to step S15, performs the intra-cluster barrier synchronization shown in FIG. 5, returns to step S2, and repeats the above processing.
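The inner-loop self-scheduling of steps S11 to S14 can be sketched for one cluster and one fixed outer value j as follows. Python threads stand in for the processors of a cluster, and `CommunicationRegister`, `run_inner_loop`, and the `task` callback are hypothetical names; the barrier of step S15 is omitted.

```python
import threading


class CommunicationRegister:
    """Hypothetical software stand-in for the intra-cluster register LCR0."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_and_add(self, inc):
        with self._lock:  # exclusive read-modify-write
            temp = self._value
            self._value += inc
            return temp


def run_inner_loop(j, N, num_processors, task):
    """Run task(i, j) for i = 1..N, with each i claimed by exactly one thread."""
    lcr0 = CommunicationRegister(1)        # S6: start value of I

    def processor():
        while True:
            i = lcr0.fetch_and_add(1)      # S11: claim the next inner-loop value
            if i > N:                      # S12: YES, no work left
                return                     # (S15 barrier omitted in this sketch)
            task(i, j)                     # S13, S14: execute T_{i,j} as one task

    threads = [threading.Thread(target=processor) for _ in range(num_processors)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


executed = []
run_inner_loop(j=3, N=10, num_processors=4, task=lambda i, j: executed.append((i, j)))
```

Because values are claimed on demand, a slow thread simply claims fewer iterations; no static partition of the range 1..N is needed.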

[0038] As a result of the above processing, suppose for example that the representative processor 11 of a certain cluster 1 acquires the start value 1 of the loop variable J of the outer loop from the inter-cluster communication register GCR0 in step S3, and that the processors 11, 12-1 to 12-n in that cluster acquire 1, 2, ... as values of the loop variable I of the inner loop from the intra-cluster communication register LCR0 in step S11. The processors 11, 12-1 to 12-n then execute the tasks T_{1,1}, T_{2,1}, ... in parallel in step S14. Likewise, in another cluster that has acquired x (= 2, ..., M) from GCR0 as the value of the loop variable J of the outer loop, the processors 11, 12-1 to 12-n execute the tasks T_{1,x}, T_{2,x}, ... in parallel in step S14.

[0039] When the iterations of the inner loop for the currently acquired value of the loop variable J of the outer loop have been completed, all processors in the cluster perform the intra-cluster barrier synchronization shown in FIG. 5 in step S15 and return to step S2. The representative processor 11 then acquires the next value of the loop variable J of the outer loop to be processed, and the above processing is repeated.

[0040] If the value of the inter-cluster communication register GCR0 acquired in step S3 exceeds the final value M of the loop variable J of the outer loop, the representative processor 11 of that cluster sets 1 in the intra-cluster shared area finish (S7) and performs inter-cluster barrier synchronization (S8).

[0041] In the inter-cluster barrier synchronization of step S8, each representative processor 11 executes the processing shown in FIG. 6. First, it acquires the lock on the inter-cluster communication register GCR1, which is used for inter-cluster barrier synchronization (S31), and checks whether its value is 0 (S32). If the value of GCR1 is 0, the representative processor sets the number of clusters constituting the hierarchical multiprocessor in GCR1 (S33), subtracts 1 from that value (S34), and unlocks GCR1 (S35). If the value of GCR1 is not 0 (NO in step S32), it subtracts 1 from the value (S34) and unlocks GCR1 (S35). Each representative processor then waits until the value of GCR1 becomes 0 (S36); when it does, the representative processor ends the inter-cluster barrier synchronization and proceeds to the processing that follows.

[0042] The inter-cluster communication register GCR1 is initialized to 0 in step S1 of FIG. 4, and steps S32 to S34 of FIG. 6 are executed exclusively among the clusters. Consequently, the representative processor of the cluster that acquires the lock first finds that the value of GCR1 is 0, sets the number of clusters, subtracts 1, and proceeds to step S36. The representative processor that acquires the lock next finds that the value of GCR1 is not 0, subtracts 1, and proceeds to step S36. When the representative processor of the last cluster acquires the lock and subtracts 1, the value of GCR1 becomes 0, the representative processors of all clusters leave step S36 together, and the inter-cluster barrier synchronization ends.

[0043] The wait in step S36 uses a spin lock, that is, repeated testing until the condition holds, but it can also be implemented with a suspend lock, that is, suspending execution until the condition holds.

[0044] After this, as described above, all processors in each cluster 1 perform intra-cluster barrier synchronization (S9), find that the value of the intra-cluster shared area finish is 1, and end their processing.
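The overall flow of FIG. 4 (steps S1 to S15) can be simulated end to end with the sketch below. It rests on several assumptions: Python threads play the processors, the hypothetical `CommunicationRegister` models Fetch & Add with a lock, and the standard library's `threading.Barrier` is swapped in for the counter-based barriers of FIG. 5 and FIG. 6. The function name `run_nested_loop` and the `task` callback are likewise illustrative.

```python
import threading


class CommunicationRegister:
    """Hypothetical software stand-in for GCR0 and LCR0."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_and_add(self, inc):
        with self._lock:
            temp = self._value
            self._value += inc
            return temp

    def store(self, value):
        with self._lock:
            self._value = value


def run_nested_loop(M, N, num_clusters, procs_per_cluster, task):
    gcr0 = CommunicationRegister(1)                        # S1: start value of J
    inter_barrier = threading.Barrier(num_clusters)        # stands in for FIG. 6

    def processor(cluster, is_representative):
        while True:
            if is_representative:                          # S2
                j = gcr0.fetch_and_add(1)                  # S3
                if j <= M:                                 # S4: NO
                    cluster["IN"] = j                      # S5
                    cluster["lcr0"].store(1)               # S6
                else:                                      # S4: YES
                    cluster["finish"] = 1                  # S7
                    inter_barrier.wait()                   # S8
            cluster["barrier"].wait()                      # S9
            if cluster["finish"] == 1:                     # S10
                return
            while True:
                i = cluster["lcr0"].fetch_and_add(1)       # S11
                if i > N:                                  # S12: YES
                    break
                task(i, cluster["IN"])                     # S13, S14
            cluster["barrier"].wait()                      # S15, then back to S2

    threads = []
    for _ in range(num_clusters):
        cluster = {"IN": 0, "finish": 0,
                   "lcr0": CommunicationRegister(0),
                   "barrier": threading.Barrier(procs_per_cluster)}  # stands in for FIG. 5
        for p in range(procs_per_cluster):
            threads.append(threading.Thread(target=processor, args=(cluster, p == 0)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()


done = []
run_nested_loop(M=3, N=5, num_clusters=2, procs_per_cluster=3,
                task=lambda i, j: done.append((i, j)))
```

In this simulation every pair (i, j) with 1 ≤ i ≤ N and 1 ≤ j ≤ M is executed exactly once, regardless of how the threads interleave.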

[0045] As described above, according to this embodiment, the determination of the iterations each processor handles (task selection) and the synchronization are both performed in two stages, using the inter-cluster communication registers and the intra-cluster communication registers. This reduces access contention on the communication registers and allows efficient parallel processing.

[0046]

[Effects of the Invention] As described above, the present invention provides the following effects.

[0047] Because the double loop is processed in parallel without being converted into a single loop, the method is versatile in that it can also handle double loops that cannot be converted into a single loop, and it incurs no overhead from such a conversion.

[0048] All clusters take on work dynamically by referencing and updating a single inter-cluster communication register, and within each cluster all processors take on work dynamically by referencing and updating a single intra-cluster communication register. Therefore, even if the processing of some cluster, or of some processor within a cluster, is delayed, the effect on the whole is small.

[0049] The inter-cluster communication register is referenced and updated by only one specific processor in each cluster, and the intra-cluster communication register in each cluster is referenced and updated only by the processors in that cluster. This reduces the number of processors accessing any one inter-cluster or intra-cluster communication register and eases access contention.
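To make the contention argument concrete, the following back-of-the-envelope sketch counts Fetch & Add accesses under the flow of FIG. 4 and compares them with a hypothetical flat baseline in which every processor claims iterations of a single-loop version from one shared register. The baseline, the function name `fetch_and_add_counts`, and the parameter names are assumptions introduced for illustration, not figures from the publication.

```python
def fetch_and_add_counts(M, N, clusters, procs_per_cluster):
    """Count Fetch & Add accesses per register (illustrative model only)."""
    return {
        # GCR0: M successful claims plus one failing claim per representative.
        "gcr0_accesses": M + clusters,
        "gcr0_contenders": clusters,
        # All LCR0 registers together: for each outer value, N successful claims
        # plus one failing claim per processor of the cluster handling it.
        "lcr0_accesses_total": M * (N + procs_per_cluster),
        "lcr0_contenders": procs_per_cluster,
        # Hypothetical flat baseline: one register shared by every processor.
        "flat_accesses": M * N + clusters * procs_per_cluster,
        "flat_contenders": clusters * procs_per_cluster,
    }


counts = fetch_and_add_counts(M=100, N=100, clusters=4, procs_per_cluster=8)
```

Under these assumptions, the shared global register is touched 104 times by 4 contenders instead of 10032 times by 32 contenders, while the bulk of the traffic is spread over the per-cluster registers, each contended by only 8 processors.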

[0050] For the above reasons, an intra-procedure loop composed of an inner loop and an outer loop can be processed in parallel efficiently on a hierarchical multiprocessor.

[Brief description of the drawings]

FIG. 1 is a block diagram showing a configuration example of a hierarchical multiprocessor to which the present invention is applied.

FIG. 2 is a diagram showing the general form of an intra-procedure loop subject to parallel processing and how it is divided for parallel processing.

FIG. 3 is a diagram showing an example of the operating environment in which the processing of an intra-procedure loop composed of an inner loop and an outer loop is divided into a plurality of tasks and processed in parallel by the individual processors constituting the hierarchical multiprocessor.

FIG. 4 is a flowchart showing an example of the processing, among the processing executed by each processor constituting the hierarchical multiprocessor during parallel processing of an intra-procedure loop, that selects the portion to be executed by that processor.

FIG. 5 is a flowchart showing an example of intra-cluster barrier synchronization, among the processing executed by each processor constituting the hierarchical multiprocessor during parallel processing of an intra-procedure loop.

FIG. 6 is a flowchart showing an example of inter-cluster barrier synchronization, among the processing executed by each processor constituting the hierarchical multiprocessor during parallel processing of an intra-procedure loop.

[Explanation of symbols]

1 ... Cluster
2 ... Interconnection network
3, GCR0, GCR1 ... Inter-cluster communication registers
11 ... Representative processor
12-1 to 12-n ... Other processors
13 ... Shared memory
14, LCR0, LCR1 ... Intra-cluster communication registers
15 ... Data transfer control means
IN, finish ... Intra-cluster shared areas

Claims

[Claims]

1. A parallel processing method in a hierarchical multiprocessor, in which the processing of an intra-procedure loop composed of an inner loop and an outer loop is divided into a plurality of tasks and each task is executed by the individual processors constituting a hierarchical multiprocessor in which a plurality of clusters, each composed of a plurality of processors, are connected by an interconnection network, wherein: the start value of the loop variable of the outer loop is set in an inter-cluster communication register that can be referenced and updated from all clusters; a specific one of the plurality of processors in each cluster exclusively references and updates the inter-cluster communication register to acquire the loop variable of the outer loop to be executed in its own cluster, sets it in an intra-cluster shared area that can be referenced from all processors in its own cluster, and sets the start value of the loop variable of the inner loop in an intra-cluster communication register that can be referenced and updated from all processors in its own cluster; and each processor in each cluster exclusively references and updates the intra-cluster communication register in its own cluster to acquire the loop variable of the inner loop to be executed, and executes, as one task, the processing portion of the intra-procedure loop determined by the acquired loop variable of the inner loop and the loop variable of the outer loop set in the intra-cluster shared area.

2. A parallel processing method in a hierarchical multiprocessor, in which the processing of an intra-procedure loop composed of an inner loop and an outer loop is divided into a plurality of tasks and each task is executed by the individual processors constituting a hierarchical multiprocessor in which a plurality of clusters, each composed of a plurality of processors, are connected by an interconnection network, the method comprising:
(a) setting the start value of the loop variable of the outer loop in an inter-cluster communication register that can be referenced and updated from all clusters;
(b) a specific one of the plurality of processors in each cluster exclusively acquiring the value of the inter-cluster communication register and increasing the original value by the increment of the loop variable of the outer loop;
(c) when the value acquired in step b is not larger than the final value of the loop variable of the outer loop, the specific processor setting the acquired value in a first intra-cluster shared area that can be referenced from all processors in its own cluster, and setting the start value of the loop variable of the inner loop in an intra-cluster communication register that can be referenced and updated from all processors in its own cluster;
(d) when the value acquired in step b is larger than the final value of the loop variable of the outer loop, the specific processor setting an end-of-processing indication in a second intra-cluster shared area that can be referenced from all processors in its own cluster, and synchronizing among all clusters;
(e) synchronizing all processors in each cluster after step c or step d is completed;
(f) after synchronization is established in step e, each processor in each cluster referring to the second intra-cluster shared area in its own cluster;
(g) when the second intra-cluster shared area referred to in step f indicates the end of processing, ending the processing of the intra-procedure loop in that cluster;
(h) when the second intra-cluster shared area referred to in step f does not indicate the end of processing, each processor in each cluster exclusively acquiring the value of the intra-cluster communication register in its own cluster and increasing the original value by the increment of the loop variable of the inner loop;
(i) when the value acquired in step h is not larger than the final value of the loop variable of the inner loop, the processor that acquired the value executing, as one task, the processing portion of the intra-procedure loop determined by the acquired loop variable of the inner loop and the loop variable of the outer loop set in the first intra-cluster shared area; and
(j) when the value acquired in step h is larger than the final value of the loop variable of the inner loop, synchronizing all processors in each cluster and returning to step b.

3. The parallel processing method in a hierarchical multiprocessor according to claim 2, wherein synchronization among all clusters is performed, with the specific processor of each cluster acting as its representative, using an inter-cluster communication register for synchronization that can be referenced and updated from all clusters, and synchronization of all processors in each cluster is performed using an intra-cluster communication register for synchronization that can be referenced and updated from all processors in that cluster.