JPH06149752A

JPH06149752A - Barrier synchronization system for parallel computer

Info

Publication number: JPH06149752A
Application number: JP4302216A
Authority: JP
Inventors: Hideki Saito; 秀樹齋藤
Original assignee: Kubota Corp
Current assignee: Kubota Corp
Priority date: 1992-11-12
Filing date: 1992-11-12
Publication date: 1994-05-31

Abstract

PURPOSE: To raise the throughput of the whole system by performing the coherence operation, which is a source of overhead, quickly and only to the minimum extent necessary, while giving essentially the same synchronization timing as a conventional system. CONSTITUTION: A parallel computer is provided with a plurality of cache memories CM, which are accessed by processors P0, P1, P2, and P3 and store parts of the contents of a main memory, between the processors and the main memory connected by a network. The barrier synchronization system is configured so that execution of the next series of processes is allowed after each processor recognizes the end of a series of processes executed by the processors, on the basis of the value of a synchronization variable 'count' that is updated each time a processor finishes its processing. Of the updates to the synchronization variable, the coherence operation for the variable is performed only when the barrier synchronization condition is met.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a barrier synchronization method for a parallel computer in which a plurality of cache memories, each accessed by the processors and storing part of the contents of a main memory, are provided between a plurality of processors and the main memory connected by a network, and which is configured so that each processor is allowed to execute the next series of processes after it has recognized the end of a series of processes executed by the plurality of processors, based on the value of a synchronization variable that is updated each time a processor finishes its processing.

[0002]

[Prior Art] Conventionally, when a parallel computer executes a computation, high-speed processing is achieved by composing a series of processes from parallelized loops and assigning each loop to a processor. The barrier synchronization method, which allows the next series of processes to be executed after each processor has recognized the end of every loop, has been realized as follows. The synchronization variable is mapped to the main memory, and each processor updates the synchronization variable through its cache memory when it finishes processing the loop assigned to it. A processor that has completed its own loop then enters a spin-wait state, periodically referencing the cache in which the synchronization variable is allocated, until the variable reaches the value indicating that the barrier synchronization condition is satisfied, that is, until the other processors have finished their loops.
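As a rough illustration of the conventional scheme just described, the sketch below uses Python threads to stand in for processors and a shared counter to stand in for the synchronization variable. It does not model caches or the network; the names `SpinBarrier` and `run_body` are hypothetical and do not come from the patent.

```python
import threading
import time


class SpinBarrier:
    """Single-use counter barrier: decrement on arrival, then spin until zero."""

    def __init__(self, num_processors):
        self.count = num_processors      # shared synchronization variable
        self._lock = threading.Lock()    # makes the decrement atomic

    def arrive_and_wait(self):
        with self._lock:
            self.count -= 1              # this processor's loop has finished
        while self.count != 0:           # spin-wait on the shared variable
            time.sleep(0)                # yield so other threads can run


def run_body(num_processors=4):
    """Run one parallel body; return how many arrivals each worker saw on release."""
    barrier = SpinBarrier(num_processors)
    arrived = []
    seen_on_release = {}
    log_lock = threading.Lock()

    def worker(pid):
        with log_lock:
            arrived.append(pid)          # stands in for finishing the assigned loop
        barrier.arrive_and_wait()
        with log_lock:
            seen_on_release[pid] = len(arrived)

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(num_processors)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen_on_release
```

Every worker should observe that all processors have arrived by the time it leaves the barrier, which is the property the spin-wait is meant to guarantee.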

[0003]

[Problems to Be Solved by the Invention] However, in the conventional barrier synchronization method described above, a coherence operation to keep data consistent among the caches is required every time the synchronization variable is updated, and the overhead of the synchronization operation itself lowers the throughput of the whole system. Where the coherence operation uses the commonly employed invalidation protocol, in which the main memory notifies the cache memories that the data is invalid and each cache block-transfers the data from the main memory as miss handling when its processor next accesses it, there is the further drawback that every update of the synchronization variable by any processor causes competing accesses from the other processors to concentrate, degrading the performance of the network. For example, as shown in FIG. 4, in a parallel computer made up of four processors P0, ..., P3, of the updates (1, 2, 3, 4) to the synchronization variable made after the loop assigned to each processor finishes, updates (2, 3, 4) involve coherence operations (11, 12) and (13, 14-16) using the invalidation protocol, and data of the normal cache line size is transferred by the remote accesses (17-22), at which point memory contention (23-26) occurs. In addition, the amount of data transferred from the main memory to the cache memories on each coherence operation is larger than the amount inherently needed for barrier synchronization, which limits how far the overhead can be reduced.
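To make the cost of the invalidation protocol concrete, the following sketch counts block re-fetches for one barrier under a deliberately simplified model. The model is an assumption, not taken from the patent's figures: processors finish one at a time, each update invalidates the copy held by every processor that is already spin-waiting, and each of those processors then misses once and re-fetches the whole block. The function name is hypothetical.

```python
def conventional_refetches(num_processors):
    """Count block re-fetches for one barrier under an invalidation protocol.

    Assumed model: each update of the synchronization variable invalidates
    the cached copy of every processor already spinning at the barrier, and
    each of those processors re-fetches the block exactly once.
    """
    refetches = 0
    spinning = 0                      # processors already waiting at the barrier
    for _ in range(num_processors):   # one update per finishing processor
        refetches += spinning         # every spinner misses and re-fetches
        spinning += 1                 # the updater now starts spinning too
    return refetches
```

Under this model the count grows as n(n-1)/2, so four processors incur six block re-fetches per barrier, all of them contending for the same memory location.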

[0004] An object of the present invention is to provide a barrier synchronization method for a parallel computer that raises the throughput of the whole system by performing the coherence operation, which is the source of the overhead, quickly and only to the minimum extent necessary, while giving essentially the same synchronization timing as the conventional method.

[0005]

[Means for Solving the Problems] To achieve this object, the barrier synchronization method for a parallel computer according to the present invention is characterized in that caching of the synchronization variable is allowed and, of the updates to the synchronization variable, the coherence operation for the synchronization variable is performed only when the barrier synchronization condition is satisfied. In this configuration, the coherence operation is preferably a write-update type coherence operation. Furthermore, the data transferred by the coherence operation preferably consists only of the minimum data indicating that the barrier synchronization condition is satisfied.

[0006]

[Operation] A processor that has finished its own loop only needs to be able to determine, from the synchronization variable, when the barrier synchronization condition becomes satisfied; the successive values that the variable takes before that point are inherently unnecessary. Given this, if the coherence operation for the synchronization variable is performed only for the update that satisfies the synchronization condition, essentially the same barrier synchronization timing as the conventional method is obtained, and there is no harm in tolerating temporary inconsistency among the copies of the synchronization variable held in the cache memories in the meantime.

[0007] As the protocol for the coherence operation, which is performed only on the update of the synchronization variable that satisfies the barrier synchronization condition, overhead can be reduced further by using, instead of the commonly used invalidation protocol (in which the main memory notifies the cache memories that the data is invalid and each cache block-transfers the data from the main memory as miss handling when its processor accesses it), a write-update protocol, in which the main memory block-transfers the valid data to the cache memories.

[0008] Furthermore, by restricting the data transferred at that time to the minimum data indicating that the barrier synchronization condition is satisfied, that is, the synchronization variable and its associated data, the data transfer time is shortened and the overhead is reduced further.
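The saving from transferring only the minimum data can be estimated with simple arithmetic. The sketch below compares a whole-block broadcast with a counter-only broadcast; all three sizes are hypothetical values chosen for illustration, since the patent does not state block, counter, or address sizes.

```python
BLOCK_BYTES = 32     # hypothetical cache block (line) size
COUNTER_BYTES = 4    # hypothetical size of the synchronization variable
ADDRESS_BYTES = 4    # hypothetical size of the accompanying address information


def broadcast_bytes(num_caches, payload_bytes, address_bytes=ADDRESS_BYTES):
    """Total bytes sent when one payload plus its address goes to every cache."""
    return num_caches * (payload_bytes + address_bytes)


def compare_transfers(num_caches=4):
    """Return (whole-block bytes, counter-only bytes) for one barrier release."""
    whole_block = broadcast_bytes(num_caches, BLOCK_BYTES)
    counter_only = broadcast_bytes(num_caches, COUNTER_BYTES)
    return whole_block, counter_only
```

With these assumed sizes and four caches, the whole-block update moves 144 bytes while the counter-only update moves 32 bytes; the actual ratio depends on the real block size.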

[0009]

[Effects of the Invention] The present invention makes it possible to provide a barrier synchronization method for a parallel computer that raises the throughput of the whole system by performing the coherence operation, which is the source of the overhead, quickly and only to the minimum extent necessary, while giving essentially the same synchronization timing as the conventional method.

[0010]

[Embodiment] An embodiment is described below. As shown in FIG. 3, a parallel computer C is configured by providing a plurality of cache memories CM, each connected to one of a plurality of processors P0, P1, P2, P3 via a first network B1, and connecting the cache memories CM to a main memory MM via a second network B2. High-speed processing is achieved by assigning each loop of a series of processes composed of parallelized loops to one of the processors P0, P1, P2, P3.

[0011] The main memory MM consists of a memory controller MMC, which controls data transfer to and from the cache memories CM and manages the memory's own data, and a data section MMD, which stores data. The data section MMD of the main memory MM and the data section CM3 of each cache memory CM are divided into blocks of the same size, and data is normally transferred block by block.

[0012] Each cache memory CM consists of a directory section CM2, a data section CM3 that holds part of the data read from the main memory MM, and a cache control section CM1 that controls reading and writing of data to and from the main memory MM in response to accesses from the processors P.

[0013] The directory section CM2 has a capacity corresponding to the number of sets and ways of the cache memory. Each entry consists of an address tag section CM21 giving the location of the corresponding data in the main memory MM, that is, the block address; a valid bit V indicating whether the way is allocated as the block identified by the address tag, that is, whether the data is valid or invalid; and an attribute section CM22 giving the start address and amount of the data that should actually be transferred within the block to be allocated to the way. The valid bit V is a flag used for invalidation-type coherence control: it is reset by a control signal from the main memory MM indicating that data consistency can no longer be maintained, and set when a block is allocated as the target of a data update, that is, at the start of the update. The status bit B is a flag that is set at the start of an update and reset at its end.
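The directory entry described above can be pictured as a small record. The sketch below is one possible software rendering, with hypothetical field and method names; the patent describes hardware, and the field widths and exact signalling are not specified there.

```python
from dataclasses import dataclass


@dataclass
class DirectoryEntry:
    """One directory entry of CM2 (field names are illustrative only)."""
    address_tag: int          # CM21: block address in main memory MM
    valid: bool = False       # valid bit V
    busy: bool = False        # status bit B
    transfer_start: int = 0   # CM22: start address of the data actually transferred
    transfer_length: int = 0  # CM22: amount of data actually transferred

    def begin_update(self):
        """Start of an update: V and B are both set."""
        self.valid = True
        self.busy = True

    def end_update(self):
        """End of an update: B is reset, V stays set."""
        self.busy = False

    def invalidate(self):
        """Control signal from main memory: consistency lost, V is reset."""
        self.valid = False
```

For example, an entry that goes through `begin_update()` and `end_update()` remains valid until `invalidate()` is called.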

[0014] The cache control section CM1 consists of a comparison section that compares address information received via the first network B1 with the block address in the address tag section CM21; a data transfer control section that exchanges the relevant data with the processor P when the comparison section finds the addresses equal and the valid bit V is set (hereinafter a "cache hit"); a data replacement section that performs data replacement with the main memory MM via the second network B2 when the comparison section finds a mismatch or the valid bit V is reset (hereinafter a "cache miss"); and logic circuits that control these operations.
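The hit and miss conditions above reduce to a tag comparison combined with the valid bit. A minimal sketch, assuming each way of a set is represented as a `(tag, valid)` pair and using the hypothetical function name `lookup`:

```python
def lookup(ways, block_address):
    """Return the index of the hitting way, or None on a cache miss.

    A hit requires both a matching address tag and a set valid bit;
    a tag mismatch or a reset valid bit counts as a miss.
    """
    for index, (tag, valid) in enumerate(ways):
        if tag == block_address and valid:
            return index
    return None
```

A way whose tag matches but whose valid bit is reset is treated exactly like a tag mismatch, which is what triggers the replacement path toward the main memory.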

[0015] The following describes how the parallel computer above executes an application that combines several series of processes, each composed of parallelized loops (each series is called a "body"). Each of the processors P0, P1, P2, P3 executes the loop assigned to it, then performs barrier synchronization based on the algorithm shown in FIG. 2, and moves on to the next body.

[0016] This is described in detail below with reference to FIG. 1. A counter "count" that holds the synchronization variable is mapped to one block of the data section MMD of the main memory MM (here two counters are mapped, on the assumption that multiple bodies will be executed), and the number of processors, 4, is set as its initial value. When each processor P0, P1, P2, P3 finishes its own loop, it issues the decrement instruction "fetch and decrement" for the counter "count" to its cache memory CM. That cache memory CM fetches from the main memory MM the value of the counter "count" decremented by 1. (Alternatively, the value of the counter "count" may be fetched first and the value decremented by 1 then stored in the main memory MM.) Each time a processor executes this instruction, the value of the counter "count" in the main memory MM decreases and diverges from the copies of "count" allocated in the other cache memories, but the memory controller MMC performs no invalidation-type coherence control on the cache memories CM until the barrier condition is satisfied. Suppose the processors finish in the order P0, P3, P2, P1. When the barrier condition is satisfied, that is, when the value of the counter "count" reaches 0, the memory controller MMC performs write-update type coherence control on every cache memory CM: it transfers the counter "count" data, now 0, to all the cache memories CM. The transfer does not carry all the data of the block to which the counter "count" is mapped; it carries only the counter "count" data, together with the address information identifying where in that block the counter is mapped.
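The sequence in this paragraph can be traced with a small single-threaded model. This is a sketch under stated assumptions rather than the patented hardware: the class name `MemoryController`, the address `0x100`, and the dictionary-based caches are all hypothetical, and every cache is assumed to start with a copy of the initial counter value.

```python
class MemoryController:
    """Toy model of the MMC: pushes "count" to the caches only when it hits zero."""

    def __init__(self, num_processors, count_address=0x100):
        self.count_address = count_address            # hypothetical address of "count"
        self.memory = {count_address: num_processors}
        # Assumption: every cache starts with a copy of the initial value.
        self.caches = [{count_address: num_processors} for _ in range(num_processors)]
        self.coherence_transfers = []                 # (cache index, address, value)

    def fetch_and_decrement(self, pid):
        """Decrement "count" in main memory and return the new value to cache pid."""
        value = self.memory[self.count_address] - 1
        self.memory[self.count_address] = value
        self.caches[pid][self.count_address] = value  # only the requester is updated
        if value == 0:                                # barrier condition satisfied
            for index, cache in enumerate(self.caches):
                cache[self.count_address] = 0         # write-update: counter only
                self.coherence_transfers.append((index, self.count_address, 0))
        return value

    def barrier_passed(self, pid):
        """A processor may proceed once its own cached copy of "count" is zero."""
        return self.caches[pid][self.count_address] == 0
```

Replaying the order P0, P3, P2, P1 from the text, the first three decrements leave stale values in the other caches and trigger no coherence transfer; only the fourth produces one counter-sized update per cache, after which every processor sees zero.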

[0017] When each of the processors P0, P1, P2, P3 confirms that the value of the counter "count" in its cache memory CM is 0, it starts processing the next body. The same barrier synchronization is then repeated until the application completes.

[0018] Other embodiments are described below. The parallel computer described in the embodiment above has a cache memory dedicated to each processor, but the invention applies equally to a parallel computer in which one of the cache memories is shared by some of the processors. A body also need not be executed by assigning loops to all the processors; it may be executed by any number of processors. In that case, the initial value of the synchronization variable should be set to the number of processors responsible for executing the body. Furthermore, different bodies can then be executed by different groups of processors under the same barrier synchronization control.

[Brief description of drawings]

FIG. 1 is a flowchart of barrier synchronization.

FIG. 2 is an explanatory diagram showing the barrier synchronization algorithm.

FIG. 3 is a block diagram of the parallel computer.

FIG. 4 is a flowchart of barrier synchronization in a conventional example.

[Explanation of symbols]

CM: cache memory
MM: main memory
P0, P1, P2: processors

Claims

1. A barrier synchronization method for a parallel computer in which a plurality of cache memories, each accessed by the processors and storing part of the contents of a main memory, are provided between a plurality of processors and the main memory connected by a network, the method being configured so that each processor is allowed to execute the next series of processes after it has recognized the end of a series of processes executed by the plurality of processors, based on the value of a synchronization variable that is updated each time a processor finishes its processing, wherein caching of the synchronization variable is allowed and, of the updates to the synchronization variable, a coherence operation for the synchronization variable is performed only when a barrier synchronization condition is satisfied.

2. The barrier synchronization method for a parallel computer according to claim 1, wherein the coherence operation is a write-update type coherence operation.

3. The barrier synchronization method for a parallel computer according to claim 1, wherein the data transferred by the coherence operation is composed of only a minimum amount of data indicating that the barrier synchronization condition is satisfied.