JPH02132575A - Parallel computer-vector register data flow synchronizing device and network presetting device

Info

Publication number: JPH02132575A
Application number: JP63285654A
Authority: JP
Inventors: Akira Muramatsu; 晃村松; Ikuo Yoshihara; 郁夫吉原; Yukisuke Sakota; 迫田　行介
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1988-11-14
Filing date: 1988-11-14
Publication date: 1990-05-22
Anticipated expiration: 2013-06-18
Also published as: JP2765882B2

Abstract

PURPOSE: To reduce synchronization overhead such as memory locking by providing means for writing information from a host computer to the same address in the memories of all element processors at once, together with data flow synchronizing means. CONSTITUTION: The device includes broadcast means 24 that writes information from a host computer 1 to the same address in the memories of all element processors 2 at once, all-synchronizing means 26 that detects the end of processing by the processors 2, and an interconnection network 3 used to transfer information between arbitrary processors 2. In addition, data flow synchronizing means is provided, consisting of a synchronizing variable or synchronizing register set in each processor 2 to synchronize memory writes and reads when information is transferred, and an exclusive addition/subtraction circuit. Synchronization overhead such as memory locking can thus be reduced.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to parallel computers, and more particularly to a parallel computer for numerical computation whose main workload is the load-distributed processing of iterative loops.

[Prior Art]

Conventional parallel computers for numerical computation include the local-memory parallel computer described in Reference 1, the shared-memory parallel computer described in Reference 2, and the multiprocessor vector computer described in Reference 3.

Reference 1: Charles L. Seitz, "The Cosmic Cube," Communications of the ACM, Vol. 28, No. 1, pp. 22-33, 1985.

Reference 2: Allan Gottlieb et al., "The NYU Ultracomputer - Designing an MIMD Shared Memory Parallel Computer," IEEE Transactions on Computers, Vol. C-32, No. 2, pp. 175-189, 1983.

Reference 3: Kazuya Terauchi, "CRAY-2 supercomputer with 2 GB of main memory and liquid immersion cooling," Nikkei Electronics, December 16, 1985 issue, pp. 195-209, 1985.

With the local-memory parallel computer of Reference 1, the problem to be solved is partitioned to match the configuration of the parallel computer in use, and a program is written for each element processor. When element processors exchange data, they issue data send/receive instructions, for example SEND and RECEIVE instructions. When sequential processing is required, one of the element processors executes it after synchronizing with the others.

In a shared-memory parallel computer, data is not partitioned but placed in shared memory, and the program is partitioned or copied and run on each element processor. Element processors therefore do not need to exchange data with send/receive instructions; they read and write the shared memory instead. To control the order of those reads and writes, the element processor that defines a datum and the element processor that references it must synchronize. Memory lock and unlock procedures are the typical synchronization means.

A multiprocessor vector computer is likewise of the shared-memory type: data shared between element processors is placed in shared memory and is read and written under lock/unlock control. Consequently, vector processing (which uses vector registers) can be run in parallel only when there are no dependencies among the variables in the loop.

Reference 4 below describes an example of building a shared memory on top of distributed memory. In this example an element processor accesses data in its own memory quickly, but access to memory in another element processor is slow because it goes through the network.

Reference 4: G. F. Pfister et al., "The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture," Proceedings of the 1985 International Conference on Parallel Processing, pp. 764-771, 1985.

[Problems to Be Solved by the Invention]

First, the local-memory parallel computer has the major problem that the user must partition the problem with the configuration of the parallel computer in mind.

The remaining three types (the shared-memory parallel computer, the multiprocessor vector computer, and the parallel computer with distributed shared memory) have the following problems.

(1) In a shared-memory parallel computer, per-processor performance is not as high as that of a vector computer, so raising overall system performance requires connecting many processors. This increases the amount of hardware in the unit that connects the element processors to the shared memory, lengthens memory access time, and causes problems such as memory access contention. In particular, when data is shared among several element processors, synchronization overhead such as memory locking grows, and performance does not improve even when many element processors are connected.

(2) In a multiprocessor vector computer, per-processor performance is high, so there is little need to connect many element processors. Synchronization overhead such as memory locking is still large, however, and vector processing of loops with data dependencies cannot be run in parallel.

(3) A parallel computer with distributed shared memory can access data quickly when the data resides in the processor's own memory, and no memory contention occurs, so the scheme is well suited to connecting many element processors. Exchanging data between element processors, however, takes communication time. Moreover, once data has been allocated across the distributed memory, program transformations of the kind a vector computer relies on, such as interchanging the inner and outer loops to obtain loop independence for vector processing, are no longer possible, so more dependent loops inevitably have to be handled. This increases the synchronization overhead between element processors.

An object of the present invention is to provide a parallel computer that achieves high performance without requiring the user to partition the problem with the configuration of the parallel computer in mind.

Another object of the present invention is to provide a synchronization means that resolves the large synchronization overhead of memory locking and the like, a problem common to multiprocessor vector computers and parallel computers with distributed shared memory, together with a parallel computer and a vector computer that use this means.

Yet another object of the present invention is to provide a vector computer that resolves a problem specific to multiprocessor vector computers, namely that vector processing of loops with data dependencies cannot be run in parallel.

A further object of the present invention is to provide a parallel computer that resolves the problem of slow communication, which is especially pronounced in parallel computers with distributed shared memory.

[Means to solve the problem]

To solve the above problems, the parallel computer of the present invention comprises: broadcast means for writing information from the host computer to the same address in the storage devices of all element processors at once; all-synchronization means for detecting that all element processors have finished processing; an interconnection network for exchanging information between arbitrary element processors; and data flow synchronization means consisting of a synchronization variable or synchronization register provided in each element processor, together with its exclusive add/subtract circuit, for synchronizing writes to and reads from the storage device when information is exchanged.

In a preferred embodiment, each element processor has a vector processing unit, as in a multiprocessor vector computer, and the invention provides a data flow synchronization device between vector registers. This device consists of: means for setting up a path that sends data directly from a vector register of one element processor to the vector registers of one or more other element processors; vector registers with a tag field for each word, such that data can be written to the vector register when the tag value is 0 and read from it when the tag value is 1; and means for manipulating the tag field values. In addition, to reduce communication overhead, the invention provides a network preset device consisting of a network connection pattern setting circuit, which sets the connection pattern of the interconnection network before the network is used, and a storage address generation circuit, which converts the number of the sending element processor into the vector register address, or the storage-area address in the storage device, at which data arriving from that processor is to be stored.

[Operation]

The end of one iterative loop executed in parallel is detected quickly by the all-synchronization means, which detects completion of processing by all element processors, and the next loop to be executed in parallel is started quickly by the broadcast means, which writes information from the host computer to the same address in the storage devices of all element processors at once. The synchronization needed to satisfy the data dependencies between the two loops is thus achieved quickly.

Furthermore, after the producer element processor transfers information to the consumer element processor, it increments the consumer's synchronization variable or synchronization register by 1 using the exclusive add/subtract circuit. The consumer, if the content of its own synchronization variable or synchronization register is positive, decrements it by 1 using the exclusive add/subtract circuit and then references the transferred information. (Alternatively, after the consumer element processor has referenced the producer's information, it increments the producer's synchronization variable or synchronization register by 1 using the exclusive add/subtract circuit; the producer, if the content of its own synchronization variable or synchronization register is positive, decrements it by 1 using the exclusive add/subtract circuit and then redefines the information.) Shared data can thus be accessed without memory lock and unlock procedures, and even when data dependencies span element processors within a single iterative loop, the loop can be processed in parallel without excessive overhead.

In particular, when the element processors have vector processing units, vector processing of a loop with data dependencies spanning element processors proceeds as follows. The path-setting means between vector registers of different element processors sets up, according to the data flow that expresses the dependency, a path that sends data directly from a vector register of one element processor to the vector registers of one or more others. If the tag field of a vector register word indicates that data has arrived, the content of the word is fed to the vector arithmetic unit; if the tag field indicates that data has not yet arrived, data is written there. With the inter-vector-register data flow synchronization device of the invention, vector processing of loops with data dependencies spanning element processors can therefore be run in parallel.
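The per-word tag behaviour of the vector registers can be pictured with a small software model. This is only an illustrative sketch: the class and method names are hypothetical, and the choice that a successful read clears the tag back to 0 is an assumption (the text only says that a means for manipulating the tag exists).

```python
class TaggedVectorRegister:
    """Vector register whose words each carry a one-bit tag (0 = empty, 1 = full)."""

    def __init__(self, length):
        self.words = [0.0] * length
        self.tags = [0] * length

    def try_write(self, index, value):
        # A write is accepted only while the tag is 0 (data has not yet arrived).
        if self.tags[index] != 0:
            return False
        self.words[index] = value
        self.tags[index] = 1
        return True

    def try_read(self, index):
        # A read is accepted only while the tag is 1 (data has arrived).
        if self.tags[index] != 1:
            return None
        self.tags[index] = 0  # assumption: reading consumes the word
        return self.words[index]
```

A sender that finds the tag at 1, or a reader that finds it at 0, simply retries later, which is how the tag replaces an explicit lock in this model.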

Furthermore, if the communication paths are fixed in advance by the network connection pattern setting circuit, the destination of each transfer no longer has to be decoded in order to set the switches, and the destination itself need not be sent. And because the storage address generation circuit lets the receiving hardware generate the storage location of incoming data, no address has to be sent either, which reduces the amount of communication.
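As a rough illustration of receiver-side address generation, the sketch below keeps one preset store address per sending element processor and returns it when a word arrives. The names are hypothetical, and the rule that the address advances by one word length after each arrival is an assumption made for the example, not something stated in this passage.

```python
class StorageAddressGenerator:
    """Maps a source element processor number to the address for its next word."""

    def __init__(self, word_length):
        self.word_length = word_length
        self.next_address = {}  # source PE number -> next store address

    def preset(self, source_pe, base_address):
        # Loaded before the loop starts, e.g. by a broadcast from the host.
        self.next_address[source_pe] = base_address

    def address_for(self, source_pe):
        # The sender transmits only data; the receiver derives the address.
        address = self.next_address[source_pe]
        self.next_address[source_pe] = address + self.word_length  # assumed auto-increment
        return address
```

With this arrangement a message needs to carry neither a destination nor a store address, which is the saving the paragraph describes.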

[Embodiments]

Embodiments of the present invention are described in detail below with reference to the drawings.

Embodiment 1

FIG. 2 shows the overall configuration of the parallel computer of the invention. A number of element processors 2 are connected under a single ordinary sequential computer called the host computer 1, and the element processors are linked by an interconnection network 3. Between the host computer 1 and the element processors 2 run a coupling bus for exchanging control signals and data, and all-synchronization signal lines that carry each element processor's processing-end signal from the element processors 2 to the host computer 1. The host computer 1 broadcasts information to the element processors 2 over the coupling bus 4. The all-synchronization signal lines are ANDed along the way by AND circuit 6, so the all-synchronization signal reaches the host computer 1 only when every element processor has finished operating. The interconnection network 3 is assumed to be able to connect any pair of element processors.

FIG. 1 shows one element processor of the parallel computer of the first embodiment together with the interconnection network. The element processor 2 is an ordinary sequential computer consisting of a processing unit 21, a memory control unit 22, a local memory 23, a SEND unit 24, a RECEIVE unit 25, and an all-synchronization register 26. The processing unit 21 is a conventional CPU. The memory control unit 22 arbitrates requests for access to the local memory 23 from the processing unit 21, the SEND unit 24, the RECEIVE unit 25, and the host computer 1; it may include a dynamic address translation unit. The processing unit 21, memory control unit 22, and local memory 23 are the same as in an ordinary computer and are not directly related to the invention, so they are not described further.

The SEND unit 24 sends data to the local memory 23 of another element processor 2 at the direction of the processing unit 21 or the memory control unit 22. The RECEIVE unit 25 receives such data and writes it to the local memory 23 via the memory control unit 22. These units may also perform more complex operations, such as sending and answering data requests, but those are not directly related to the invention and are not described further.

The interconnection network 3 may be of any kind that can connect arbitrary element processors. FIG. 1 shows a fully connected full crossbar switch; the detailed circuit configuration of the switch is shown in FIGS. 3 to 5. The information sent out by the SEND unit 24 consists of (destination element processor address, address of the write area, value of the write data). When the information arrives over data line 10 at distributor 31 of the crossbar switch, the destination element processor address is decoded by decoder 312 (FIG. 3), the corresponding selector 32 (FIG. 1) is chosen, and distributor 311 selects whichever of the data paths 34-1 to 34-3 leads to it (FIG. 3). At the same time, a control signal indicating that information is on the data path is output on the corresponding one of signal lines 33-1 to 33-3. Each selector 32 picks one of the transmission requests arriving simultaneously and passes it to the RECEIVE unit 25 of its element processor 2.

This operation is explained with reference to FIGS. 4 and 5. When data arrives at selector 32 on data paths 34-1 to 34-3 and control signals arrive on signal lines 33-1 to 33-3, the signal lines 33-1 to 33-3 are fed through address register 324 into ROM 322. ROM 322 is (in this example) a memory accessed by a 5-bit address: the first 2 bits are the previous output of ROM 322 (3 bits) encoded by encoder 323, and the remaining 3 bits are the signal lines 33-1 to 33-3. FIG. 5 shows an example of the contents of ROM 322; the row headings on the left give the 3-bit address and the column headings at the top give the 2-bit address. The 2-bit address is 00 if the previous output of ROM 322 was 100, 01 if it was 010, and 10 if it was 001. For example, a leading 2-bit address of 00 means that the previous output was 100, that is, data path 34-1 was selected last time. Accordingly, the addresses whose leading 2 bits are 00 hold output patterns that select another data path, except when an output request is present only on data path 34-1 (remaining 3 bits 100). In this way every output request is accepted fairly. The RECEIVE unit 25 receives (address of the write area, value of the write data) and writes the data to the local memory 23 via the memory control unit 22.

In the present invention, the program to be executed is divided into

A. iterative loops (array definition parts, excluding input/output),
B. other sequential processing parts, and
C. input/output parts.

A is assigned to the array of element processors 2, and B and C to the host computer 1, for execution. In the program of the host computer 1, in place of the statements of A, for example

      DO 10 I=1,100
      DO 10 J=1,100
      A(I,J)=......
   10 CONTINUE

an instruction that directs the array of element processors 2 to execute A, such as

      START TASK10

is written. This instruction broadcasts the entry address t of TASK10, the program part corresponding to A, and writes it to the same address #P in the storage devices of all element processors.
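The selector arbitration described earlier in this passage (a ROM addressed by 2 bits of previous grant plus 3 request bits) can be sketched as a table generator. The actual contents of FIG. 5 are not reproduced in the text, so the round-robin fill below is an assumption: it is merely one table consistent with the stated rule that the previously selected path is chosen again only when it is the sole requester. Function names are hypothetical.

```python
def encode_previous(grant):
    # One-hot previous grant -> 2-bit code, as stated in the text:
    # 100 -> 00, 010 -> 01, 001 -> 10.
    return {0b100: 0b00, 0b010: 0b01, 0b001: 0b10}[grant]


def build_arbiter_rom():
    """Return a 32-entry table: address = (previous code << 3) | request bits."""
    rom = [0] * 32  # addresses with previous code 11 are unused and stay 0
    for prev_code in range(3):
        for requests in range(8):
            grant = 0
            # Search the paths in rotating order, starting just after the
            # path granted last time, so that path gets the lowest priority.
            for offset in range(1, 4):
                path = (prev_code + offset) % 3  # 0, 1, 2 = paths 34-1, 34-2, 34-3
                bit = 0b100 >> path
                if requests & bit:
                    grant = bit
                    break
            rom[(prev_code << 3) | requests] = grant
    return rom
```

For example, with path 34-1 granted last time (code 00) and requests 110, this table grants 010, while requests 100 alone still grant 100.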

When an element processor 2 finishes the iterative loop it was given (TASK10, etc.), it waits for the entry address t' of the next task to be written to address #P, and begins executing as soon as t' has been broadcast. When execution of that program finishes, it writes 1 to its all-synchronization register 26 and again waits for the entry address of the next task to be written. The contents of the all-synchronization registers 26 are ANDed by AND circuit 6 and input to the host computer 1 as the all-synchronization signal.

Consequently, as soon as all element processors 2 have finished processing, that state is conveyed to the host computer 1.
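The broadcast-and-wait cycle between the host and the element processors can be modelled in a few lines. This is a single-threaded sketch under stated assumptions: all class, method, and constant names are hypothetical, `ENTRY_SLOT` stands in for address #P, and clearing the all-synchronization register at broadcast time is an assumption, since the text does not say when the register is reset.

```python
ENTRY_SLOT = 0  # stands in for the common address #P


class ElementProcessor:
    def __init__(self, pe_id):
        self.pe_id = pe_id
        self.memory = {ENTRY_SLOT: None}
        self.sync_register = 0  # all-synchronization register
        self.log = []

    def step(self, tasks):
        # Poll address #P; run the task once an entry address has been written.
        entry = self.memory[ENTRY_SLOT]
        if entry is None:
            return
        self.memory[ENTRY_SLOT] = None
        tasks[entry](self)
        self.sync_register = 1  # report completion


class Host:
    def __init__(self, processors):
        self.processors = processors

    def broadcast(self, address, value):
        # One logical write reaches the same address in every local memory.
        for pe in self.processors:
            pe.memory[address] = value
            pe.sync_register = 0  # assumption: cleared when a new task starts

    def all_sync(self):
        # Models the AND of all all-synchronization registers.
        return all(pe.sync_register == 1 for pe in self.processors)
```

The host would call `broadcast(ENTRY_SLOT, "TASK10")`, then continue with its own sequential work until `all_sync()` becomes true.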

As described above, the synchronization required between one iterative loop and a following loop that depends on it is achieved quickly using the broadcast means and the hardware all-synchronization means. Loops that are mutually independent, with no dependencies on one another, are grouped and executed together.

Next, the handling of data dependencies inside a single iterative loop is described. Taking a FORTRAN program as an example, when the loop

      DO 10 I=1,100
      DO 10 J=1,100
      A(I,J)=A(I-1,J)+B(J)
   10 CONTINUE

(FORTRAN program example 1) is processed in parallel over I, each element processor 2 takes charge of the inner loop

      DO 10 J=1,100
      A(I,J)=A(I-1,J)+B(J)
   10 CONTINUE

for one particular I.

Here each element A(I-1,J), J=1,100, must be obtained, after it has been defined, from the element processor in charge of the next lower I. In other words, processing is sequential with respect to I. With respect to J, however, the element processors are independent, so if the element processor in charge of the next lower I sends its defined values one after another in order of J, they can be processed in pipeline fashion, which makes parallel processing possible. Thus even an iterative loop with data dependencies can be processed in parallel. For parallel processing of such dependent loops, the invention uses a synchronization variable 231 reserved in the local memory 23, or a dedicated synchronization register 232, together with an exclusive add/subtract circuit 211 that increments or decrements the synchronization variable 231 or synchronization register 232 by 1 under mutual exclusion, as follows.
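The pipelining argument can be checked with a small simulation. The sketch below is illustrative only: function names are hypothetical, rows are indexed from 0 with row 0 given as input, and the schedule (element processor I handles column J at time step I-1+J) is an idealized assumption with zero communication delay.

```python
def sequential(a0, b, n_i):
    """Reference result: rows 1..n_i computed one after another."""
    a = [list(a0)]
    for i in range(1, n_i + 1):
        a.append([a[i - 1][j] + b[j] for j in range(len(b))])
    return a


def pipelined(a0, b, n_i):
    """Each row i is owned by one element processor; rows overlap in time."""
    n_j = len(b)
    a = [list(a0)] + [[None] * n_j for _ in range(n_i)]
    steps = 0
    for t in range(n_i + n_j - 1):
        for i in range(1, n_i + 1):
            j = t - (i - 1)  # column handled by processor i at time step t
            if 0 <= j < n_j:
                # The value from the row above was produced one step earlier.
                assert a[i - 1][j] is not None
                a[i][j] = a[i - 1][j] + b[j]
        steps += 1
    return a, steps
```

With 4 rows and 3 columns the pipelined schedule needs 4 + 3 - 1 = 6 time steps instead of the 12 element updates a single processor would perform one by one.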

For simplicity, the element processor in charge of index I is called element processor I. After defining A(I-1,J), element processor I-1 sends the value to element processor I and then sends element processor I control information (destination element processor address, a code indicating that this is control information, address of the synchronization variable or synchronization register). When the control information arrives, the memory control unit 22 recognizes it and interrupts the processing unit 21. The interrupt handler of the processing unit 21 uses the exclusive add/subtract circuit 211 to add 1 to the content of the synchronization variable 231 or synchronization register 232. Element processor I, for its part, checks whether the content of the synchronization variable 231 or synchronization register 232 is positive before referencing A(I-1,J), and repeats the check if it is not (busy wait). If the content is positive, it references A(I-1,J). The above is an example of a dependency in which a defined variable is subsequently referenced; a dependency in which a referenced variable is subsequently redefined is handled in the same way.
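The producer/consumer handshake for this flow dependency can be sketched with two threads. This is a software analogy, not the hardware: the names are hypothetical, a `threading.Lock` stands in for the exclusive add/subtract circuit, and the consumer decrements the counter when it finds it positive, as described in the [Operation] section.

```python
import threading


class SyncCounter:
    """Counting synchronization variable with atomic add and subtract."""

    def __init__(self):
        self._value = 0
        # The lock stands in for the exclusive add/subtract circuit.
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self._value += 1

    def try_decrement(self):
        # Decrement only if positive; report whether it happened.
        with self._lock:
            if self._value > 0:
                self._value -= 1
                return True
            return False


def run_pair(prev_row, b):
    """Processor I-1 streams A(I-1,J) to processor I, which computes A(I,J)."""
    n = len(b)
    inbox = [None] * n        # processor I's local copy of A(I-1,J)
    counter = SyncCounter()   # lives in processor I
    row = [None] * n          # A(I,J)

    def producer():           # element processor I-1
        for j in range(n):
            inbox[j] = prev_row[j]  # send the data first ...
            counter.increment()     # ... then the control information

    def consumer():           # element processor I
        for j in range(n):
            while not counter.try_decrement():
                pass                # busy wait
            row[j] = inbox[j] + b[j]

    threads = [threading.Thread(target=consumer), threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return row
```

Because the counter counts arrivals rather than holding a single flag, the producer never has to wait for the consumer to catch up.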

That is, when the loop

      DO 10 I=1,100
      DO 10 J=1,100
      A(I,J)=A(I+1,J)+B(J)
   10 CONTINUE

(FORTRAN program example 2) is processed in parallel over I, element processor I, after referencing A(I+1,J), sends this value to element processor I+1 and then sends element processor I+1 control information (destination element processor address, a code indicating that this is control information, address of the synchronization variable or synchronization register). When the control information arrives at the synchronization variable or synchronization register, the memory control unit 22 recognizes it and interrupts the processing unit 21. The interrupt handler of the processing unit 21 uses the exclusive add/subtract circuit 211 to add 1 to its content. Element processor I+1, for its part, checks whether the content of the synchronization variable 231 or synchronization register 232 is positive before defining A(I+1,J), and repeats the check if it is not (busy wait). If the content is positive, it defines A(I+1,J).

Because the synchronization variable 231 or synchronization register 232 is a counter, in either of the above examples the element processor responsible for the smaller index I can run arbitrarily far ahead in its processing.
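The counting behaviour described above can be illustrated with a small software sketch. This is only an analogy, not the patented hardware: the class `SyncCounter`, the function `run_pipeline`, and the use of a lock in place of the exclusive addition/subtraction circuit are all assumptions made for illustration, and the check and the subtraction are folded into one call.

```python
import threading
import time


class SyncCounter:
    """Counting synchronization variable (hypothetical software stand-in)."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()  # plays the role of the exclusive add/subtract circuit

    def add(self, n):
        # Executed on behalf of the sender when control information arrives.
        with self._lock:
            self._value += n

    def try_consume(self):
        # Check whether the count is positive; if so, exclusively subtract 1.
        with self._lock:
            if self._value > 0:
                self._value -= 1
                return True
            return False


def run_pipeline(n_items):
    buffer = [None] * n_items   # receive area in the receiver's local memory
    counter = SyncCounter()
    received = []

    def sender():
        for j in range(n_items):
            buffer[j] = j * j   # define the value and "send" the data
            counter.add(1)      # then "send" the control information

    def receiver():
        for j in range(n_items):
            while not counter.try_consume():
                time.sleep(0)   # busy wait until the count is positive
            received.append(buffer[j])

    t_recv = threading.Thread(target=receiver)
    t_send = threading.Thread(target=sender)
    t_recv.start()
    t_send.start()
    t_send.join()
    t_recv.join()
    return received
```

Because the counter accumulates, the sender never waits for the receiver: it may finish all its iterations before the receiver consumes the first value, which is the "run ahead" property noted in the text.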

Embodiment 2

The overall configuration of the parallel computer, the main components of the element processors, and the way the program is divided, allocated, and executed are the same as in Embodiment 1. The description below concentrates on the differences, with reference to FIG. 6.

This embodiment adds a network preset device to Embodiment 1. That is, the decoder is removed from the distributors of the interconnection network 3, and in its place a connection pattern setting circuit 33 for the distributors 311-0 to 311-3 and a storage address generation circuit 19 are added. For FORTRAN program example 1 cited in Embodiment 1, analysis of the source program shows that data and control information must be sent from element processor I-1 to element processor I. In this embodiment, the connection pattern between element processors that the compiler derives in this way is sent to the connection pattern setting circuit 33 of the interconnection network 3 before the iterative loop starts, which fixes the connections of distributors 311-0 to 311-3. In addition, the start address of the receive area in the local memory 23 of the receiving element processor (the address of A(I-1, 1)) and its word length are stored, respectively, in one of the storage area address registers 191-1 to 191-3 and in the word length register 195 of the storage address generation circuit 19. If A(I-1, J), J = 1, 100 is allocated at the same address in every element processor I, the receive-area start address and the word length can be broadcast from host computer 1. When A(I-2, J) or the like also appears on the right-hand side, that is, when data may arrive from several element processors at the same time, the start address of each receive area is stored in the storage area address register 191-1 to 191-3 that corresponds to the sending element processor, and the word length is stored in the word length register 195. Note, however, that this device is designed mainly for iterative loops that define a single expression; a program that defines several expressions in one iterative loop is first decomposed into a sequence of iterative loops, each of which defines one expression.
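The decomposition mentioned in the last sentence corresponds to what compilers commonly call loop distribution (or loop fission). A minimal sketch follows, under the assumption that the second statement only uses values the first statement has already produced, so splitting the loop does not change the result. The function names and the second statement `c[i] = 2 * a[i]` are hypothetical and do not come from the patent.

```python
def fused(a0, b, n):
    """One loop that defines two expressions per iteration."""
    a = [a0] + [0] * n
    c = [0] * (n + 1)
    for i in range(1, n + 1):
        a[i] = a[i - 1] + b[i]   # carries a dependence from iteration i-1
        c[i] = 2 * a[i]          # uses the value just defined
    return a, c


def distributed(a0, b, n):
    """The same computation as a sequence of single-expression loops."""
    a = [a0] + [0] * n
    c = [0] * (n + 1)
    for i in range(1, n + 1):    # loop 1 defines only a
        a[i] = a[i - 1] + b[i]
    for i in range(1, n + 1):    # loop 2 defines only c
        c[i] = 2 * a[i]
    return a, c
```

Each resulting loop has a single communication pattern, which is the property the preset network relies on. Whether a given loop can be split this way depends on its dependences, and the patent text does not spell out that analysis.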

The storage area address registers 191-1 to 191-3 in the storage address generation circuit 19 correspond to the input data buses (hereinafter, input channels) of the selectors 32-0 to 32-3. The reason is that, for each element processor, knowing the input channel of its selector 32-0 to 32-3 identifies the sending element processor, so the address of the corresponding receive area can be stored in advance. In the example of the figure, the relation is

sending element processor number = receiving element processor number + input channel number of the selector + 1 (mod number of element processors).

In the present invention, therefore, so that a storage area address register 191-1 to 191-4 can be selected by input channel number, the input channel number (0, 1, 2) output by the selectors 32-0 to 32-3 of the interconnection network 3 and, as shown in FIG. 7, the 1-bit code in the transmitted information that indicates whether it is control information are fed to the decoder 192 and decoded, as shown in FIG. 6, and the result switches the selector 193. For control information (code = '1'), the register 191-4 that holds the address of the synchronization variable 231 or synchronization register 232 is selected. For data, the register among 191-1 to 191-3 that holds the storage area address for data from the sending element processor determined by the above relation is selected.
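The channel relation and the register selection above can be modelled in a few lines. The sketch also includes the post-increment by the word length, which the description goes on to explain in the next paragraph. The class and function names are hypothetical, the registers are modelled as a plain list, and nothing here is meant as the actual circuit design.

```python
def sender_of(receiver, input_channel, n_processors):
    """Sending processor implied by the receiver's selector input channel."""
    return (receiver + input_channel + 1) % n_processors


class StorageAddressGenerator:
    """Software model of the storage address generation circuit (assumed layout)."""

    def __init__(self, n_channels=3):
        self.data_regs = [0] * n_channels  # one register per input channel
        self.sync_reg = 0                  # address of the synchronization variable
        self.word_length = 0               # bytes per data word

    def next_address(self, input_channel, is_control):
        # Decoder + selector: the control bit overrides the channel number.
        if is_control:
            return self.sync_reg           # word length is treated as 0 here
        addr = self.data_regs[input_channel]
        # Adder + write control: advance the register by one word.
        self.data_regs[input_channel] = addr + self.word_length
        return addr
```

With four processors, `sender_of(1, 2, 4)` gives 0: input channel 2 of processor 1 is fed by the processor numbered one lower, as the walk-through below assumes. Successive data words on a channel receive consecutive addresses without the sender ever transmitting an address.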

When one of the storage area address registers 191-1 to 191-4 is selected, the adder 194 adds the contents of the word length register to its contents, and the sum is written back to the selected storage area address register via the write control circuit 190. This advances the address by one word. For the synchronization variable 231 or synchronization register 232, however, the word length is 0.

Using the above device, the case in which element processor I-1 defines A(I-1, J), sends it to element processor I, and the loop is pipelined by data flow synchronization is described with reference to FIG. 6.

(1) All element processors allocate A(I-1, J), J = 1, 100 starting at the same address a0.

If a processor is responsible for several values of I (written I' and so on below), then A(I'-1, J), J = 1, 100 and so on are allocated after A(I-1, J), J = 1, 100 in the area starting at address a0.

(2) The host computer stores a0 in the storage area address register 191-3 via the write control circuit. The storage area address registers 191-1 to 191-3 correspond to input channels 0 to 2 of the selectors 32-0 to 32-3 of the interconnection network 3 (shown inside the boxes of selectors 32-0 to 32-3), and input channel 2 of every selector is connected to the element processor numbered one lower (modulo the number of processors).

(3) Host computer 1 stores the word length of A in the word length register 195.

(4) The distributor pattern setting circuit 33 is set to output channel 0 for each distributor (the output channel numbers are shown to the left of distributors 311-0 to 311-3). In this example, output channel 0 of distributors 311-0 to 311-3 is connected to input channel 2 of selectors 32-0 to 32-3, because the following relation holds:

destination (receiving) element processor number = source (sending) element processor number + distributor output channel number + 1 (mod number of element processors).

(5) The synchronization variable 231 or synchronization register 232 is initialized to 0. The iterative loop starts here.

(6) Element processor 0 sends A(1,1) through the SEND unit 24.

(7) The data passes from output channel 0 of distributor 311-0 through input channel 2 of selector 32-1 to the RECEIVE unit 25 of element processor 1. At the same time, input channel number 2 of selector 32-1 is fed, together with the control information code 0 carried in the data, to the decoder 192 of element processor 1. As a result, the selector 193 selects storage area address register 191-3, and its contents (a0) are sent to the memory control unit 22 as the storage address for the received data delivered by the RECEIVE unit 25.

(8) The memory control unit 22 writes the value A(1,1) to address a0.

(9) The adder 194 adds the word length (in bytes; for example, 8 for double precision) to the a0 output by the selector 193, and a0+8 is written to storage area address register 191-3 via the write control circuit 190.

(10) Element processor 0 sends the control information through the SEND unit 24.

(11) The control information passes from output channel 0 of distributor 311-0 through input channel 2 of selector 32-1 to the RECEIVE unit 25 of element processor 1. The decoder 192 receives the control information code '1' in addition to input channel number 2 of selector 32-1. As a result, storage area address register 191-4, which holds the address of the synchronization variable or synchronization register, is selected; after the address is sent to the memory control unit 22, the interrupt handler of the processing unit 21 exclusively adds 1.

(12) Element processor 0 proceeds to the next iteration and sends A(1,2) to element processor 1.

(13) Element processor 1 writes A(1,2) to address a0+8. The contents of storage area address register 191-3 become a0+16.

(14) Element processor 0 sends control information, and element processor 1 performs the exclusive addition in response. The synchronization variable 231 or synchronization register 232 of element processor 1 now holds 2. (Element processor 0 may run ahead with its transmissions by any amount in this way.)

(15) Element processor 1 checks whether the contents of the synchronization variable 231 or synchronization register 232 are positive and, if so, exclusively subtracts 1. (If they are zero or negative, it busy-waits.)

(16) Element processor 1 reads A(1,1) from address a0 and uses it to define A(2,1). The result is sent to element processor 2.

In this way, the element processors communicate efficiently, without transmitting address information and without decoding or switching, and can carry out pipelined computation among themselves.

Embodiment 3

The overall configuration of the parallel computer, some components of the element processors, and the way the program is divided, allocated, and executed are the same as in Embodiment 2. The description below concentrates on the differences, with reference to FIGS. 8 and 9.

This embodiment extends Embodiment 2 to the case where the element processors are vector processors. Each element processor consists of a local memory 23, a scalar processor 15, a full-synchronization register 26, load/store pipes 7-1 and 7-2, vector registers 12-1 to 12-4, vector arithmetic units 14-1 to 14-3, interchange A 16, interchange B 17, a SEND pipe 8, a RECEIVE pipe 9, and a storage address generation circuit 19. Synchronization variables and synchronization registers are not used. The function of each component is described briefly below.

・Local memory 23 and scalar processor 15: an ordinary sequential computer, responsible for the processing assigned to the element processor other than vector processing.

・Full-synchronization register 26: a register for synchronizing all element processors 2. Same as in Embodiments 1 and 2.

・Load/store pipes 7-1, 7-2: units that transfer data at high speed between vector registers 12-1 to 12-4 and local memory 23. Same as those used in ordinary vector computers.

・Vector registers 12-1 to 12-4: temporary registers that hold the data used in vector operations. Unlike those in ordinary vector computers, each word has a 1-bit tag field (13-1 to 13-4), which is set to 1 when data is loaded into the vector register. A vector arithmetic unit 14-1 to 14-3 takes a word as input only if its tag field is 1, and then resets that tag field to 0. When a vector register holds constant data that is referenced repeatedly, an instruction suppresses the reset of the tag field to 0.

・Vector arithmetic units 14-1 to 14-3: same as those used in ordinary vector computers.

・Interchange A 16: a data path that interconnects vector registers 12-1 to 12-4 with load/store pipes 7-1 and 7-2, SEND pipe 8, and RECEIVE pipe 9.

・Interchange B 17: a data path that interconnects vector arithmetic units 14-1 to 14-3 with vector registers 12-1 to 12-4.

・SEND pipe 8: a unit that transfers data at high speed from vector registers 12-1 to 12-4 to the vector registers 12-1 to 12-4 of another element processor.

・RECEIVE pipe 9: a unit that stores data transferred at high speed from the vector registers 12-1 to 12-4 of another element processor into the processor's own vector registers 12-1 to 12-4 via interchange A 16.

・Storage address generation circuit 19: a unit that generates, from the receive channel, the address of the vector register 12-1 to 12-4 that is to hold the data output by RECEIVE pipe 9. This address sets the connection path of the interchange. Functionally the circuit resembles that of Embodiment 2, except that the storage area address registers 191-1 to 191-3 hold vector register addresses and there is no word length register or adder.

The interconnection network 3 is the same as in Embodiment 2: the distributor pattern setting circuit 33 sets the connection pattern of distributors 311-0 to 311-3, so that no address decoding or switching is needed.
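The per-word tag field of the vector registers in the component list above can be modelled as follows. This is an illustrative sketch only: the class name, the method names, and the `keep_tag` flag (standing in for the instruction that suppresses the reset for constant data) are assumptions, not names from the patent.

```python
class TaggedVectorRegister:
    """Vector register with one tag bit per word (hypothetical model)."""

    def __init__(self, length):
        self.data = [None] * length
        self.tag = [0] * length

    def write(self, index, value):
        # Loading or receiving a word sets its tag to 1.
        self.data[index] = value
        self.tag[index] = 1

    def read(self, index, keep_tag=False):
        """Return (True, word) if the tag is 1, else (False, None).

        A consuming read resets the tag to 0; keep_tag=True models
        constant data that is referenced repeatedly.
        """
        if self.tag[index] != 1:
            return False, None
        if not keep_tag:
            self.tag[index] = 0
        return True, self.data[index]
```

A reader that gets `(False, None)` simply retries, so the tag acts as a one-word full/empty flag and no separate synchronization variable is needed.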

Next, the operation of the parallel computer of this embodiment is described, using FORTRAN program example 1 of Embodiment 1.

(1) Before vector processing starts, host computer 1 sets the connection pattern of the interconnection network 3. That is, the distributor pattern setting circuit 33 sets the output channel of each distributor 311-0 to 311-3 to 0, because in this example output channel 0 of each distributor is connected to input channel 2 of one of the selectors 32-0 to 32-3.

(2) A receive instruction for vector register 12-1 is issued. That is, host computer 1 stores the address of the receiving vector register 12-1 in storage area address register 191-3 via the write control circuit 190 and, at the same time, assigns one of the data paths of interchange A 16 to vector register 12-1. Specifically (FIG. 9), the vector instruction control circuit 150 in host computer 1 uses signal line l0 to connect selector 160-1 to signal line l19 and sends a start signal over signal line l10 to the RECEIVE instruction generation control circuit 90. It also sends a start signal over signal line l4 to the vector register access control circuit 92. Once started, the RECEIVE instruction generation control circuit 90 waits until the receive vector register address arrives on signal line 93-2. The storage area address registers 191-1 to 191-3 correspond to input channels 0 to 2 of the selectors 32-0 to 32-3 of the interconnection network 3, and input channel 2 of every element processor's selector is connected to the element processor numbered one lower. Vector register 12-1 of every element processor is therefore now ready to receive the vector data sent by the element processor numbered one lower. Its tag field 13-1 is initialized to 0. (Element processor 0 alone issues, instead of the receive instruction, an instruction that loads the initial data into vector register 12-1. In this case, tag field 13-1 becomes 1.)

(3) Loading of B(J), J = 1, 100 into vector register 12-2 begins. This uses load pipe 7-1, and tag field 13-2 is set to 1 word by word. Specifically, signal line l0 connects selector 160-2 to signal line l16; signal line l5 carries the start signal, element count, and data width to the request generation control circuit 70-1; and signal line l6 carries the start address and increment of B(J) to the address generation control circuit 71-1. In addition, signal line l1 carries a start signal and the address of vector register 12-2 to the vector register access control circuit 78-1, which can then control writes to vector register 12-2. Each address produced by the address generation control circuit 71-1 is held in address register 72-1, passes through the priority control circuit 73 into address register 74, and is used to read local memory 23. After a fixed number of cycles, the priority control circuit 73 sends selection information to selector 76 and a write instruction signal to the vector register access control circuit 78-1, and the data output from local memory 23 is written into vector register 12-2 via selector 160-2. At the same time, 1 is written to tag field 13-2.

(4) At the same time, a vector add instruction is issued: the contents of vector registers 12-1 and 12-2 are added, and output to vector registers 12-3 and 12-4 begins. In every element processor other than element processor 0, tag field 13-1 of vector register 12-1 is 0, so the computation cannot start immediately. Element processor 0, however, can begin computing, and it writes its results to vector registers 12-3 and 12-4 via interchange B 17. The tag fields 13-3 and 13-4 of each word that has been output become 1.

(5) A send instruction for vector register 12-3 is issued. This creates a data path on interchange A 16 from vector register 12-3 to SEND pipe 8, and SEND pipe 8 sends the words whose tag field 13-3 is 1 out to the interconnection network 3. Specifically, the vector instruction control circuit 150 uses signal line l0 to connect selector 160-3 to signal line l18, sends a start signal over signal line l9 to the SEND instruction generation control circuit 80, and sends a start signal and the address of vector register 12-3 to the vector register access control circuit 83. Data read from vector register 12-3 under control of the vector register access control circuit 83 is output via selector 160-3 onto signal line l18 and held in the data register. If the tag field at the head of the word is 1, the SEND instruction generation control circuit 80 sends the next read instruction signal to the vector register access control circuit 83, and the next word of vector register 12-3 is read. The data just read is output, without its tag, on the data line, and a send signal is output on signal line l11. If the tag field is 0, the SEND instruction generation control circuit 80 does not send the next read instruction signal, so the same word is read repeatedly, and no send signal is output.

In the interconnection network 3, output channel 0 of each distributor 311-0 to 311-3 is connected to input channel 2 of the selector 32-0 to 32-3 of the element processor numbered one higher, so the data sent by element processor 0 is delivered to element processor 1.

(6) A store instruction for vector register 12-4 is issued. Vector register 12-4 holds the same contents as vector register 12-3. With this instruction, storage into the processor's own memory proceeds independently of transmission, using the other pipe, store pipe 7-2. Specifically, the vector instruction control circuit 150 uses signal line l0 to connect selector 160-4 to signal line l17, sends the start signal, element count, and data width over signal line l7 to the request generation control circuit 70-2, and sends the start address of A(I, J) over signal line l8 to the address generation circuit 71-2. It also sends a start signal and the address of vector register 12-4 to the vector register access control circuit 78-2. As with a load, the addresses of A(I, J) are sent to local memory in sequence, and the data read from vector register 12-4 is written to local memory 23 via selector 160-4, signal line l17, and data register 77-2.

(7) Data sent from the element processor numbered one lower is passed from input channel 2 of the corresponding selector of the interconnection network 3 to RECEIVE pipe 9.

At the same time, the selector's input channel number 2 is fed to the decoder 192; as a result the selector 193 selects storage area address register 191-3, and its contents (the address of the receiving vector register 12-1) are sent to interchange A 16 as the address of the vector register that is to hold the received data delivered by RECEIVE pipe 9. That is, the address of vector register 12-1 on signal line 93-2 is passed to the RECEIVE instruction generation control circuit 90 and is then input, together with a write instruction signal, to the vector register access control circuit 92. On this input, the vector register access control circuit 92 writes the data on signal line 93-1 and the tag field value 1 generated by the RECEIVE instruction generation control circuit 90 into vector register 12-1 via data register 91, signal line l19, and selector 160-1.

The received data is thus stored in vector register 12-1, and its tag field 13-1 is set to 1 for each word written. Element processor 1 and the processors after it use these values for their vector processing.

In this way, the element processors communicate efficiently, without transmitting address information and without decoding or switching, and can perform vector operations that span element processors.
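Steps (1) to (7) amount to chaining a vector addition across element processors, with the per-word tag as the only synchronization. The cycle-by-cycle simulation below is a simplified sketch under several assumptions: B(J) is taken as already loaded in every processor, the processors form a line rather than a ring, each unit handles at most one word per cycle, and all names are hypothetical.

```python
def chained_vector_add(initial, b, n_processors):
    """Simulate A_p[j] = A_(p-1)[j] + b[j] flowing through n_processors."""
    n = len(b)
    recv = [[None] * n for _ in range(n_processors)]   # receiving register of each processor
    recv_tag = [[0] * n for _ in range(n_processors)]
    out = [[None] * n for _ in range(n_processors)]    # result register of each processor
    out_tag = [[0] * n for _ in range(n_processors)]
    add_pos = [0] * n_processors    # next word the adder will consume
    send_pos = [0] * n_processors   # next word the SEND pipe will transmit

    # Processor 0 loads its initial data instead of receiving it.
    recv[0] = list(initial)
    recv_tag[0] = [1] * n

    cycles = 0
    while any(pos < n for pos in add_pos):
        cycles += 1
        for p in range(n_processors):
            # SEND pipe: forward a word only when its tag is 1, else retry later.
            j = send_pos[p]
            if p + 1 < n_processors and j < n and out_tag[p][j] == 1:
                recv[p + 1][j] = out[p][j]
                recv_tag[p + 1][j] = 1      # receiver's tag is set on write
                send_pos[p] += 1
            # Adder: consume an input word only when its tag is 1.
            j = add_pos[p]
            if j < n and recv_tag[p][j] == 1:
                out[p][j] = recv[p][j] + b[j]
                out_tag[p][j] = 1
                recv_tag[p][j] = 0          # consuming read resets the tag
                add_pos[p] += 1
    return out, cycles
```

In this model each processor starts work as soon as the first word from its predecessor arrives, so the processors overlap in time instead of running one after another.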

〔Effect of the invention〕

In the present invention, the synchronization needed between one iterative loop and the next iterative loop that depends on it is achieved at high speed by the broadcast means and the hardware full-synchronization means. To satisfy the data dependencies within an iterative loop when the element processors are scalar processors, the synchronizing element processors need only issue a send-control-information-and-add instruction and a check-and-busy-wait instruction; unlike locking and unlocking shared memory, this does not obstruct other processing. Moreover, the exclusive addition or subtraction is performed by the element processor that holds the synchronization variable or synchronization register, after it has received the control information, so the network is not occupied unnecessarily to the detriment of other processing. When the element processors are vector processors, vector registers can be linked across element processors before vector processing begins, so that vector processing of iterative loops with data dependencies can run in parallel. Finally, the network preset device makes it unnecessary to transmit destination information and eliminates destination decoding and switch changeover during communication, which speeds up communication.

[Brief Description of the Drawings]

FIG. 1 is an overall block diagram of the first embodiment of the present invention. FIG. 2 is a conceptual diagram of the parallel computer common to the embodiments of the present invention. FIG. 3 is a block diagram of a distributor in the interconnection network. FIG. 4 is a block diagram of a selector in the interconnection network. FIG. 5 shows an example of the ROM that expresses the selection logic of the selector. FIG. 6 is an overall block diagram of the second embodiment of the present invention. FIG. 7 illustrates the contents of the transmitted information in the second embodiment. FIG. 8 is an overall block diagram of the third embodiment of the present invention. FIG. 9 is a detailed block diagram of the vector processing unit of the third embodiment of the present invention.

Claims

[Scope of Claims]
1. A parallel computer comprising: means for writing information from a host computer to the same address in the storage devices of all element processors at once; means for detecting the completion of processing of all element processors; an interconnection network for sending and receiving information between arbitrary element processors; and data flow synchronization means comprising synchronization variables or registers, provided in each element processor to synchronize writing to and reading from the storage device when information is sent and received, and an exclusive addition/subtraction circuit for them.
2. A data flow synchronization device between vector registers, for use between element processors having vector calculation units, comprising: means for establishing a path for sending data directly from the vector register of one element processor to the vector registers of one or more other element processors; vector registers having a tag field provided for each word, which allows data to be written to the vector register when its value is 0 and allows data to be read from the vector register when its value is 1; and means for manipulating the value of the tag field.
3. A network preset device comprising: a network connection pattern setting circuit that sets the connection pattern of the network interconnecting the element processors before the network is used; and a storage address generation circuit that converts a source element processor number into the address of the vector register, or of the storage area in the storage device, that stores the data sent from that processor.
4. A parallel computer using the vector register data flow synchronization device according to claim 2 as data flow synchronization means.
5. A parallel computer using the network preset device according to claim 3.
6. A parallel computer using the network preset device according to claim 3 as an interconnection network.
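The per-word tag rule of claim 2 (a word may be written when its tag is 0 and read when its tag is 1) can be sketched in software as follows. The class `TaggedVectorRegister` and its methods are hypothetical names. The claim only states that means exist for manipulating the tag, so setting the tag to 1 on a write and clearing it to 0 on a read is an assumption of this sketch.

```python
class TaggedVectorRegister:
    """Hypothetical model of a vector register with one tag bit per word."""

    def __init__(self, length):
        self.words = [0] * length
        self.tags = [0] * length  # 0: word may be written, 1: word may be read

    def try_write(self, index, value):
        # A write is accepted only while the tag is 0.
        if self.tags[index] != 0:
            return False
        self.words[index] = value
        self.tags[index] = 1      # assumption: a write marks the word readable
        return True

    def try_read(self, index):
        # A read is accepted only while the tag is 1; otherwise return None.
        if self.tags[index] != 1:
            return None
        self.tags[index] = 0      # assumption: a read frees the word for reuse
        return self.words[index]


# Usage: a sender on another element processor fills word 0, the local
# vector unit then consumes it.
reg = TaggedVectorRegister(4)
early = reg.try_read(0)       # nothing has arrived yet, so the read is refused
accepted = reg.try_write(0, 7)
value = reg.try_read(0)
```

Because the check is made word by word, a consumer in this model can start on early elements while later ones are still in flight, which is one way to read the element-wise synchronization the claim describes.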