JPH06309285A - Communication processing circuit for parallel computer

Info

Publication number: JPH06309285A
Application number: JP5099901A
Authority: JP
Inventors: Shinichi Ichikawa; 眞一市川
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1993-04-27
Filing date: 1993-04-27
Publication date: 1994-11-04
Anticipated expiration: 2020-08-17
Also published as: JP3684579B2

Abstract

PURPOSE: To reduce the overhead that inter-processor-element communication imposes on a processor element, by providing a dedicated communication processing unit that selectively executes predetermined communication processing independently of the operation of the calculation processing unit. CONSTITUTION: The circuit consists of a calculation processing unit 10 and a communication processing unit 11. As a hardware mechanism dedicated to global processing, the communication processing unit 11 has registers 110-113 that store the logical address (1) of the processor element 1, the total number (2) of processor elements, local data (5) and (6), and communication data (3) received from a communication network 3; buffers 116 and 117; arithmetic units 114 and 115; a control sequencer 118 that performs communication control; and a communication network interface 119. Once the circuit has been activated by a software instruction executed by the calculation processing unit 10, the global communication processing is carried out by this hardware mechanism independently of the calculation processing unit 10.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a communication processing method in a parallel computer with distributed memory, and more particularly to a communication processing circuit used for global processing that refers to the data on all of the processor elements (PE1, ...) constituting the parallel computer.

[0002] In design work across every industrial and technology-development field, it has become important to predict the characteristics and performance of products by numerical simulation, for example by solving partial differential equations or analyzing structures, rather than by experiment. Such simulation demands faster computers every year, and parallel computers built from many central processing units (CPUs) are beginning to be considered as a way to meet the growing demand for computing power.

[0003] When such a numerical simulation is processed in parallel on a distributed-memory parallel computer, the data arrays are divided by the host among the processor elements (PE1, PE2, ...), and the data are then updated.

[0004] The communication required for such parallel processing is of two kinds: local processing, in which a processor element receives copies of the data it must reference from the processor elements (PE1, PE2, ...) that hold data within the range of interaction in the numerical model; and global processing, in which an operation is performed with reference to the data on all processor elements (PE1, PE2, ...).

[0005] In such global processing, the processing time grows as the number of processor elements (PE1, PE2, ...) increases, and the communication time included in it can no longer be ignored. Moreover, although global processing merely receives data from another processor element, performs a predetermined operation, and sends the result on to a different processor element, every reception of data raises an interrupt to the software (that is, the application) in the processor element. The application is disturbed, and the processing capability of the parallel computer as a whole falls. A communication processing circuit that has little effect on the applications running in the processor element is therefore required.

[0006]

[Prior Art] FIG. 3 illustrates a conventional communication processing method in a parallel computer. FIG. 3(a) shows a configuration example of a distributed-memory parallel computer, and FIG. 3(b) shows an example format of the data exchanged between processor elements.

[0007] In the data format shown in FIG. 3(b), the leading header consists of control information such as the address (SA) of the destination processor element (PE1, ...) 1, whether an interrupt (INT) is to be raised at the destination processor element (PE1, ...) 1, and the class of the data (a class identifying whether the application must process the data with priority).

[0008] When an application on the sending processor element (PEn) 1, for example a communication library, sends data to the processor element (PE1) 1, it prepares the data in the main storage device 12, specifies the transfer conditions, and activates, for example, the direct memory access mechanism (DMA) 110 in the communication processing unit 11.

[0009] Under the specified conditions, the direct memory access mechanism (DMA) 110 reads data of the specified length from the specified address in the main storage device 12 of the processor element (PEn) 1 and transfers it to the processor element (PE1) 1 via the communication network 3.

[0010] The communication processing unit 11 of the processor element (PE1) 1 examines the header of the incoming communication data. If the data is addressed to its own processor element (PE1) 1 and the interrupt flag (INT) is "1", it raises an interrupt to the operating system (OS) running on the main unit (calculation processing unit) 10.
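
The conventional receive decision in paragraphs [0007] and [0010] can be sketched as follows. This is only an illustration: the class `Header`, its field names, and the function `raises_os_interrupt` are hypothetical, and the text does not give field widths or encodings for the FIG. 3(b) format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Header:
    """Hypothetical model of the FIG. 3(b) header; field widths are not given in the text."""
    dest_address: int   # SA: address of the destination processor element
    interrupt: bool     # INT: raise an interrupt at the destination?
    data_class: int     # class: must the application handle this data with priority?


def raises_os_interrupt(header: Header, own_address: int) -> bool:
    """Conventional receive path: interrupt the OS only if the data is
    addressed to this processor element and the INT flag is set."""
    return header.dest_address == own_address and header.interrupt


# A message for PE1 with INT = 1, checked at PE1 and at PE2.
msg = Header(dest_address=1, interrupt=True, data_class=0)
print(raises_os_interrupt(msg, own_address=1))
print(raises_os_interrupt(msg, own_address=2))
```

In this model every such message costs the destination an interrupt plus a header transfer to the main unit, which is the overhead the later paragraphs set out to remove.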

[0011] On accepting the interrupt, the operating system (OS) running on the main unit 10 fetches the header of the communication data, checks the destination, and then, referring to the data length and other fields, activates the direct memory access mechanism (DMA) 110.

[0012] The activated direct memory access mechanism (DMA) 110 reads the data portion of the communication data from the communication network 3 under the specified transfer conditions and transfers it to the main storage device 12.

[0013] Thus, in conventional communication processing, every data transfer from a processor element (PEi) 1 interrupts the operating system (OS) running on the main unit 10 of the destination processor element (PEj) 1, and the communication data must be transferred from the communication processing unit 11 to the main unit 10, at the very least so that the header can be read.

[0014]

[Problems to Be Solved by the Invention] In current parallel computers, the throughput of communication is far lower than that of memory access to the main storage device 12. To benefit from parallel processing, the time spent on communication and on communication-related processing must therefore be made as short as possible.

[0015] Examples of global processing that requires communication when the numerical simulation is processed in parallel are:
a) maximum/minimum search (magnitude comparison)
b) global logical operations (logical OR, exclusive OR, and so on)
c) summation (floating-point and integer addition)
d) sharing of all the data divided among the processing elements (concatenation)
When such compound communication processing is performed on a conventional processor element (PEi) 1 consisting of a calculation processing unit 10 and a communication processing unit 11, using a message-passing mechanism such as the direct memory access mechanism (DMA) 110 and the interrupt mechanism described above, more time is spent inside the processor element (PEi) 1 on the software that starts the communication and on receiving and sending communication data such as the header than on the communication itself.
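
As a reference for what operations a) to d) are expected to produce, the sketch below computes each one sequentially over one value per processor element. The sample data and variable names are illustrative assumptions, and nothing here models the communication itself.

```python
from functools import reduce
from operator import add, or_, xor

# One local value per processor element (illustrative data only).
local_values = [3, 9, 4, 1]
local_flags = [0b0001, 0b0100, 0b0001, 0b1000]
local_chunks = [[1, 2], [3], [4, 5], [6]]

global_max = reduce(max, local_values)      # a) maximum search
global_min = reduce(min, local_values)      # a) minimum search
global_or = reduce(or_, local_flags)        # b) logical OR
global_xor = reduce(xor, local_flags)       # b) exclusive OR
global_sum = reduce(add, local_values)      # c) summation
global_all = reduce(add, local_chunks)      # d) concatenation in PE order

print(global_max, global_min, global_or, global_xor, global_sum, global_all)
```

Each of these is a pairwise operation applied repeatedly, which is what allows it to be spread over a tree of processor elements.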

[0016] In the global processing described above, the work is fixed: apply one operation to the received data and immediately send the result back out to the communication network 3. Using a general-purpose message-passing communication mechanism for it nevertheless tends to incur overhead, such as redundant destination checks inside the communication software (the communication library mentioned above) and data movement between the main unit 10 and the communication processing unit 11. Furthermore, in the binary tree algorithm (see FIG. 2, described later), which is the most effective method for this kind of global processing, arithmetic is needed on only some of the processor elements (PEi) 1, not on all of them. In numerical simulations that perform these operations frequently, the other processor elements (PEj), which do no arithmetic, are left waiting, utilization falls, and the benefit of parallel processing is hard to obtain.
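
To make the utilization point concrete, the sketch below counts how many processor elements perform an operation in each round of a pairwise tree reduction. It assumes a standard pairing in which an unpaired element simply carries over to the next round; the function name `operating_elements_per_round` is hypothetical.

```python
from typing import List


def operating_elements_per_round(n_pe: int) -> List[int]:
    """Number of processor elements that perform an operation in each round
    of a pairwise (binary tree) reduction over n_pe elements."""
    counts = []
    remaining = n_pe                        # elements still holding a partial result
    while remaining > 1:
        counts.append(remaining // 2)       # one operation per pair
        remaining = (remaining + 1) // 2    # an unpaired element carries over
    return counts


print(operating_elements_per_round(8))
```

With eight processor elements, four operate in the first round, two in the second, and one in the third, so most elements are idle for most of the reduction.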

[0017] In view of the above drawbacks of the prior art, an object of the present invention is to provide a communication processing method that can perform at high speed the compound communication processing that is necessary for parallel processing such as numerical simulation on a parallel computer, but that reduces the effectiveness of that parallel processing.

[0018]

[Means for Solving the Problems] FIG. 1 schematically shows an embodiment of the present invention, and FIG. 2 illustrates global processing by a binary tree. The above problems are solved by a communication processing method in a parallel computer configured as follows.

[0019] (1) A communication processing circuit in a processor element 1 of a parallel computer in which a plurality of processor elements 1, each having distributed memory (main storage device) 12, are connected via a communication network 3. Each processor element 1 is provided, separately from the calculation processing unit 10, with a communication processing unit 11 comprising: registers 110, 111, 112, 120, and 113 that store the logical address of the processor element, the total number of processor elements, local data from the calculation processing unit 10, and communication data received from the communication network 3; buffers 116 and 117; arithmetic units 114 and 115; a control sequencer 118 that performs communication control; and a communication network interface 119. After the logical address of the processor element and the number of processor elements have been set in the registers 110 and 111 by an instruction from software executed on the calculation processing unit 10, the communication processing (data transmission, or data reception, an operation, or transmission of the operation result) determined by the set logical address, the number of processor elements, and the sequence number of the control sequencer 118, which indicates which round of communication is in progress, is executed selectively and independently of the operation of the calculation processing unit 10.

[0020] (2) The communication processing is configured to perform a global operation by a binary tree procedure.

[0021]

[Operation] As described above, examples of global processing that requires communication between processor elements (PEi), as needed when, for example, a numerical simulation is processed in parallel on a distributed-memory parallel computer, include a) maximum/minimum search (magnitude comparison), b) global logical operations (logical OR, exclusive OR, and so on), c) summation (floating-point and integer addition), and d) concatenation (joining) of data sequences. The binary tree algorithm shown in FIG. 2 is known as the most effective communication method for such global processing.

[0022] As FIG. 2 shows, in communication processing by a binary tree the structure of the tree is determined by the number of processor elements (PE1, PE2, ...) constituting the parallel computer. In the configuration example of FIG. 2, an odd-numbered processor element (PE1, PE3, PE5, ...) 1 follows a fixed pattern: it either receives communication data from the communication network 3 and executes a predetermined operation (denoted OPR), or additionally transfers the operation result to the processor element (PEj) 1 whose number is lower by one, two, or four. Which of these it does is determined by which round of communication in the binary tree is in progress.

[0023] For example, the processor element (PE1) 1 only receives data and repeats the predetermined operation. The processor elements (PE3, PE7, ...) 1 perform the operation in the first round of communication and then simply send the result to another processor element (PE1) 1. The processor elements (PE5, ...) 1 perform only the operation in the first round and, in the second round, send the result to another processor element (PE1) 1. In this way the content of the communication processing is fixed by the processor element address (number) of the processor element (PEi) and by the round of communication, which is given by the control sequencer number.

[0024] An even-numbered processor element (PE2, PE4, ...) 1 only sends the data it holds (that is, the local data distributed from the host) to another processor element (PE1, PE3, ...) 1.

[0025] Focusing on this point, the present invention provides in each processor element 1, separately from the calculation processing unit 10 that forms its main unit, a communication processing unit 11 comprising: registers 110, 111, 112, 120, and 113 that store the logical address of the processor element, the total number of processor elements, the local data held by the processor element, and communication data received from the communication network 3; buffers 116 and 117; arithmetic units 114 and 115; a control sequencer 118 that performs communication control; and a communication network interface (119). After the logical address of the processor element and the number of processor elements have been set in the registers 110 and 111 by an instruction from software executed on the calculation processing unit 10, the communication processing (data transmission, or data reception, an operation, or transmission of the operation result) determined by the set logical address, the number of processor elements, and the sequence number of the control sequencer 118, which indicates which round of communication is in progress, is executed selectively and independently of the operation of the calculation processing unit 10.

[0026] Accordingly, the overhead of executing general-purpose message-passing communication software many times, as in the prior art, can be reduced. Because the binary tree algorithm, which is a fixed procedure, is executed by a simple hardware mechanism, the operations can be carried out without contending with memory accesses and input/output processing in the calculation processing unit that forms the main unit of the processor element (PEi). And because communication data on the communication network need not be moved into the calculation processing unit of each processor element (PEi), the communication processing is faster. As a result, the time for compound communication processing is shortened and the utilization of all processor elements (PEi) is improved.

[0027]

[Embodiment] An embodiment of the present invention is described in detail below with reference to the drawings. FIG. 1, referred to above, schematically shows an embodiment of the present invention, and FIG. 2 illustrates global processing by a binary tree.

[0028] The means necessary for carrying out the present invention is as follows. Each processor element 1 is provided, separately from the calculation processing unit 10 that forms its main unit, with a communication processing unit 11 comprising: registers 110, 111, 112, 120, and 113 that store the logical address of the processor element, the total number of processor elements, local data, and communication data received from the communication network; buffers 116 and 117; arithmetic units 114 and 115; a control sequencer 118 that performs communication control; and a communication network interface (119). After the logical address of the processor element and the number of processor elements have been set in the registers 110 and 111 by an instruction from software executed on the calculation processing unit 10, the communication processing (data transmission, or data reception, an operation, or transmission of the operation result) determined by the set logical address, the number of processor elements, and the sequence number of the control sequencer 118, which indicates which round of communication is in progress, is executed selectively and independently of the operation of the calculation processing unit 10. The same reference numerals denote the same objects throughout the drawings.

[0029] The configuration and operation of the communication processing circuit in the distributed-memory parallel computer of the present invention are described below with reference to FIGS. 1 and 2. As shown in FIG. 1, each processor element (PE1, PE2, ...) 1 of the distributed-memory parallel computer consists of a calculation processing unit 10 and a communication processing unit 11. As a hardware mechanism dedicated to global processing, the communication processing unit 11 is provided with registers 110, 111, 112, 120, and 113 that store the logical address of the processor element, the total number of processor elements, local data, and communication data received from the communication network 3; buffers 116 and 117; arithmetic units 114 and 115; a control sequencer 118 that performs communication control; and a communication network interface 119. Once activated by a software instruction executed on the calculation processing unit 10, the global communication processing is carried out entirely in hardware, independently of the calculation processing unit 10, by hardware mechanisms 1 and 2 described below.

[0030] "Hardware mechanism 1": As the mechanism that determines communication partners according to the binary tree and controls sending and receiving, the following are provided in the communication processing unit 11: a register (R1) 110 in which the logical address of the processor element is set; a register (R2) 111 in which the total number of processor elements, which governs the fixed pattern of binary tree communication, is set; a register (R3) 112 and a buffer (BUF) 116 that hold data from the main storage device 12, that is, the local data; a register (R5) 120 that holds the length of the local data; a register (R4) 113 and a buffer (BUF) 117 that store communication data from the communication network 3; and a control sequencer 118 that indicates which round of the binary tree communication the current communication is, that is, the sequence number, and outputs the corresponding control signals.

[0031] The control sequencer 118 controls the communication processing on the basis of the processor element address in the register (R1) 110 and the total number of processor elements set in the register (R2) 111.

[0032] FIG. 2 shows how communication partners are determined by the binary tree. As FIG. 2 shows, in communication processing by a binary tree an odd-numbered processor element (PE1, PE3, PE5, ...) 1, for example, follows a fixed pattern: it either receives communication data from the communication network 3 and executes a predetermined operation (denoted OPR), or additionally transfers the operation result to the processor element (PEj) 1 whose number is lower by one, two, or four. Which of these it does is determined by which round of communication in the binary tree is in progress, that is, by the sequence number.

[0033] For example, the processor element (PE1) 1 only receives data and repeats the predetermined operation. The processor elements (PE3, PE7, ...) 1 perform the operation in the first round of communication and then simply send the result to another processor element (PE1) 1. The processor elements (PE5, ...) 1 perform only the operation in the first round and, in the second round, send the result to another processor element (PE1) 1. In this way the content of the communication processing is fixed by the processor element address (number) of the processor element (PEi) and by the sequence number, which indicates the round of communication.

[0034] An even-numbered processor element (PE2, PE4, ...) 1 only sends the data it holds (that is, the local data distributed from the host) to another processor element (PE1, PE3, ...) 1.

[0035] In the present invention, therefore, the form of communication processing that a processor element itself performs is determined by three values: its own processor element address (processor element number) set in the register (R1) 110; the total number of processor elements set in the register (R2) 111, which determines the overall structure of the binary tree; and the sequence number of the control sequencer 118, which indicates which round of the binary tree communication is in progress.
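
One way to picture this decision is the sketch below, which maps the three inputs to an action. It is an assumption-laden illustration, not the sequencer logic of the patent: the text does not spell that logic out, so the code uses a standard pairwise tree that reproduces the examples given (PE2 sends to PE1, PE3 sends to PE1, PE7 sends to PE5, PE5 sends to PE1). The function name `action` and the action labels are hypothetical, and a send is counted here as a round of its own, whereas the text folds it into the preceding round.

```python
from typing import Optional, Tuple


def action(address: int, n_pe: int, seq: int) -> Tuple[str, Optional[int]]:
    """Return (kind, partner_address) for one processor element in one round.

    address: 1-based processor element number (the value held in R1)
    n_pe:    total number of processor elements (the value held in R2)
    seq:     1-based round number (the sequencer's sequence number)
    """
    rank = address - 1              # 0-based position in the tree
    stride = 1 << (seq - 1)         # distance to the partner in this round
    if rank % (2 * stride) == 0:
        if rank + stride < n_pe:
            return ("receive_and_operate", address + stride)
        return ("idle", None)       # no partner exists in this round
    if rank % (2 * stride) == stride:
        return ("send", address - stride)
    return ("idle", None)           # this element has already sent its data


# Schedule of PE5 in an 8-element machine, rounds 1 to 3.
for seq in (1, 2, 3):
    print(seq, action(5, 8, seq))
```

Because the action depends only on values already held in R1, R2, and the sequencer, no software on the calculation processing unit has to be consulted during the reduction.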

[0036] "Hardware mechanism 2": The communication processing unit 11 is provided with a floating-point adder (FLOAT) 114 and an integer adder (INT) 115, which operate under the control of hardware mechanism 1, and with the register (R3) 112, register (R4) 113, register (R5) 120, and buffers (BUF) 116 and 117 described above, which are accessed during the operations.

[0037] The register (R4) 113 and the buffer (BUF) 117 receive communication data directly from the communication network 3, and also send directly to the communication network 3 either the local data distributed from the host or the result data produced by the floating-point adder (FLOAT) 114 or the integer adder (INT) 115 in the processor element's own communication processing unit 11. These hardware mechanisms perform global processing by operating as in the following example.

[0038] "Stage 1": "Hardware mechanism 1" starts operating on an instruction from software executed by the calculation processing unit 10, for example the communication library mentioned above. At this time, it receives the logical address of the processor element from the software and sets it in the register (R1) 110. It also receives from the software the local data that is the target of the global processing: for global processing a), b), and c), the data is received into the local-data register (R3) 112; for global processing d), it is received into the buffer (BUF) 116. In addition, for global processing d), the length of the local data is received into the data-length register (R5) 120.
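As an illustration only, the Stage 1 setup can be modelled in software as loading a small record of registers. The class name `CommUnit`, its field names, and the use of the letters "a" to "d" as operation codes are assumptions made for this sketch; the description itself specifies hardware registers, not a software interface.

```python
from dataclasses import dataclass, field


@dataclass
class CommUnit:
    """Hypothetical software model of communication processing unit 11."""
    r1_logical_address: int = 0                      # register (R1) 110
    r3_local_data: float = 0.0                       # register (R3) 112
    r5_data_length: int = 0                          # register (R5) 120
    buf_local: list = field(default_factory=list)    # buffer (BUF) 116

    def stage1(self, logical_address, op, local_data):
        """Load the unit as Stage 1 describes; `op` is "a", "b", "c" or "d"."""
        self.r1_logical_address = logical_address
        if op in ("a", "b", "c"):
            # Scalar operations: local data goes to register R3.
            self.r3_local_data = local_data
        elif op == "d":
            # Concatenation: local data goes to BUF 116, its length to R5.
            self.buf_local = list(local_data)
            self.r5_data_length = len(self.buf_local)
        else:
            raise ValueError(f"unknown global processing kind: {op!r}")


unit = CommUnit()
unit.stage1(logical_address=3, op="d", local_data=[10, 20, 30])
```

After this call the model holds the address in `r1_logical_address`, the three data items in `buf_local`, and their count in `r5_data_length`.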

[0039] "Stage 2": The processor element (PEi) 1 in charge of the operation receives communication data from another processor element (PEj) 1 according to the binary tree algorithm and performs the following processing.

[0040] 1) For global processing a), b), and c) {magnitude comparison, logical OR, exclusive OR, logical operations, etc.}, the respective operation is performed between the local data in the register (R3) 112 and the communication data transferred from the communication network 3 and stored in the register (R4) 113. According to the binary tree algorithm, the result is either stored in the local-data register (R3) 112 or sent to another processor element (PEj) 1.

[0041] 2) For global processing d) {concatenation, that is, joining of data}, the communication data transferred from the network 3 and stored in the buffer (BUF) 117 is appended to the end of the local data in the buffer (BUF) 116, with reference to the data-length register (R5) 120, and the combined data length is written to the data-length register (R5) 120.
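The two cases of Stage 2 can be sketched as follows. This is a minimal software analogy, not the hardware described: the operation names in `REDUCE_OPS`, the function names, and the use of Python lists for the buffers are assumptions introduced for illustration, and the length of the received data is simply taken from the received list.

```python
import operator

# Hypothetical mapping from an operation name to a two-input function.
# The names are illustrative; the text only lists the kinds of operation.
REDUCE_OPS = {
    "max": max,
    "min": min,
    "or": operator.or_,
    "xor": operator.xor,
}


def combine_scalar(op_name, r3_local, r4_received):
    """Case 1): combine local data (R3) with received data (R4)."""
    return REDUCE_OPS[op_name](r3_local, r4_received)


def combine_concat(buf116, r5_length, buf117):
    """Case 2): append received data (BUF 117) after the first
    r5_length items of the local buffer (BUF 116).

    Returns the new buffer contents and the new length for R5.
    """
    joined = list(buf116[:r5_length]) + list(buf117)
    return joined, len(joined)


result = combine_scalar("xor", 0b1100, 0b1010)
joined, new_len = combine_concat([1, 2, 3], 3, [4, 5])
```

In case 2) the length register plays the role of a write pointer: it marks where the local data ends, and it is updated so that the next round appends after the data just received.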

[0042] The processing in "Stages 1 and 2" above is repeated until the binary tree converges.

"Stage 3": The processor element 1 at the top of the binary tree (for example, PE1 in the binary tree configuration shown in FIG. 1) broadcasts the final result to all other processor elements and prepares for the next processing.
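The repeat-until-convergence loop and the final broadcast can be simulated in a few lines. The sketch below assumes 0-based processor element indices with index 0 at the top of the tree, and a pairing in which each receiver combines with the element `stride` positions away; the text does not fix these details, so they are illustrative choices, and the function name is hypothetical.

```python
def tree_reduce_and_broadcast(values, combine):
    """Simulate the global processing over len(values) processor elements.

    values[i] is the local data of the PE with 0-based index i.
    Returns the per-PE data after the final broadcast and the round count.
    """
    data = list(values)
    n = len(data)
    rounds = 0
    stride = 1
    while stride < n:  # repeat until the binary tree converges
        for receiver in range(0, n, 2 * stride):
            sender = receiver + stride
            if sender < n:
                # The receiver combines its local data with the sender's data.
                data[receiver] = combine(data[receiver], data[sender])
        stride *= 2
        rounds += 1
    # Stage 3: the PE at the top of the tree (index 0) broadcasts the result.
    return [data[0]] * n, rounds


final, rounds = tree_reduce_and_broadcast([3, 9, 4, 7, 1], max)
```

With five elements the tree converges in three rounds, which reflects the general property that a binary tree over n elements needs about log2(n) communication rounds rather than n - 1.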

[0043] As described above, the present invention is characterized in that, in a communication processing circuit of a parallel computer, each processor element is provided, separately from the calculation processing unit, with a communication processing unit consisting of registers and buffers that store the logical address of the processor element, the number of all processor elements, local data, and communication data; various arithmetic units; and a control sequencer that performs communication control. After the logical address of the processor element and the number of processor elements are set in the registers on an instruction from software executed by the calculation processing unit, global communication processing by the binary tree method (data transmission, or data reception, computation, or transmission of the computation result) is executed selectively and independently of the operation of the calculation processing unit (main unit). Which of these actions is taken is determined by the set logical address, the number of processor elements, and the sequence number of the control sequencer, which indicates which round of communication is in progress.
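One way to see how three values can select the action is the following sketch, which derives a send, receive, or idle role from the logical address, the number of processor elements, and the sequence number. The specific rule used here (0-based addresses, 0-based sequence number, stride of 2 to the power of the sequence number) is an assumption chosen for illustration; the description states only that these three values determine the processing.

```python
def role(logical_address, num_pes, seq):
    """Return ("send", peer), ("receive", peer) or ("idle", None).

    Assumes 0-based logical addresses and a 0-based sequence number
    that counts communication rounds; both conventions are illustrative.
    """
    stride = 1 << seq
    position = logical_address % (2 * stride)
    if position == stride:
        # This PE passes its partial result up the tree.
        return ("send", logical_address - stride)
    if position == 0 and logical_address + stride < num_pes:
        # This PE receives a partial result and combines it locally.
        return ("receive", logical_address + stride)
    return ("idle", None)


# Roles of four PEs over the two rounds needed for the tree to converge.
schedule = [[role(a, 4, s) for a in range(4)] for s in range(2)]
```

Because every element can compute its own role locally from values already held in its registers, no software coordination is needed between rounds, which is consistent with the stated aim of running independently of the calculation processing unit.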

[0044]

[Effects of the Invention] As described in detail above, the communication processing circuit of the parallel computer according to the present invention avoids the overhead of executing general-purpose message-passing communication software many times. Moreover, because the binary tree algorithm, which is a fixed-form process, is executed by a hardware mechanism, the arithmetic processing can be controlled without contending with other input/output or memory accesses of the processor elements (PE1, PE2, ...). In addition, communication data from the communication network need not be moved to the calculation processing unit (main unit) of the processor element, so processing is faster. As a result, the time required for composite communication processing is shortened and the utilization of all processor elements is improved.

[Brief description of drawings]

FIG. 1 is a diagram schematically showing an embodiment of the present invention.

FIG. 2 is a diagram illustrating global processing using a binary tree.

FIG. 3 is a diagram illustrating a conventional communication processing method in a parallel computer.

[Explanation of symbols]

1 Processor element (PE1, PE2, ...)
10 Calculation processing unit (main unit)
11 Communication processing unit
110 Register (R1)
111 Register (R2)
112 Register (R3)
113 Register (R4)
114 Floating-point adder (FLOAT)
115 Integer adder (INT)
116, 117 Buffer (BUF)
118 Control sequencer
119 Communication network interface
120 Register (R5)
12 Distributed memory (main memory)
3 Communication network
Processor element address (processor element number)
Number of processor elements
Communication data
Sequence number
Local data
Local data (data length)

Claims

[Claims]

1. A communication processing circuit in a processor element (1) of a parallel computer in which a plurality of processor elements (1), each having a distributed memory (12), are connected via a communication network (3), wherein each processor element (1) is provided, separately from a calculation processing unit (10), with a communication processing unit (11) comprising: registers (110, 111, 112, 120, 113) and buffers (116, 117) for storing the logical address of the processor element, the number of all processor elements, local data from the calculation processing unit (10), and communication data received from the communication network (3); various arithmetic units (114, 115); a control sequencer (118) for communication control; and a communication network interface (119); and wherein, after the logical address of the processor element and the number of processor elements are set in the registers (110, 111) according to an instruction from software executed by the calculation processing unit (10), communication processing (data transmission, or data reception, computation, or transmission of a computation result) determined by the set logical address, the number of processor elements, and the sequence number of the control sequencer (118), which indicates which round of communication is being performed, is selectively executed independently of the operation of the calculation processing unit (10).

2. The communication processing circuit in a parallel computer according to claim 1, wherein, as the communication processing, a global operation is performed by a binary tree procedure.