JPH11110362A

JPH11110362A - Method for communicating data between computers

Info

Publication number: JPH11110362A
Application number: JP9268364A
Authority: JP
Inventors: Tsuneyuki Imaki; 常之今木; Akihiko Sakaguchi; 明彦坂口; Nobutoshi Sagawa; 暢俊佐川; Shunji Takubo; 俊二田窪; Masatada Takasugi; 昌督高杉; Kimihide Kureya; 公英呉屋
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1997-10-01
Filing date: 1997-10-01
Publication date: 1999-04-23

Abstract

PROBLEM TO BE SOLVED: To reduce the number of communications and the communication processing time while keeping the required resources to a minimum, by performing communication in an order such that no computer is left waiting for communication in any phase. SOLUTION: In phase 0, each computer takes as its communication partner the computer whose rank number equals N-1 (5 in the illustrated example) minus its own rank number. From phase 1 onward, each computer takes as its partner the computer whose rank number is one greater than that of its partner in the preceding phase; when the incremented value reaches N (6 in the example), it wraps around to 0. This is repeated up to phase N-1, so communication is performed in N phases in total, from phase 0 to phase N-1. In every phase, if computer a has computer b as its partner, computer b also has computer a as its partner. The communication processing of all computers can therefore be executed simultaneously, and the two data transmissions of each pair can be overlapped.

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

[Technical Field of the Invention]

The present invention relates to a method of data communication among a plurality of computers connected by a network, including parallel computers, and in particular to a method of speeding up collective communication in the Message Passing Interface (hereinafter, MPI), a standard interface for data communication between computers.

【０００２】[0002]

[Prior art]

(Communication method of parallel computers) A distributed-memory parallel computer connects computers called nodes through a high-speed network and carries out computation in parallel while the nodes exchange data over that network. High-speed networks come in various forms. In one of them, a specific memory area is allocated for high-speed communication, and the sending node writes data directly into the receiving node's high-speed communication memory, which reduces OS overhead on the receiving node and achieves fast communication. In this case, allocating the data that must be sent and received to the high-speed communication area, and all other data to the normal memory area, makes it possible to use the high-speed network effectively.

【０００３】 When the high-speed network supports bidirectional communication, communication performance can be raised further by overlapping the sending and receiving of data in operations where two nodes exchange data with each other.

【０００４】 (Communication using a message-passing library) Various programming models exist for parallel processing on such a distributed-memory parallel computer, and the most widely used is the message-passing model. In the message-passing model, each computer computes independently of the others and, when necessary, exchanges data explicitly by calling communication functions that send and receive messages. More recently, to make programs portable and to allow data communication between different machine types, the MPI Forum has established a message-passing function specification called MPI (Message Passing Interface), which unifies the functions and interfaces of communication functions. To identify communication partners unambiguously, MPI assigns each computer an identification number called a rank number, and each communication function uses this rank number to specify the partner. Below, the computer assigned rank number n is written as computer n. In addition to point-to-point communication between a pair of computers, MPI defines collective communication, in which data is transferred among all computers in a predetermined pattern. Collective communication includes Broadcast, which transfers data held by one computer to all other computers; Scatter, which divides the data of one computer and distributes the pieces to the other computers; Gather, which collects data distributed across all computers into one or all computers; Alltoall, which divides the data held by each computer and distributes the pieces to all computers; and Reduce, which combines data held by all computers and transfers the result to one or all computers.

【０００５】 Of these, the communication operations to which the present invention relates are described below: Gather/Scatter communication, Alltoall communication, and, among reduce operations, all-reduce communication in particular, which distributes the result to all computers.

【０００６】 (Gather/Scatter communication pattern) Gather/Scatter communication is described first. As shown in FIG. 8, Gather/Scatter communication either collects data distributed across a plurality of computers into a computer called the root to form a single block of data (Gather), or divides a single block of data held by one computer and distributes it to a plurality of computers (Scatter).

【０００７】 The usual processing method for Gather/Scatter communication is explained using Gather as the example. For simplicity, the number of computers is four and the root computer is computer 0.

【０００８】 FIG. 9 shows the communication pattern of Gather communication. First, the root computer 801 copies its own send data 810 into the receive buffer 820. Meanwhile, the other computers 802 to 804 issue data send functions addressed to the root computer 801. Next, the root computer 801 issues a receive function for computer 1 (802) and receives data 811 into the receive buffer 820. The root computer 801 does the same for every other computer and thereby collects the data in the receive buffer 820.

【０００９】 With this processing, the root computer completes the Gather operation with (number of computers - 1) communications and one memory copy of its own data. This method is called the sequential processing method.
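The sequential processing method can be illustrated with a small single-process simulation. This is only a sketch: the function name `sequential_gather` and the use of a Python list in place of real send and receive calls are assumptions made for illustration, not part of the patent.

```python
def sequential_gather(send_data, root=0):
    """Simulate a sequential Gather on one machine.

    send_data[rank] is the block held by each computer. The root copies
    its own block locally and then receives one block from every other
    computer in rank order.
    """
    n = len(send_data)
    recv_buffer = [None] * n
    recv_buffer[root] = send_data[root]  # the single local memory copy
    receives = 0
    for rank in range(n):
        if rank == root:
            continue
        # Stands in for one receive call issued by the root.
        recv_buffer[rank] = send_data[rank]
        receives += 1
    return recv_buffer, receives


buf, receives = sequential_gather(["d0", "d1", "d2", "d3"])
print(buf, receives)
```

With four computers the root performs three receives and one local copy, which matches the count given in the text.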

【００１０】 (Faster Gather/Scatter communication) A binary tree can be used to reduce the number of sends and receives. FIG. 10 shows how Gather communication is implemented with a binary tree, for a parallel program on four computers with computer 0 as the root. In the binary tree method, in addition to the buffers 810 to 813 holding the send data and the buffer 820 holding the received data, work buffers 1030 to 1033 are prepared, one per computer. The work buffer capacity actually needed differs from computer to computer, but computing it in advance for each computer takes time, so every computer allocates a work buffer of the maximum required size, that is, the amount of data the root computer finally receives. In the first phase, the computers 801 to 804 copy the send data in the send data buffers 810 to 813 into the work buffers 1030 to 1033. All subsequent communication operates on the data in these work buffers. In the second phase, computer 1 (802) sends its data to computer 0 (801), and computer 3 (804) sends its data to computer 2 (803). In the third phase, computer 0 (801) and computer 2 (803) communicate: computer 2 (803) sends the data of computer 2 (803) and computer 3 (804), which it now holds, to computer 0 (801). All the data has then been collected in the work buffer 1030 of computer 0 (801), the root. Finally, the root computer 801 copies the data in the work buffer 1030 into the receive buffer 820, and the Gather operation is complete. When Gather is performed by n computers, this method needs only log(n) phases and finishes in less time than the first method. However, because a work buffer area must be prepared on every computer, it consumes extra memory when the data volume is large. Copying data to and from the work buffers at the start and end of the operation also adds overhead.
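The binary tree Gather can be sketched as a single-process simulation. The function name `binary_tree_gather`, the use of dictionaries as work buffers, and the fixed root (computer 0) are illustrative assumptions. The returned schedule lists only the communication phases; the initial copy into the work buffers and the final copy into the receive buffer are modeled as plain assignments.

```python
def binary_tree_gather(send_data):
    """Simulate a binary tree Gather with computer 0 as the root."""
    n = len(send_data)
    # Initial copy: every computer copies its send data to a work buffer.
    work = [{rank: send_data[rank]} for rank in range(n)]
    schedule = []
    step = 1
    while step < n:
        transfers = []
        for dst in range(0, n, 2 * step):
            src = dst + step
            if src < n:
                # src sends everything it has collected so far to dst.
                work[dst].update(work[src])
                transfers.append((src, dst))
        schedule.append(transfers)
        step *= 2
    # Final copy: the root moves its work buffer into the receive buffer.
    recv_buffer = [work[0][rank] for rank in range(n)]
    return recv_buffer, schedule


recv, schedule = binary_tree_gather(["d0", "d1", "d2", "d3"])
for phase, transfers in enumerate(schedule):
    print(phase, transfers)
```

For four computers the schedule has two communication phases: computers 1 and 3 send to computers 0 and 2, and then computer 2 sends to computer 0, as in the description of FIG. 10.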

【００１１】 (Alltoall communication pattern) Alltoall communication is described next. Alltoall performs the Gather or Scatter communication described above for every computer: each computer divides the data it holds and distributes the pieces to the other computers.

【００１２】 In Alltoall, each computer sends data to all computers, including itself. An example in which four computers perform Alltoall communication is described with reference to FIG. 4. Reference numerals 440 to 443 denote computers, and 400 to 433 denote data transmissions between computers. The computers 440, 441, 442, and 443 are assigned the numbers 0, 1, 2, and 3, respectively, as identifiers. MPI calls these rank numbers (below, the computer assigned rank number n is written as computer n). Transmission 4ab denotes the data transmission from computer a to computer b. FIG. 4(A) shows computer 440 sending data to all computers, including itself. In Alltoall every computer performs the same transmission, giving the pattern of FIG. 4(B). When bidirectional communication is available between computers, the two transmissions 4ab and 4ba can be executed so that they overlap.

【００１３】 (Implementation of Alltoall with non-blocking communication) Alltoall is generally implemented with non-blocking communication. In non-blocking communication, a send (or receive) request is issued unilaterally to the partner, and the actual transfer takes place once the partner is ready to receive (or send). In the non-blocking implementation of Alltoall, each computer first issues send and receive requests to all computers at once, and the actual transfers are then carried out one after another as each computer becomes able to service a request. Because of this behavior, the order in which the individual transmissions are executed is indeterminate.

【００１４】 (Simultaneous communication among all computers) Japanese Unexamined Patent Application Publication No. H8-263349 describes a method by which each computer sends data to every computer other than itself, for the case where the computers in a parallel computer can all communicate one-to-one with one another. Adding a step in which each computer sends data to itself makes the method equivalent in function to MPI Alltoall. Unlike the Alltoall implementation based on MPI non-blocking communication, this method has each computer fix the order of its communication partners and communicate simultaneously with the others phase by phase. As a result, the communication processing of the computers proceeds smoothly and in step.

【００１５】 The method is as follows. N computers communicate in 2^n phases, where 2^(n-1) < N ≤ 2^n. In each phase, a computer's communication partner is the computer whose rank number equals the bitwise exclusive OR of the binary representation of the computer's own rank number (0 to N-1) and the binary representation of the phase number (0 to 2^n - 1).

【００１６】 An example in which eight (= 2^3) computers communicate is described with reference to FIG. 3(A). In phase 2, for instance, the computers (0,1,2,3,4,5,6,7) have the computers (2,3,0,1,6,7,4,5), respectively, as their communication partners (each computer's partner is the exclusive OR of its own rank number and the phase number 2). Here, if computer a has computer b as its partner, computer b also has computer a as its partner. The communication processing of all computers can therefore be executed simultaneously, and the two data transmissions of each pair can be overlapped. The same holds in the other phases. Because each computer sends data to all computers, including itself, over the eight phases from phase 0 to phase 7, the function is equivalent to Alltoall.
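The exclusive-OR partner rule of this prior-art method is easy to check numerically. The following sketch uses hypothetical helper names (`xor_partner`, `xor_schedule`) and simply tabulates the partner of every computer in every phase.

```python
def xor_partner(rank, phase):
    """Partner of `rank` in `phase` under the exclusive-OR rule."""
    return rank ^ phase


def xor_schedule(n_computers):
    """Partner table indexed as schedule[phase][rank].

    The number of phases is n_computers rounded up to a power of two.
    """
    phases = 1
    while phases < n_computers:
        phases *= 2
    return [[xor_partner(rank, phase) for rank in range(n_computers)]
            for phase in range(phases)]


schedule = xor_schedule(8)
print(schedule[2])  # partners in phase 2: [2, 3, 0, 1, 6, 7, 4, 5]
```

Because applying the exclusive OR with the same phase number twice returns the original rank, the pairing is symmetric in every phase, which is the property the text relies on to overlap the two transmissions of a pair.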

【００１７】 Next, an example in which six computers (2^2 < 6 ≤ 2^3) communicate is described with reference to FIG. 3(B). In phase 2, for instance, the computers (0,1,2,3,4,5) have the computers (2,3,0,1,6,7), respectively, as their communication partners. The computers (6,7), which are the partners of the computers (4,5), do not actually exist. The computers (4,5) therefore wait until the communication of the other computers (0,1,2,3) has finished. As a result, the communication takes 8 (= 2^3) phases.
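The waiting that arises when the number of computers is not a power of two can be made concrete by listing, for each phase, the computers whose exclusive-OR partner does not exist. The function name `idle_ranks_per_phase` is an assumption made for this sketch.

```python
def idle_ranks_per_phase(n_computers):
    """For each phase, list the ranks whose XOR partner is >= n_computers."""
    phases = 1
    while phases < n_computers:
        phases *= 2
    idle = []
    for phase in range(phases):
        idle.append([rank for rank in range(n_computers)
                     if (rank ^ phase) >= n_computers])
    return idle


idle = idle_ranks_per_phase(6)
for phase, ranks in enumerate(idle):
    print(phase, ranks)
```

For six computers the table has eight phases, and in phase 2 computers 4 and 5 have no partner, in agreement with the description of FIG. 3(B).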

【００１８】 In the original patent a computer does not count itself as a communication partner, so no communication takes place in phase 0 and the communication takes 2^n - 1 phases.

【００１９】 (Overview of all-reduce communication) Reduce communication is described next. Reduce communication combines the data held by all nodes using some operation (sum, maximum, logical OR, and so on) and either collects the result at a specific root computer or distributes it to all computers. A reduce operation in which every computer ends up holding the result is called all-reduce communication.

【００２０】 The present invention assumes computers that have two kinds of memory: high-speed memory for communication and normal memory.

【００２１】 A buffer specified in a send or receive function is allocated in the high-speed communication memory, whereas a buffer allocated dynamically at run time is allocated in normal memory. In all-reduce, a data buffer holding the original data and an operation buffer for storing the result are specified. Both of these buffers are allocated in high-speed memory.

【００２２】 If all-reduce over n processes is implemented without allocating an internal buffer, it becomes a sequential method. Specifically, the n-th process first sends its data to the (n-1)-th process, which receives the data in its operation buffer. The (n-1)-th process stores the result of combining the received data with its own original data in the operation buffer and sends that buffer's contents to the (n-2)-th process. This phase is then repeated in turn, each process receiving data from another process, combining it with its own data, and sending the result to the next process, until the result is stored in the operation buffer of the first process. Broadcasting that result to all processes completes the communication. This method requires n phases for the computation and log(n) phases for the broadcast.
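The chain structure of the sequential method can be illustrated with a single-process simulation. This is a sketch under stated assumptions: the function name `sequential_allreduce` is hypothetical, processes are numbered from 0 (so the last process is rank n-1), and the final broadcast is modeled as a plain list fill rather than a tree of messages.

```python
def sequential_allreduce(values, op=lambda a, b: a + b):
    """Simulate the sequential all-reduce along a chain of processes.

    values[rank] is each process's original data. The last process sends
    first; every receiver combines the incoming data with its own data
    and forwards the result toward rank 0.
    """
    n = len(values)
    acc = values[n - 1]
    chain = []  # (sender, receiver) pairs in the order they occur
    for rank in range(n - 2, -1, -1):
        acc = op(values[rank], acc)  # held in the operation buffer
        chain.append((rank + 1, rank))
    # Broadcast of the result from rank 0, modeled as a list fill.
    return [acc] * n, chain


result, chain = sequential_allreduce([1, 2, 3, 4, 5])
print(result, chain)
```

Only one value is in flight at any time, which is why no buffer beyond the operation buffer is needed, at the cost of a chain whose length grows with the number of processes.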

【００２３】 If use of an internal buffer is permitted, on the other hand, the communication can be completed in log(n) phases with the Hyper Cube method. In the Hyper Cube method, the number of processes n is rounded to a power of two, and in each phase each of that power-of-two number of processes communicates with the computer whose rank number equals the exclusive OR of its own rank number and 2 raised to the power (communication count - 1). The case of five processes is described concretely with reference to FIG. 6. The processes are given rank numbers 0 to 4 for identification. Each process has a buffer used for communication (601, 602, 603, 604, 605). Communication normally uses two buffers, one for sending and one for receiving, but a single buffer usable for both is assumed here. First, to round the number of processes to 4, a power of two, the rank 4 process sends its data to rank 0 (606). Process 0 then holds the result of the operation on the data of ranks 0 and 4 (the figure uses SUM, denoting a sum, as the example). From this point on, communication takes place among ranks 0 to 3. In the first communication, the exclusive OR is taken with 2^0: ranks 0 and 1, and ranks 2 and 3, communicate (607, 608), so that processes 0 and 1 hold SUM(0,1,4) and processes 2 and 3 hold SUM(2,3). In the second, the exclusive OR is taken with 2^1: ranks 0 and 2, and ranks 1 and 3, communicate (609, 610). Finally, rank 0 sends the result to rank 4, and the communication is complete for all processes (611). With Hyper Cube, each process needs, besides the data buffer and the operation buffer, a receive buffer for receiving data from other processes. Each process first receives data into the receive buffer, performs the operation between the received data and its original data (from the second operation onward, the result of the previous operations), and stores the result in the operation buffer. In the next phase it sends the contents of the operation buffer to its partner process. Because the receive buffer is not specified in the function arguments but allocated dynamically at run time, it is allocated in normal memory. The Hyper Cube method can therefore communicate in fewer phases, but because it uses normal memory, each communication takes longer than in the sequential method.
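The Hyper Cube exchange, including the extra step for a process count that is not a power of two, can be sketched as follows. The function name `hypercube_allreduce` is hypothetical, and the rule that an extra rank r folds into rank r - core is an assumption that generalizes the single case shown in FIG. 6 (rank 4 sending to rank 0). The sketch assumes a commutative, associative operation.

```python
def hypercube_allreduce(values, op=lambda a, b: a + b):
    """Simulate a Hyper Cube all-reduce; returns (results, exchange_rounds)."""
    n = len(values)
    core = 1
    while core * 2 <= n:
        core *= 2  # largest power of two not exceeding n
    acc = list(values)
    # Fold-in: each extra rank sends its data to rank (extra - core).
    for extra in range(core, n):
        acc[extra - core] = op(acc[extra - core], acc[extra])
    # Pairwise exchanges: partner = rank XOR 2**(round - 1).
    rounds = 0
    distance = 1
    while distance < core:
        received = [acc[rank ^ distance] for rank in range(core)]
        for rank in range(core):
            acc[rank] = op(acc[rank], received[rank])
        distance *= 2
        rounds += 1
    # Fold-out: the extra ranks receive the final result.
    for extra in range(core, n):
        acc[extra] = acc[extra - core]
    return acc, rounds


result, rounds = hypercube_allreduce([1, 2, 3, 4, 5])
print(result, rounds)
```

For five processes the sketch performs one fold-in transfer, two exchange rounds among ranks 0 to 3, and one fold-out transfer, mirroring steps 606 to 611 in the text. The `received` list plays the role of the separate receive buffer that the method requires.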

【００２４】[0024]

[Problems to be Solved by the Invention]

These MPI collective communication operations have the following problems.

【００２５】 (1) In all-to-all communication, in which every computer performs Gather/Scatter communication as the root, simply performing Gather/Scatter communication once per computer makes the processing time grow in proportion to the square of the number of computers N. Even with the method described above as prior art, the number of communication phases required is proportional to N rounded up to a power of two, so unnecessary communication waiting time arises.

【００２６】 (2) In Gather/Scatter communication, with the sequential processing method, in which the root computer exchanges data with every other computer, the processing time grows in proportion to the number of computers. With the faster binary tree method, every computer must allocate a work buffer, and this burden grows as the data volume increases. In addition, the memory copies to and from the work buffer add overhead, so performance does not improve when the data is large or the number of computers is small.

【００２７】 (3) In all-reduce communication, n + log(n) communications are required when no internal buffer is used, and the high-speed communication memory cannot be used when an internal buffer is used.

【００２８】 An object of the present invention is to provide a method of data communication between computers that reduces the number of communications and the communication processing time of these MPI collective communication operations while keeping required resources such as buffers to a minimum.

【００２９】[0029]

[Means for Solving the Problems]

(1) In Alltoall communication, communication is performed in an order such that no computer is left waiting for communication in any phase.

【００３０】 (2) For Gather/Scatter communication, both the sequential processing method and the binary tree method are provided, and a hybrid method that switches to whichever is more efficient under the given conditions exploits the strengths of each.

【００３１】 (3) For all-reduce communication, a method is provided that uses an internal buffer so that a communication scheme with fewer communications can be employed, and that also uses the high-speed communication memory to reduce the time taken by each communication.

【００３２】[0032]

BEST MODE FOR CARRYING OUT THE INVENTION

(Alltoall communication method) The Alltoall communication method of this embodiment is described with reference to FIG. 1. FIG. 1(A) shows an example in which an even number of computers (six) perform Alltoall communication, and FIG. 1(B) shows an example in which an odd number of computers (five) perform Alltoall communication.

【００３３】 The communication method of this embodiment is described first (it is the same for FIG. 1(A) and FIG. 1(B)). In phase 0, each computer takes as its communication partner the computer whose rank number equals N-1 (5 in the example of FIG. 1(A)) minus its own rank number. From phase 1 onward, each computer takes as its partner the computer whose rank number is one greater than that of its partner in the previous phase. If the incremented value reaches N (6 in the example of FIG. 1(A)), it wraps around to 0. This is repeated up to phase N-1. Alltoall in this embodiment therefore takes N phases in total, phase 0 to phase N-1.
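The partner rule of this embodiment reduces to a closed form: starting from N-1-rank and adding 1 modulo N in each phase gives (N - 1 - rank + phase) mod N. The sketch below tabulates it; the function names `partner` and `alltoall_schedule` are illustrative and do not come from the patent.

```python
def partner(rank, phase, n):
    """Partner of `rank` in `phase` for n computers.

    Phase 0 pairs rank with n-1-rank; each later phase adds 1 modulo n.
    """
    return (n - 1 - rank + phase) % n


def alltoall_schedule(n):
    """Partner table indexed as schedule[phase][rank], phases 0..n-1."""
    return [[partner(rank, phase, n) for rank in range(n)]
            for phase in range(n)]


schedule = alltoall_schedule(6)
for phase, partners in enumerate(schedule):
    print(phase, partners)
```

Applying the rule twice gives n-1-(n-1-rank+phase)+phase = rank, so the pairing is symmetric in every phase for any n, and each computer meets every computer (itself included) exactly once over the n phases. No phase pairs a computer with a nonexistent partner, so the schedule needs n phases rather than n rounded up to a power of two.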

【００３４】 Next, with reference to FIG. 1(A), it is explained how this embodiment determines the communication pairs in each phase when an even number of computers perform Alltoall communication.

【００３５】 In phase 0, the computers (0,1,2,3,4,5) have the computers (5,4,3,2,1,0), respectively, as their communication partners. Because the sequence (5,4,3,2,1,0) is the reverse of the sequence (0,1,2,3,4,5), if computer a has computer b as its partner, computer b also has computer a as its partner.

【００３６】 From phase 1 onward, the sequence of communication partners splits into two parts. In phase 2, for example, the computers (0,1,2,3,4,5) have the computers (1,0,5,4,3,2), respectively, as their partners. Dividing the sequence (1,0,5,4,3,2) into the two parts 101a and 101b enclosed by dotted lines gives (1,0) for 101a and (5,4,3,2) for 101b. These are the reverses of the sequences (0,1) and (2,3,4,5), respectively, so if computer a has computer b as its partner, computer b also has computer a as its partner.

【００３７】 In the next phase, phase 3, the computers (0,1,2,3,4,5) have the computers (2,1,0,5,4,3), respectively, as their partners. Here computer 1 and computer 4 each have themselves as their partner. When a part contains an odd number of entries, as 102a (2,1,0) and 102b (5,4,3) do, the computer in the middle of the part has itself as its partner. (In the figure, the places where a computer is its own partner are circled.)

Next, with reference to FIG. 1(B), it is explained how this embodiment determines the communication pairs in each phase when an odd number of computers perform Alltoall communication.

【００３８】In phase 0, computers (0, 1, 2, 3, 4) take computers (4, 3, 2, 1, 0), respectively, as their communication partners. Because the sequence (4, 3, 2, 1, 0) is the reverse of the sequence (0, 1, 2, 3, 4), whenever computer a takes computer b as its communication partner, computer b also takes computer a as its communication partner. In addition, because the sequence (0, 1, 2, 3, 4) has an odd number of elements, computer 2 in the middle has itself as its communication partner.

【００３９】In phase 1 and later, the sequence of communication partners splits into two parts. In phase 2, for example, computers (0, 1, 2, 3, 4) take computers (1, 0, 4, 3, 2), respectively, as their communication partners. When the sequence (1, 0, 4, 3, 2) is divided into the two parts 103a and 103b enclosed by dotted lines, 103a is (1, 0) and 103b is (4, 3, 2). These are the reverses of the sequences (0, 1) and (2, 3, 4), respectively, so whenever computer a takes computer b as its communication partner, computer b also takes computer a as its communication partner. In addition, because the sequence (2, 3, 4) has an odd number of elements, computer 3 in the middle has itself as its communication partner.

【００４０】The above is the same as in the case where an even number of computers perform Alltoall communication.

【００４１】When the number of computers is odd, exactly one computer in each phase takes itself as its communication partner.
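
The self-partner property can be checked numerically. The sketch below is illustrative only: it assumes the partner rule can be written as (N - 1 - rank + phase) mod N, and `self_partners` is a hypothetical helper name. A computer r is its own partner when 2r ≡ N - 1 + phase (mod N), which has exactly one solution when N is odd.

```python
def partner(rank: int, phase: int, n: int) -> int:
    """Partner of computer `rank` in `phase` among n computers (assumed closed form)."""
    return (n - 1 - rank + phase) % n


def self_partners(phase: int, n: int) -> list[int]:
    """Computers whose communication partner in this phase is themselves."""
    return [r for r in range(n) if partner(r, phase, n) == r]


if __name__ == "__main__":
    # One list per phase; each list has exactly one entry when n is odd.
    print([self_partners(p, 5) for p in range(5)])
    # With an even n, a phase has either no self-partner or two of them.
    print([self_partners(p, 6) for p in range(6)])
```

With five computers, the self-partnered computers in phases 0 to 4 come out as 2, 0, 3, 1, 4. With six computers, phase 3 gives computers 1 and 4, as in the even-numbered example of FIG. 1.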

【００４２】As shown above, with the communication method of this embodiment, in every phase, whenever computer a takes computer b as its communication partner, computer b also takes computer a as its communication partner. Therefore, the communication processing of all computers can be executed simultaneously, and the two data transmissions within each pair can be overlapped. Moreover, no unnecessary communication wait occurs at all.
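
As a rough illustration of why N phases suffice, the following Python sketch simulates the exchange in memory. It is not an MPI implementation: `simulate_alltoall` is a hypothetical name, the partner formula is an assumed closed form of the rule described above, and the "send" is modelled as a simple list assignment.

```python
def partner(rank: int, phase: int, n: int) -> int:
    """Partner of computer `rank` in `phase` among n computers (assumed closed form)."""
    return (n - 1 - rank + phase) % n


def simulate_alltoall(send):
    """send[src][dst] is the block that computer src holds for computer dst.

    Returns recv, where recv[dst][src] is the block dst received from src.
    """
    n = len(send)
    recv = [[None] * n for _ in range(n)]
    for phase in range(n):
        for rank in range(n):
            peer = partner(rank, phase, n)
            # The pairing is mutual, so the iteration for `peer` performs the
            # transfer in the opposite direction within the same phase.
            recv[peer][rank] = send[rank][peer]
    return recv


if __name__ == "__main__":
    n = 6
    blocks = [[f"{s}->{d}" for d in range(n)] for s in range(n)]
    result = simulate_alltoall(blocks)
    print(result[3])  # blocks received by computer 3, indexed by sender
```

Because each computer meets every partner number exactly once over the N phases (including itself), every block is delivered exactly once.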

【００４３】(All-to-all communication with computers other than oneself when the number of computers is even) With reference to the example of FIG. 2, the following shows a method by which, when the number of computers N is even (six in the example of FIG. 2), each computer communicates with every partner other than itself in N-1 phases (five in the example of FIG. 2).

【００４４】First, the communication order of the Alltoall communication method for N-1 (= 5) computers is determined. Since N is even, N-1 is odd. Hence, in each phase, exactly one computer takes itself as its communication partner.

【００４５】An example in which five computers (computer 0 to computer 4) perform Alltoall communication is shown in FIG. 1(B). The communication takes five phases in total, phases 0 to 4. In phases (0, 1, 2, 3, 4), computers (2, 0, 3, 1, 4), respectively, take themselves as their communication partner.

【００４６】If each of computers (0, 1, 2, 3, 4), instead of taking itself as its communication partner, takes the sixth computer, computer 5, as its partner, then it communicates with the five computers other than itself in five phases. Correspondingly, computer 5 communicates with computers (2, 0, 3, 1, 4) in that order, one in each of phases (0, 1, 2, 3, 4).

【００４７】In this way, the six computers, computer 0 to computer 5, can each communicate with all partners other than themselves in five phases.
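
The construction for an even number of computers can be sketched as follows. This is a minimal illustration under stated assumptions: `schedule_without_self` is a hypothetical name, and the underlying partner rule for the N-1 computers is assumed to be (M - 1 - rank + phase) mod M with M = N - 1.

```python
def partner(rank: int, phase: int, n: int) -> int:
    """Partner of computer `rank` in `phase` among n computers (assumed closed form)."""
    return (n - 1 - rank + phase) % n


def schedule_without_self(n: int) -> list[list[int]]:
    """Return sched[phase][rank] for even n: n-1 phases, no self-communication."""
    if n < 2 or n % 2 != 0:
        raise ValueError("n must be a positive even number")
    m = n - 1  # build the odd-sized Alltoall schedule for computers 0..n-2
    sched = []
    for phase in range(m):
        row = [partner(r, phase, m) for r in range(m)]
        # Exactly one computer is paired with itself in an odd-sized phase.
        idle = next(r for r in range(m) if row[r] == r)
        row[idle] = n - 1   # pair it with the extra computer instead
        row.append(idle)    # partner of computer n-1 in this phase
        sched.append(row)
    return sched


if __name__ == "__main__":
    for phase, row in enumerate(schedule_without_self(6)):
        print(phase, row)
```

For six computers, the last column of the schedule (the partners of computer 5) is 2, 0, 3, 1, 4 over phases 0 to 4, matching the order given in the text.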

【００４８】(Gather/Scatter communication method) Next, an embodiment of Gather/Scatter communication processing according to the present invention is described using the flowchart of FIG. 11.

【００４９】When communication processing is performed using MPI functions, each computer performs initialization processing at the beginning of the program, so once that initialization has finished, each computer knows the total number of computers (process 1101).

【００５０】Next, when a Gather/Scatter communication function is called, the total data amount is obtained from its arguments (process 1102) and is examined together with the total number of computers obtained at initialization. If the total data amount is at or below a certain value (process 1103) and the total number of computers is at or above a certain value (process 1104), the data is thereafter communicated using the binary tree method (process 1105). If these conditions are not met, processing is performed using the sequential processing method (process 1106). Finally, termination processing is performed (process 1107) and the program ends.

【００５１】The respective switching points differ depending on the characteristics of the computer on which the method is implemented, namely the amount of installed memory, the performance of one-to-one communication, the memory copy performance, and so on.
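
The switching logic of processes 1103 to 1106 amounts to a two-condition test, sketched below in Python. The function name, the parameter names and the threshold values in the usage example are all hypothetical; the text gives no concrete switching points and states only that they depend on the machine.

```python
def choose_gather_method(total_bytes: int,
                         num_computers: int,
                         max_bytes_for_tree: int,
                         min_computers_for_tree: int) -> str:
    """Pick the Gather/Scatter method from data amount and computer count.

    The two thresholds are machine-dependent tuning values (assumed to be
    supplied by the caller; the source does not specify them).
    """
    small_data = total_bytes <= max_bytes_for_tree            # process 1103
    many_computers = num_computers >= min_computers_for_tree  # process 1104
    if small_data and many_computers:
        return "binary_tree"                                  # process 1105
    return "sequential"                                       # process 1106


if __name__ == "__main__":
    # Example thresholds chosen arbitrarily for illustration.
    print(choose_gather_method(4096, 64, max_bytes_for_tree=65536,
                               min_computers_for_tree=8))
    print(choose_gather_method(1 << 20, 64, max_bytes_for_tree=65536,
                               min_computers_for_tree=8))
```

In practice the thresholds would be measured on the target machine, since they depend on memory size, one-to-one communication performance and memory copy performance.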

【００５２】(Embodiment of all-reduce) Next, the implementation of all-reduce in the present invention is described in detail with reference to the drawings. FIG. 5 shows the overall configuration of the present invention. Reference numerals 501 and 502 denote the memory areas inside the processors; each memory contains high-speed communication memory (503, 504) and normal memory (505, 506). The two buffers specified as function arguments of all-reduce, the data buffer (507, 508) and the operation buffer (509, 510), are allocated in the high-speed communication memory. In addition, the receive buffer (511, 512), which is allocated dynamically at run time, is allocated in normal memory.

【００５３】Conventionally, in each phase of the Hyper Cube method, each process transfers data from its data buffer to the receive buffer of the partner process in the first communication, and from its operation buffer to the receive buffer of the partner process in the second and subsequent communications. In the present invention, the data buffer and the operation buffer are always used for communication, so that a fast all-reduce is achieved by exploiting the high-speed communication memory.

【００５４】The method of realizing the present invention is described with reference to FIG. 7. First, before communication starts, each process copies the data in its data buffer to its receive buffer (713, 714). From then on, the receive buffer is used only to hold the original data. In the first communication, each process sends data from its data buffer to the operation buffer of the partner process (715, 716), performs the operation on the partner's data received in the operation buffer and its own data, and stores the result in the operation buffer. In the second and subsequent communications, each process sends data from its operation buffer to the data buffer of the partner process (717, 718), performs the operation on the accumulated result in the operation buffer and the received data in the data buffer, and stores the result in the operation buffer. This is repeated in turn, and the final result is stored in the operation buffer of each process. After the communication ends, the original data held in the receive buffer is copied back to the data buffer (719, 720), and the communication is complete.
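
The buffer usage can be followed in a small in-memory simulation. The sketch below is illustrative only and makes several assumptions not stated in the text: the process count is a power of two of at least 2, the partner in step k is the rank with bit k flipped (the usual hypercube pairing), and the operation is commutative and associative. The name `simulate_allreduce` and the list-based buffers are hypothetical.

```python
from operator import add


def simulate_allreduce(values, op=add):
    """Simulate the data/operation/receive buffer scheme for p processes.

    Returns (operation_buffers, data_buffers) after the all-reduce.
    """
    p = len(values)
    if p < 2 or p & (p - 1):
        raise ValueError("process count must be a power of two, at least 2")
    data = list(values)   # data buffers (high-speed communication memory)
    saved = list(data)    # receive buffers: only hold the original data
    oper = [None] * p     # operation buffers (high-speed communication memory)
    dist, first = 1, True
    while dist < p:
        if first:
            # Data buffer -> partner's operation buffer, then combine there
            # with the receiving process's own data.
            received = [data[r ^ dist] for r in range(p)]
            oper = [op(received[r], data[r]) for r in range(p)]
            first = False
        else:
            # Operation buffer -> partner's data buffer (overwriting it), then
            # combine with the accumulated result in the operation buffer.
            data = [oper[r ^ dist] for r in range(p)]
            oper = [op(oper[r], data[r]) for r in range(p)]
        dist *= 2
    data = saved          # restore the original data from the receive buffers
    return oper, data


if __name__ == "__main__":
    result, restored = simulate_allreduce([1, 2, 3, 4, 5, 6, 7, 8])
    print(result, restored)
```

The simulation makes visible why the initial copy is needed: from the second communication onward the data buffer is overwritten by incoming partial results, so the original contents must be restored from the receive buffer at the end.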

【００５５】[0055]

[Effects of the Invention] The present invention provides the following effects.

【００５６】(1) All-to-all communication can be realized in a time proportional to the number of computers, without requiring any unnecessary communication wait.

【００５７】(2) Even if the total amount of communication data or the number of computers changes, processing can always be performed with the most suitable method.

【００５８】(3) In the all-reduce function, communication can be performed faster by adopting the Hyper Cube method, which requires few communication rounds, and by using the high-speed communication memory.

[Brief description of the drawings]

FIG. 1 is an explanatory diagram of the Alltoall communication method.

FIG. 2 is an explanatory diagram of the all-to-all communication method with computers other than oneself when the number of computers is even.

FIG. 3 is an explanatory diagram of a conventional method of simultaneous communication among all computers.

FIG. 4 is an explanatory diagram of the Alltoall communication pattern.

FIG. 5 is a configuration diagram of the computers in All-reduce.

FIG. 6 is an explanatory diagram of the data transfer order of the Hyper Cube method.

FIG. 7 is an explanatory diagram of the communication order of All-reduce.

FIG. 8 is an explanatory diagram of the communication pattern of Gather communication.

FIG. 9 is an explanatory diagram of the sequential processing method in Gather communication.

FIG. 10 is an explanatory diagram of the binary tree method in Gather communication.

FIG. 11 is a flowchart showing the switching between the sequential processing method and the binary tree method in Gather communication.

[Explanation of symbols]

101a, 101b, 102a, 102b, 103a, 103b ... sequences of communication partners
440 to 443 ... computers
401 to 433 ... data transmissions
501, 502 ... memory areas
503, 504 ... high-speed communication memory
507, 508 ... data buffers
601 to 605 ... buffers
606 to 611 ... communications
801 to 804 ... computers
810 to 813 ... buffers storing transmission data
820 ... buffer storing received data
1030 to 1033 ... working buffers
1101 to 1107 ... flowchart showing the switching between the sequential processing method and the binary tree method in Gather communication

Continued from the front page
(72) Inventor: Shunji Takubo, c/o Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji-shi, Tokyo
(72) Inventor: Masatada Takasugi, c/o Software Development Division, Hitachi, Ltd., 5030 Totsuka-cho, Totsuka-ku, Yokohama-shi, Kanagawa
(72) Inventor: Kimihide Kureya, c/o Software Development Division, Hitachi, Ltd., 5030 Totsuka-cho, Totsuka-ku, Yokohama-shi, Kanagawa

Claims

[Claims]

1. A data communication method between n element computers in a distributed memory type parallel computer connected by a network, wherein, in each communication step, the element computers form sets of two element computers that communicate with each other, each communication step forms sets different from the sets of the previous communication step, and n communication steps are performed, whereby the data in each element computer is mutually transmitted and received among all the element computers.

2. The data communication method between computers according to claim 1, wherein serial numbers from 0 to n-1 are arbitrarily assigned to said n element computers; in the first communication step, each element computer forms a pair with the element computer whose number is the value obtained by subtracting the serial number given to itself from n-1; and in each subsequent communication step, each element computer forms a pair with the element computer whose number is the previous communication partner's number increased by one, the element computer numbered n-1 being followed by the element computer numbered 0.

3. A data communication method between element computers in a distributed memory type parallel computer having a high-speed network in which a specific memory area is allocated for high-speed communication and a transmitting node writes data directly into the high-speed communication memory of a receiving node, thereby reducing the OS overhead of the receiving node and realizing high-speed communication, wherein one of the following methods is selected using the number of element computers and the data amount as parameters: a binary tree method in which a working memory for all the data is secured in each element computer and the data is collected into half of the element computers at each communication step; and a method in which no working memory is secured and the one element computer that collects the data communicates sequentially with all the element computers; whereby communication is performed so as to optimally collect the data in all the element computers into one element computer.

4. A data communication method between element computers in a distributed memory type parallel computer having a high-speed network in which a specific memory area is allocated for high-speed communication and a transmitting node writes data directly into the high-speed communication memory of a receiving node, thereby reducing the OS overhead of the receiving node and realizing high-speed communication, wherein one of the following methods is selected using the number of element computers and the data amount as parameters: a binary tree method in which a working memory for all the data is secured in each element computer and the data is distributed to twice as many element computers at each communication step; and a method in which no working memory is secured and the one element computer that distributes the data communicates sequentially with all the element computers; whereby communication is performed so as to optimally distribute the data in one element computer to all the element computers.

5. A data communication method between element computers in a distributed memory type parallel computer having a high-speed network in which a specific memory area is allocated for high-speed communication and a transmitting node writes data directly into the high-speed communication memory of a receiving node, thereby reducing the OS overhead of the receiving node and realizing high-speed communication, wherein a working memory for saving the initial data is secured, so that the result of an operation on the data in all the element computers is communicated to all the element computers using only the transmission and reception memory areas allocated in the high-speed communication memory.