JPH0756870A

JPH0756870A - Data mapping method

Info

Publication number: JPH0756870A
Application number: JP5203901A
Authority: JP
Inventors: Sadao Nakamura; 定雄中村
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1993-08-18
Filing date: 1993-08-18
Publication date: 1995-03-03
Anticipated expiration: 2014-04-12
Also published as: JP2880880B2

Abstract

PURPOSE: To provide a data mapping method that enables high-speed communication between processing elements (PEs). CONSTITUTION: In a data mapping method that allocates the data held by virtual PEs to the memories of physical processors by making each coordinate axis of the virtual PEs correspond to a desired coordinate axis of the physical PEs, between physical processors P(p0, ..., pN-1) and virtual processors V(v0, ..., vM-1) that are each numbered by a tuple of coordinate values: when the virtual PE whose coordinate value in direction i is vi is made to correspond to the physical PE whose coordinate value in direction j is pj, the virtual PE whose coordinate value in direction i is vi + 1 is made to correspond to the physical PE whose coordinate value in direction j is pj - 1, pj, or pj + 1; and, as the coordinate value of the virtual PE in direction i is increased one at a time, the sequence formed by the coordinate values in direction j of the corresponding physical PEs contains at least one subsequence that increases by one and at least one subsequence that decreases by one.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a data mapping method between virtual processors and physical processors in a parallel computer.

[0002]

[Prior Art] If a distributed-memory massively parallel computer composed of a large number of high-speed processing units can be built and operated efficiently, a dramatic performance improvement can be expected in scientific and engineering computation. To obtain maximum performance from a distributed-memory parallel computer, the computation must be partitioned while communication between processing elements (hereinafter abbreviated as PEs) is minimized. It is therefore important to choose a method of placing data on the PEs that suits the structure of the program.

[0003] In the field of numerical computation, the Fortran language is widely used, and parallel extensions of Fortran have been proposed for parallel computers in particular. FIG. 9 is an example of a program written in such a parallel Fortran. The program defines two one-dimensional arrays A and B of length 20, cyclically shifts the elements of A one position to the right, and stores the result in B. The FORALL statement on line 70 in the figure denotes parallel execution: the cyclic shift of array A and the storing of the result into array B are executed in parallel for each element. In parallel programming, the processing shown in FIG. 9 and processing similar to it appear frequently, and it is one of the basic processing patterns.
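As a plain illustration of the shift that FIG. 9 is described as performing, the following Python sketch computes B from A sequentially. FIG. 9 is not reproduced in this text, so the indexing convention B(i) = A((i - 1) mod 20) is an assumption based on the description of a one-position shift to the right, and the function name is hypothetical.

```python
def cyclic_shift_right(a):
    """Return a copy of `a` rotated one position to the right.

    Assumed convention: b[i] = a[(i - 1) % n], so the last element
    of `a` wraps around to index 0 of the result.
    """
    n = len(a)
    return [a[(i - 1) % n] for i in range(n)]


A = list(range(20))        # stand-in for the length-20 array A
B = cyclic_shift_right(A)  # each b[i] depends on one a[...] only, so all 20 are independent
```

Because every element of B depends on exactly one element of A, the 20 assignments can in principle run in parallel, which is what the FORALL statement is said to express.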

[0004] To make the parallel execution of the above program concrete, a one-dimensional virtual processor array V(i) (i = 0, 1, ..., 19) of length 20 is defined, and the array data A and B are mapped onto the virtual processor array. Here, a virtual processor is each set of data obtained when the data of a program is partitioned with a notion of communication distance introduced. Accordingly, a reference to data belonging to the same virtual processor requires no communication and is regarded as free of cost, whereas a reference to data belonging to a different virtual processor may require communication and is regarded as costly. In the present case, array data A(i) and B(i) (i = 0, 1, ..., 19) are assigned to virtual processor V(i). For the program of FIG. 9, it is convenient to regard the one-dimensional virtual processors as connected in a ring, that is, as a one-dimensional torus.

[0005] Cyclic partitioning and block partitioning are conventionally known as methods of assigning virtual processors to physical processors. FIG. 10 shows an example in which the 20 virtual processors connected as a one-dimensional torus are cyclically partitioned onto one-dimensional physical processors P'(0) to P'(4) consisting of five PEs. As a result of this partitioning, physical processor P'(0), for example, holds array data A(0), A(5), A(10), A(15) and B(0), B(5), B(10), B(15).
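The cyclic partitioning in this example can be sketched in Python as follows. The rule owner = v mod 5 is inferred from the listed contents of P'(0) and is the usual definition of a cyclic distribution; the function names are hypothetical.

```python
def cyclic_owner(v, num_pe):
    """Physical PE index that holds virtual processor v under cyclic partitioning."""
    return v % num_pe


def owned_by(p, num_virtual, num_pe):
    """Virtual processor indices (and thus indices of A and B) held by physical PE p."""
    return [v for v in range(num_virtual) if cyclic_owner(v, num_pe) == p]


print(owned_by(0, 20, 5))  # prints [0, 5, 10, 15]
```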

[0006] In FIG. 10, there is no problem if the physical processors are actually connected as a one-dimensional torus. If, however, there is no direct link between physical processors P'(4) and P'(0), communication from P'(4) to P'(0) must pass through the intermediate processors P'(3), P'(2), and P'(1). In that case, communication between PEs takes a long time.
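To make the cost of the missing wraparound link concrete, the sketch below counts, for every pair of ring-adjacent virtual processors, how many physical links separate their owners when the five PEs form a line without a P'(4)-P'(0) link. It assumes the cyclic rule owner = v mod 5 and measures distance as the absolute difference of PE indices; the names are hypothetical.

```python
def ring_edge_hops(owner, num_virtual):
    """Hop count on a linear PE array for each ring edge V(v) -> V(v+1 mod M)."""
    return [
        abs(owner((v + 1) % num_virtual) - owner(v))
        for v in range(num_virtual)
    ]


hops = ring_edge_hops(lambda v: v % 5, 20)  # cyclic partitioning onto 5 PEs
worst = max(hops)                           # longest path any neighbor exchange must take
```

Under these assumptions, four of the twenty ring edges cross the full length of the array (4 hops), while the remaining sixteen need a single hop.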

[0007] The parallel Fortran program shown in FIG. 9 has a global name space. That is, arrays A and B are defined over the parallel computer as a whole, and communication between PEs is expressed as data references rather than written explicitly. To execute this program on the individual physical processors, it must be converted into a program that operates on data in each physical processor's local memory, that is, a program with a local name space. In a program with a local name space, communication with other PEs is expressed explicitly using the send function send() and the receive function recv().

[0008] When the program of FIG. 9 is converted according to the data mapping of the cyclic partitioning method, it becomes a program that the physical PEs can execute directly, as shown in FIG. 11. The array assigned to each physical PE is called a local array, in contrast to the original global array. In FIG. 11, the local arrays are denoted a and b. For physical processor P'(0), the local array elements a[0], a[1], a[2], a[3], b[0], b[1], b[2], b[3] correspond to the global array elements A(0), A(5), A(10), A(15), B(0), B(5), B(10), B(15), respectively.

[0009] To run the program of FIG. 11, the program is copied to all physical PEs and execution is started on all of them simultaneously. At the start of execution, each physical PE obtains its processor number at line 71 in the figure and stores it in the variable p. The subsequent processing differs slightly depending on the value of the processor number.

[0010] For example, line 72 in the figure is a data transmission from physical processor P'(p) to P'(p+1), and line 73 is a data reception by physical processor P'(p) from P'(p-1). If the memory bandwidth of the physical processor is sufficiently high, or if data is sent and received directly to and from registers in the processor, transmission and reception can be executed simultaneously. The send function on line 72 can therefore return control without waiting for the transmission to finish, so that execution of the recv function on line 73, the subsequent data reception, can begin. The execution time of this transmission and reception is thus the time T' shown in FIG. 8(a). The interprocessor communication time T' should be made as short as possible, but in the conventional example shown here it cannot be shortened any further.

[0011] The conventional example described above is one-dimensional, but the same holds when it is extended to multiple dimensions. In scientific and engineering computation, particularly in problems that solve partial differential equations on two- or three-dimensional domains, periodic boundary conditions appear frequently, so it is often convenient to assume that interprocessor communication follows a two- or three-dimensional torus connection. A torus connection is also convenient for speeding up global operations such as computing the sum of all elements.

[0012] A well-known type of parallel computer is one in which the PEs are physically connected in a two- or three-dimensional lattice. Such lattice-connected computers are easy to implement and highly extensible. FIG. 12 is an example of a parallel computer with a two-dimensional lattice connection. When torus-connected virtual processors are mapped onto such a parallel computer by the conventional cyclic or block mapping methods, communication arises between P'(0,0) and P'(0,3), which have no direct link, and this communication must pass through P'(0,1) and P'(0,2) on the communication path. In general, communication between PEs takes longer as the distance between the PEs increases. Moreover, a long-distance communication occupies the links along its path and is therefore more likely to make other communications wait. Consequently, when all PEs try to communicate with their neighboring PEs at once along the virtual torus connection, communication performance is limited by the long communication paths.

[0013] To solve this problem, the PEs could be physically connected by a torus network. This works well when the executing program uses the whole of the torus-connected processor array, but that condition is usually not satisfied. One example is the case where multi-user operation is realized by spatially partitioning a parallel computer consisting of many PEs and assigning each resulting partition to a user. FIG. 13 shows an example in which spatially partitioned parallel computers 131 and 132 are assigned to user A and user B, respectively, on a parallel computer with a two-dimensional torus connection. In this case, if P'(0,3) and P'(0,0) of user A use the torus link to communicate, the message passes through the region of the other user B. Besides the problem of a long communication distance, the messages of different users become intermixed, so erroneous message communication by another user may interfere with the operation of one's own correct program. Intermixed messages from different users that affect one another also make programs difficult to debug. If, instead, the communication between P'(0,3) and P'(0,0) is carried out only within the user's own region, the situation is the same as for the lattice network described with FIG. 12, and with the conventional mapping methods communication performance is degraded by the long communication paths.

[0014]

[Problem to Be Solved by the Invention] As described above, the conventional data mapping methods give rise to long communication paths, and these paths degrade communication performance. The present invention has been made in view of this problem, and its object is to provide a data mapping method that enables high-speed communication between processing elements.

[0015]

[Means for Solving the Problem] The present invention is a data mapping method in which, between a plurality of physical processors P(p_0, p_1, ..., p_{N-1}) numbered by tuples of N coordinate values and a plurality of virtual processors V(v_0, v_1, ..., v_{M-1}) numbered by tuples of M coordinate values, the data held by the virtual processors is allocated to the memories of the physical processors by associating each of the M coordinate axes of the virtual processors with a desired one of the N coordinate axes of the physical processors. The method is characterized in that, when the virtual processor whose coordinate value in direction i is v_i is made to correspond to the physical processor whose coordinate value in direction j is p_j, the virtual processor whose coordinate value in direction i is v_i + 1 is made to correspond to the physical processor whose coordinate value in direction j is p_j - 1, p_j, or p_j + 1, and in that, as the coordinate value of the virtual processor in direction i is increased one at a time, the sequence formed by the coordinate values in direction j of the corresponding physical processors contains at least one subsequence that increases by one and at least one subsequence that decreases by one.
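The two conditions stated here can be checked mechanically for one axis. The Python sketch below is one possible reading, not the patent's own procedure: `coords[v]` is the physical coordinate assigned to virtual coordinate v, a single step of +1 (or -1) is taken as the minimal increasing (or decreasing) subsequence, and the wraparound pair (last, first) is also checked because the virtual processors are assumed to form a ring. The function name is hypothetical.

```python
def satisfies_folding_condition(coords):
    """Check the adjacency and up/down-run conditions for one coordinate axis.

    coords[v] is the physical coordinate that virtual coordinate v maps to.
    The wraparound step from the last entry back to the first is included,
    on the assumption that the virtual processors are torus-connected.
    """
    n = len(coords)
    steps = [coords[(v + 1) % n] - coords[v] for v in range(n)]
    neighbors_stay_close = all(s in (-1, 0, 1) for s in steps)
    has_increasing_run = any(s == 1 for s in steps)
    has_decreasing_run = any(s == -1 for s in steps)
    return neighbors_stay_close and has_increasing_run and has_decreasing_run


folded = [0, 1, 2, 3, 4, 4, 3, 2, 1, 0]
cyclic = [v % 5 for v in range(10)]
```

With these inputs, `satisfies_folding_condition(folded)` is true, while the cyclic sequence fails because the step from coordinate 4 back to 0 has magnitude 4.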

[0016]

[Operation] In the present invention, torus-connected virtual processors V(v_0, v_1, ..., v_{M-1}) are assigned to physical processors P(p_0, p_1, ..., p_{N-1}) that are not torus-connected by folding the torus over, so that mutually adjacent virtual processors are assigned to mutually adjacent physical processors or to the same physical processor.

[0017] With the data mapping method of the invention, the long end-to-end wraparound path that would otherwise be needed for the torus connection is therefore eliminated, and in a cyclic shift the rightward and leftward communications can be executed simultaneously, so that communication between PEs can be made faster.

[0018] Furthermore, according to the invention, communication takes place only between adjacent physical processors. When a parallel computer is spatially partitioned and each partition is assigned to a different user, each user can therefore realize an independent torus network that is separated from the other users.

[0019] In addition, when a program in a global name space whose data has been partitioned according to the virtual processors under the data mapping method of the invention is converted into programs with local name spaces that the individual physical processors can execute, new opportunities for program optimization arise that do not exist with conventional block or cyclic partitioning, enabling a further speedup of communication between PEs.

[0020]

[Embodiments] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram of a physical PE 1 according to the invention. As shown in FIG. 1, the physical PE 1 comprises a memory 2, a central processing unit (hereinafter abbreviated as CPU) 4, and two communication ports A and B for communicating with other physical PEs.

[0021] Communication port A consists of an input port 11 and an output port 12; likewise, communication port B consists of an input port 21 and an output port 22. Inside the physical PE 1, the input ports 11 and 21 and the output ports 12 and 22 are connected to registers or high-speed memories 13, 23, 14, and 24, respectively. The PE of this embodiment can therefore operate its four input/output ports 11, 12, 21, and 22 simultaneously to transfer data.

[0022] FIG. 2 is a block diagram of a parallel computer in which physical PEs 1 are connected one-dimensionally. That is, in this embodiment the physical PEs 1 of FIG. 1 are connected in a line as shown in FIG. 2. The input port 11 and output port 12 of communication port A of one physical PE are connected to the output port 22 and input port 21, respectively, of communication port B of the adjacent physical PE. As the figure shows, in this embodiment the processing elements P(0) and P(4) at the two ends are not directly connected. It is therefore easy to join several one-dimensional parallel computers built from the physical PE 1 of FIG. 1 into a larger one-dimensional parallel computer.

[0023] FIG. 3 shows an example in which 20 one-dimensional virtual processors connected as a torus (connected in a loop) are assigned to a one-dimensional physical processor array consisting of five physical PEs according to the data mapping method of the invention.

[0024] As shown in FIG. 3, virtual processor V(0) is first assigned to physical processor P(0). Next, V(1), which is adjacent to virtual processor V(0), is assigned to P(1), which is adjacent to physical processor P(0). Similarly, V(2) to V(4) are assigned to P(2) to P(4), respectively. In the present invention, the next five virtual processors V(5) to V(9) are then assigned to the physical processors P(4) to P(0), respectively, as if folded back. Likewise, V(10) to V(14) are assigned to P(0) to P(4), respectively, and V(15) to V(19) to P(4) to P(0), respectively.
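The assignment just described follows a simple back-and-forth pattern, which the following Python sketch reproduces. The closed-form rule (even-numbered sweeps run left to right, odd-numbered sweeps run right to left) is an inference from the listed assignments, and the function name is hypothetical.

```python
def folded_owner(v, num_pe):
    """Physical PE index for virtual processor v under the folded mapping.

    Virtual processors are laid out in sweeps of length num_pe:
    even sweeps go 0 -> num_pe-1, odd sweeps go num_pe-1 -> 0.
    """
    sweep, offset = divmod(v, num_pe)
    return offset if sweep % 2 == 0 else num_pe - 1 - offset


mapping = [folded_owner(v, 5) for v in range(20)]
# Largest distance between the owners of ring neighbors, including V(19) -> V(0).
max_hop = max(abs(mapping[(v + 1) % 20] - mapping[v]) for v in range(20))
```

For 20 virtual processors on 5 PEs, `max_hop` is 1, so every neighbor exchange on the virtual ring stays within one physical link or within a single PE.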

[0025] In this way, all mutually adjacent virtual processors are assigned to mutually adjacent physical processors or to the same physical processor. All communication between adjacent virtual processors is then carried out by communication between, at most, adjacent physical processors.

[0026] With the data mapping method of this embodiment, the elements a[0], a[1], a[2], a[3], b[0], b[1], b[2], b[3] of the local arrays of physical processor P(0), for example, correspond to the elements A(0), A(10), A(9), A(19), B(0), B(10), B(9), B(19) of the global arrays. The local arrays of the other physical processors likewise correspond to different parts of the global arrays.
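The local ordering given for P(0) lists the elements reached on left-to-right sweeps first and those reached on right-to-left sweeps second. The Python sketch below generalizes that ordering to any PE; only the P(0) case is stated in the text, so the layout for other PEs is an assumption, as is the requirement that the number of virtual processors be a multiple of the number of PEs. The function name is hypothetical.

```python
def local_layout(p, num_virtual, num_pe):
    """Global indices stored in local slots 0, 1, 2, ... of physical PE p.

    Assumed ordering: elements from even (left-to-right) sweeps first,
    then elements from odd (right-to-left) sweeps.
    """
    sweeps = num_virtual // num_pe
    forward = [s * num_pe + p for s in range(0, sweeps, 2)]
    backward = [s * num_pe + (num_pe - 1 - p) for s in range(1, sweeps, 2)]
    return forward + backward


print(local_layout(0, 20, 5))  # prints [0, 10, 9, 19]
```

One consequence of this ordering, under the same assumptions, is that the first half of each local array exchanges data with one neighbor and the second half with the other, which lets each direction be sent as one contiguous message.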

[0027] FIG. 4 shows the parallel Fortran program of FIG. 9 converted into a program for the physical PEs in accordance with the mapping of FIG. 3. To execute the program of FIG. 9, the program of FIG. 4 is copied in advance into the memories of all PEs of FIG. 2, and the program is run on all physical PEs simultaneously.

[0028] First, at the start of execution, each physical PE obtains its own processor number at line 31 of FIG. 4 and stores it in the variable p. The subsequent processing differs slightly depending on the value of the processor number.

[0029] Consider P(1) to P(3), which perform the most communication. Line 32 in the figure is a data transmission from physical processor P(p) to P(p-1), and line 33 is a data transmission from P(p) to P(p+1). The destinations of these two transmissions are the PE to the left and the PE to the right, respectively. Because they require different communication ports, the PE described with FIG. 1 can execute the two transmissions simultaneously. Line 34 in the figure is a data reception by P(p) from P(p+1), and line 35 is a data reception by P(p) from P(p-1). For the same reason, these two receptions can also be executed simultaneously.

[0030] Furthermore, in the PE of the invention described with FIG. 1, the links between PEs are full duplex, so transmission and reception can be executed simultaneously, and the communications on lines 32 to 35 can all be executed at the same time.

[0031] In addition, the size of each transmitted and received message in this embodiment is half that of the conventional example of FIG. 11. As lines 72 and 73 of the conventional program of FIG. 11 show, the conventional method had to send data of length 4 to the PE on the right and receive data of length 4 from the PE on the left. In the program of FIG. 4 for this embodiment, as lines 32 to 35 show, each message is of length 2, half the conventional size; in exchange, data must be sent to both the right and the left PE, and data of length 2 must be received from both the right and the left PE. As already described, however, in this embodiment the transmissions to the right and to the left can be executed simultaneously, and so can the receptions from the right and from the left.
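The message sizes quoted here can be reproduced by counting, for the one-position cyclic shift, how many elements each PE must pass to each other PE under the two mappings. The Python sketch below is a counting model only, not the program of FIG. 4 or FIG. 11; the mapping rules (v mod 5 for cyclic, back-and-forth sweeps for folded) are inferred from the examples in the text, and all names are hypothetical.

```python
from collections import Counter


def cyclic_owner(v, num_pe):
    return v % num_pe


def folded_owner(v, num_pe):
    sweep, offset = divmod(v, num_pe)
    return offset if sweep % 2 == 0 else num_pe - 1 - offset


def shift_traffic(owner, num_virtual, num_pe):
    """Count elements sent from PE src to PE dst when A(v) moves to B(v+1 mod M)."""
    traffic = Counter()
    for v in range(num_virtual):
        src = owner(v, num_pe)
        dst = owner((v + 1) % num_virtual, num_pe)
        if src != dst:  # same-PE moves need no message
            traffic[(src, dst)] += 1
    return traffic


cyclic = shift_traffic(cyclic_owner, 20, 5)
folded = shift_traffic(folded_owner, 20, 5)
```

In this model an interior PE such as P(2) sends 4 elements to its right neighbor under cyclic partitioning, but 2 elements to each of its two neighbors under the folded mapping, and the folded mapping produces no message between non-adjacent PEs.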

[0032] FIG. 5 is a flowchart showing the operation of the program of FIG. 4. The communications on lines 32 to 35 of FIG. 4 correspond to steps 47 to 50 of FIG. 5, respectively. In FIG. 5, the transmission to the left PE shown in step 47 is executed first. Control returns without waiting for step 47 to finish, and the transmission to the right PE shown in step 48 is executed. Similarly, the reception from the left PE shown in step 49 is executed without waiting for step 48 to finish, and the reception from the right PE shown in step 50 is executed without waiting for step 49 to finish. The execution times of the communications in steps 47 to 50 therefore overlap.

[0033] FIG. 8(b) shows the execution times of the communications in steps 47 to 50 of FIG. 5. Their execution times overlap, so they run almost in parallel, and the total communication time is T.

[0034] In this embodiment, the size of each transfer is half that of the conventional example, but in exchange the data must be transferred to both the left PE and the right PE. Because the transfer to the left PE and the transfer to the right PE can be executed simultaneously, however, the overall transfer time is shorter than in the conventional method.

[0035] As a result, the data communication time T of this embodiment is approximately half the data transfer time T' of the conventional example. The processing and flow of operations at P(0) and P(4) in FIG. 4 and FIG. 5 can be readily understood by those skilled in the art from the above description, so a detailed explanation is omitted.

[0036] In the embodiment described above, 20 virtual processors were mapped onto 5 physical processors, so the mapping could be realized in the regular form shown in FIG. 3. The invention is, however, applicable to various numbers of virtual and physical processors. For example, it can also be applied when 17 torus-connected virtual processors are mapped onto 8 physical processors that are not torus-connected. FIG. 6 shows one example of such a mapping. In this example too, adjacent virtual processors are mapped to adjacent physical processors or to the same physical processor.
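FIG. 6 is not reproduced in this text, so the following Python sketch shows only one hypothetical mapping of 17 ring-connected virtual processors onto 8 PEs that keeps neighbors adjacent; it need not coincide with the mapping in the figure. It makes a single sweep up and a single sweep down and absorbs the surplus virtual processors by letting them share the end PEs. The function name and the way the surplus is split are assumptions.

```python
def single_fold(num_virtual, num_pe):
    """One up-sweep and one down-sweep, with surplus entries parked at the ends.

    Hypothetical construction; requires num_virtual >= 2 * num_pe.
    Returns a list where entry v is the PE index assigned to virtual processor v.
    """
    surplus = num_virtual - 2 * num_pe
    if surplus < 0:
        raise ValueError("need at least 2 * num_pe virtual processors")
    at_top = surplus // 2          # extra entries held by the last PE
    at_bottom = surplus - at_top   # extra entries held by PE 0
    up = list(range(num_pe))
    down = list(range(num_pe - 1, -1, -1))
    return up + [num_pe - 1] * at_top + down + [0] * at_bottom


mapping = single_fold(17, 8)
# Owner distance for every ring edge, including the wraparound V(16) -> V(0).
steps = [abs(mapping[(v + 1) % 17] - mapping[v]) for v in range(17)]
```

In this construction PE 0 holds three virtual processors and every other PE holds two, and every entry of `steps` is 0 or 1.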

[0037] The embodiment described above concerns one-dimensional processors, but the invention is not limited to one dimension and can also be applied to multidimensional processor arrays. FIG. 7 is an example of mapping a torus network of two-dimensional virtual processors onto a two-dimensional lattice network of physical processors. For example, virtual processors V(0,0), V(0,5), V(5,0), and V(5,5) are assigned to physical processor P(0,0), and virtual processors V(1,2), V(1,3), V(4,2), and V(4,3) are assigned to physical processor P(1,2). As FIG. 7 shows, here too adjacent virtual processors are mapped to adjacent physical processors or to the same physical processor. In other words, by practicing the invention, virtual processors connected as a two-dimensional torus are mapped onto physical processors connected as a two-dimensional lattice that is not a torus. The same method can also be applied to a three-dimensional torus.
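The listed assignments are consistent with folding each coordinate axis independently. The Python sketch below applies a one-dimensional fold per axis, assuming a 6 x 6 virtual torus and a 3 x 3 physical lattice; those sizes are inferred from the example coordinates rather than stated in the text, and the function names are hypothetical.

```python
def fold(v, num_pe):
    """Fold one coordinate: even sweeps run upward, odd sweeps run downward."""
    sweep, offset = divmod(v, num_pe)
    return offset if sweep % 2 == 0 else num_pe - 1 - offset


def fold_nd(virtual_coord, lattice_shape):
    """Map a virtual torus coordinate to a physical lattice coordinate, axis by axis."""
    return tuple(fold(c, n) for c, n in zip(virtual_coord, lattice_shape))


shape = (3, 3)                     # assumed physical lattice size
corner = fold_nd((5, 5), shape)    # expected to land on P(0,0)
inner = fold_nd((4, 3), shape)     # expected to land on P(1,2)
```

Because torus neighbors differ in exactly one coordinate and each axis is folded with steps of at most one, neighbors end up on the same PE or on lattice-adjacent PEs; the same per-axis construction extends to three dimensions by passing three-element tuples.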

[0038] Thus, according to the invention, a virtual processor array connected as a multidimensional torus can be mapped onto a physical processor array connected as a multidimensional lattice without torus links while preserving adjacency. Communication between PEs that assumes a torus connection can therefore be executed at high speed even though the physical network is a lattice.

[0039]

A lattice network is easy to partition and to extend, and it is a great advantage that a torus network of virtual processors can be mapped onto an arbitrary sublattice of the lattice network.

[0040]

Furthermore, when, for example, the physical processors are spatially partitioned and assigned to a plurality of users, each user can construct a torus network on the sublattice that forms the user's own region without affecting the regions of other users. This makes it possible to isolate users from one another in data communication, and thus to realize a highly reliable programming and program execution environment.
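The isolation between users can be sketched as follows. The region origins, the region size of 4 x 4, and the names `make_region_map`, `region_image`, and `max_neighbor_hops` are all hypothetical, and the per-axis fold is only one mapping that satisfies the adjacency condition; the sketch merely shows that a virtual torus folded onto a sublattice never leaves that sublattice.

```python
from itertools import product


def fold(v, side):
    """Fold a torus coordinate in 0 .. side-1 onto 0 .. side // 2 - 1."""
    return min(v, side - 1 - v)


def make_region_map(origin, size):
    """Return a function mapping a (2w x 2h) virtual torus onto the w x h
    sublattice whose lower corner is `origin` (hypothetical layout)."""
    ox, oy = origin
    w, h = size

    def to_physical(vx, vy):
        return (ox + fold(vx, 2 * w), oy + fold(vy, 2 * h))

    return to_physical


def region_image(origin, size):
    """Set of physical processors actually used by one user's virtual torus."""
    to_physical = make_region_map(origin, size)
    w, h = size
    return {to_physical(vx, vy)
            for vx, vy in product(range(2 * w), range(2 * h))}


def max_neighbor_hops(origin, size):
    """Largest physical distance between torus neighbours in one region."""
    to_physical = make_region_map(origin, size)
    w, h = size
    worst = 0
    for vx, vy in product(range(2 * w), range(2 * h)):
        px, py = to_physical(vx, vy)
        for nx, ny in (((vx + 1) % (2 * w), vy), (vx, (vy + 1) % (2 * h))):
            qx, qy = to_physical(nx, ny)
            worst = max(worst, abs(px - qx) + abs(py - qy))
    return worst


if __name__ == "__main__":
    user_a = region_image((0, 0), (4, 4))  # assumed region of a first user
    user_b = region_image((4, 0), (4, 4))  # assumed region of a second user
    print(user_a.isdisjoint(user_b))
    print(max_neighbor_hops((0, 0), (4, 4)), max_neighbor_hops((4, 0), (4, 4)))
```

Since every neighbour exchange ends on the same or an adjacent processor inside the region, the links it uses also lie inside the region.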

[0041]

In addition, for basic communication patterns that frequently appear in parallel programming, the present invention makes it possible to shorten the data transfer time between PEs compared with the conventional method.

[0042]

[Effects of the Invention] According to the present invention, a multidimensional torus-coupled virtual processor array can be mapped onto a multidimensional lattice-coupled physical processor array that is not torus-coupled, while preserving adjacency.

[0043]

Therefore, according to the present invention, PE-to-PE communication that assumes torus coupling can be executed at high speed even though the physical network is a lattice. The present invention also makes it possible, for basic communication patterns that frequently appear in parallel programming, to shorten the data transfer time between PEs compared with the conventional method. The present invention is not limited to the embodiments described above and can be practiced with various modifications without departing from its gist.

[Brief description of drawings]

FIG. 1 is a configuration diagram of a physical processing element according to an embodiment of the present invention.

FIG. 2 is a configuration diagram of a parallel computer according to an embodiment of the present invention.

FIG. 3 is a diagram showing a data mapping method according to an embodiment of the present invention.

FIG. 4 shows a program for a physical processing element corresponding to the data mapping of FIG. 3.

FIG. 5 is a flowchart of the program of FIG. 4.

FIG. 6 is a diagram for explaining a method of mapping 17 virtual processors onto 8 physical processors by applying the present invention.

FIG. 7 is a diagram for explaining the application of the present invention to two-dimensionally torus-coupled virtual processors.

FIG. 8 is a diagram for comparing data transfer times in an embodiment of the present invention and in a conventional example.

FIG. 9 shows an example of a parallel Fortran program.

FIG. 10 is a diagram showing an example of a conventional data mapping method.

FIG. 11 shows a physical PE program derived from the program of FIG. 9 for the conventional example of FIG. 10.

FIG. 12 is a diagram showing an example of a conventional parallel computer having a simple lattice-coupled network.

FIG. 13 is a diagram showing an example of realizing multi-user operation by spatially partitioning a conventional two-dimensional torus network.

[Explanation of symbols]

1 ... Physical processor
2 ... Memory
4 ... Central processing unit
11, 21 ... Input ports
12, 22 ... Output ports
13, 14, 23, 24 ... Registers or high-speed memory
P(0) to P(4) ... Physical processors
V(0) to V(19) ... Virtual processors

Claims

1. A data mapping method for allocating data held by virtual processors to the memories of physical processors, between a plurality of physical processors P(p_0, p_1, ..., p_{N-1}) numbered with a set of N coordinate values and a plurality of virtual processors V(v_0, v_1, ..., v_{M-1}) numbered with a set of M coordinate values, by associating each of the N coordinate axes of the virtual processors with a desired one of the M coordinate axes of the physical processors, wherein, when the virtual processor whose coordinate value in the i direction is v_i is associated with the physical processor whose coordinate value in the j direction is p_j, the virtual processor whose coordinate value in the i direction is v_i + 1 is associated with the physical processor whose coordinate value in the j direction is p_j - 1, p_j, or p_j + 1, and wherein, when the coordinate value of the virtual processor in the i direction is increased one at a time, the sequence formed by the coordinate values of the corresponding physical processors in the j direction contains at least one subsequence that increases by one at each step and at least one subsequence that decreases by one at each step.