JPH01147767A

JPH01147767A - Double cascaded parallel processing system

Info

Publication number: JPH01147767A
Application number: JP30561987A
Authority: JP
Inventors: Ikuo Yoshihara; 郁夫吉原; Akira Muramatsu; 晃村松; Kazuo Nakao; 中尾　和夫
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1987-12-04
Filing date: 1987-12-04
Publication date: 1989-06-09

Abstract

PURPOSE: To improve the utilization rate of the processor elements (PEs) by carrying out plural summation calculations concurrently when sequential arithmetic processing is performed in cascade fashion. CONSTITUTION: A host computer 1 loads programs into the PEs 3, transfers data, performs scalar arithmetic processing, and controls the progress of operations through an array controller 2. The PEs 3 are arrayed L across and M down. To obtain a total, a cascade sum is first taken toward the right for every row and then upward along the rightmost column, so that the total Xsum is obtained on the uppermost PE(L, M). Simultaneously with this calculation, the total Ysum of the data y is obtained. Since the PEs needed for summing the x_i do not overlap the PEs needed for summing the y_i, the calculation of Xsum and the calculation of Ysum can be executed in parallel. Accordingly, the utilization rate of the PEs is roughly doubled by this calculation method.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to an arithmetic processing method using an electronic computer, and in particular to a method for efficiently processing, on a parallel computer consisting of a large number of element processors (PE: Processor Element), operations that refer to data distributed across and stored in the individual PEs.

[Conventional technology]

Inter-processor coupling methods for a parallel computer consisting of a large number of element processors include:
- the lattice coupling method,
- the hypercube (hypercubic lattice) coupling method, and
- the matrix crossbar switch coupling method (see, for example, Japanese Patent Application No. 61-269655).

When a partial differential equation is discretized and solved on a parallel computer, the space to be analyzed is generally divided into multiple subspaces, and each PE processes the grid points contained in one subspace (which holds one or more grid points). It is therefore natural to store the data (the values of the variables) in small portions, distributed across the storage devices attached to the individual PEs. Such data allocation arises not only in solving partial differential equations but also frequently in problems that process large amounts of data, such as image processing.

In the course of a calculation, operations often appear that must be processed sequentially because they refer to data distributed over all or some of the PEs, even though the order of evaluation is commutative. Typical examples are:
- Summation: $X_{\mathrm{sum}} = \sum_i x_i$
- Inner product: $\langle x, y \rangle = \sum_i x_i y_i$
- Maximum value search: $X_{\max} = \max(x_i;\ i = 1, 2, \ldots)$
- Minimum value search: $X_{\min} = \min(x_i;\ i = 1, 2, \ldots)$

Here $\sum_i$ denotes the sum over i. These are referred to collectively as "summation-type calculations".

A typical method for performing summation-type calculations on a parallel computer is the cascade sum; see, for example, R. W. Hockney and C. R. Jesshope, "Parallel Computers" (Kyoritsu Shuppan, 1984), pp. 203-207. It obtains the total by the following procedure: (1) take the partial sums of the data two at a time; (2) take partial sums again, two at a time, of those partial sums. Repeating operation (2) thereafter yields the total.
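The cascade sum procedure can be illustrated with a minimal Python sketch. It simulates the pairwise reduction sequentially on one list; on a parallel machine all additions inside one step would run at the same time. The function name `cascade_sum` and the handling of an unpaired item are assumptions made for illustration, not taken from the patent.

```python
def cascade_sum(values):
    """Pairwise (tree) reduction. Returns (total, number_of_steps)."""
    partial = list(values)
    steps = 0
    while len(partial) > 1:
        # One step: combine adjacent pairs. These additions are independent
        # of each other, so separate PEs could perform them simultaneously.
        combined = [partial[k] + partial[k + 1]
                    for k in range(0, len(partial) - 1, 2)]
        if len(partial) % 2 == 1:
            combined.append(partial[-1])  # unpaired item is carried forward
        partial = combined
        steps += 1
    return (partial[0] if partial else 0), steps


total, steps = cascade_sum(range(1, 9))  # eight data items: 1, 2, ..., 8
print(total, steps)
```

For eight items the sketch needs three steps, against seven additions when the items are added one after another.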

With the cascade sum, the total of N data can be computed in n steps, where $n = \lceil \log_2 N \rceil$ ($\lceil \cdot \rceil$ denotes rounding up, and a step is the operation of taking the partial sum of one pair). Compared with processing everything sequentially, the calculation time is therefore reduced to n / N (/ denotes division), which is efficient. For example, with N = 2^10 = 1024 PEs, the calculation time is reduced to 0.98%.

[Problem that the invention seeks to solve]

However, because the cascade sum repeats the operation of taking partial sums two at a time, many PEs go unused. In the first step half of all PEs are idle, in the second step three quarters are idle, in the third step seven eighths are idle, and so on, so that the PE utilization rate is (N - 1) / (n * N) (* denotes multiplication). For example, with N = 2^10 = 1024 PEs, the PE utilization rate is only 9.99%.
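The figures quoted for N = 1024 (calculation time reduced to 0.98%, utilization rate of 9.99%) can be reproduced with a short Python check. The function names are hypothetical, and the formulas are simply the ones stated in the text: n = ceil(log2 N), time ratio n / N, and utilization (N - 1) / (n * N).

```python
def cascade_steps(n_pe):
    """n = ceil(log2(N)), computed with integers to avoid rounding issues."""
    return (n_pe - 1).bit_length()


def time_ratio(n_pe):
    """Cascade time relative to fully sequential processing: n / N."""
    return cascade_steps(n_pe) / n_pe


def utilization(n_pe):
    """Fraction of PE-steps doing useful additions: (N - 1) / (n * N)."""
    return (n_pe - 1) / (cascade_steps(n_pe) * n_pe)


print(f"{time_ratio(1024):.2%}")   # 0.98%
print(f"{utilization(1024):.2%}")  # 9.99%
```

The utilization formula counts N - 1 useful additions out of n * N available PE-steps, so it is undefined for a single PE (n = 0).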

Because summation-type calculations appear frequently in numerical simulation, an improvement in the utilization rate is desirable.

[Means for solving the problem and operation]

In practical problems, several summation-type calculations are often needed at the same time.

The present invention improves the PE utilization rate by carrying out a plurality of summation-type calculations simultaneously.

For example, to obtain the two totals $X_{\mathrm{sum}} = \sum_i x_i$ and $Y_{\mathrm{sum}} = \sum_i y_i$, the data of neighboring PEs are paired, and the higher-numbered PE of each pair is used to form the partial sums of $X_{\mathrm{sum}} = \sum_i x_i$. Starting from the same pairs of neighboring PEs, the lower-numbered PE of each pair is always used to form $Y_{\mathrm{sum}} = \sum_i y_i$. In this way the calculation of Xsum and the calculation of Ysum can proceed at the same time.
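The pairing rule can be sketched in Python for a single line of PEs. This is an illustrative simulation under the assumption that the number of PEs is a power of two; the name `dual_cascade_1d` and the bookkeeping of busy PEs are not from the patent. At every step the higher-numbered PE of each pair accumulates x while the lower-numbered PE accumulates y, and the sketch asserts that the two sets of adding PEs never overlap.

```python
def dual_cascade_1d(x, y):
    """x[k], y[k] are the data held by PE number k + 1.

    Returns (Xsum, Ysum, busy), where busy[s] is the number of PEs that
    perform an addition in step s + 1.
    """
    n = len(x)
    assert n == len(y) and n > 0 and (n & (n - 1)) == 0  # power of two
    x, y = list(x), list(y)
    busy = []
    stride = 1
    while stride < n:
        x_adders, y_adders = set(), set()
        for base in range(0, n, 2 * stride):
            hi = base + 2 * stride - 1          # higher-numbered PE of the pair
            x[hi] += x[base + stride - 1]       # x flows toward larger numbers
            x_adders.add(hi)
            y[base] += y[base + stride]         # y flows toward smaller numbers
            y_adders.add(base)
        assert x_adders.isdisjoint(y_adders)    # no PE is needed by both sums
        busy.append(len(x_adders) + len(y_adders))
        stride *= 2
    return x[-1], y[0], busy


print(dual_cascade_1d([1, 2, 3, 4], [10, 20, 30, 40]))
```

With four PEs, all four are busy in the first step and two in the second, which is twice the number that a single cascade sum would keep busy.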

[Example]

An embodiment of the present invention is described below with reference to Figs. 1 to 3. Fig. 3 is a configuration diagram of a parallel computer of the matrix crossbar switch coupling type, to which the processing method of the present invention is applied.

The host computer 1, via the array controller 2, loads programs into the PEs 3, transfers data, performs scalar arithmetic processing, and manages the progress of operations. In addition to these functions, the array controller 2 controls data transfer from the PEs 3 to the host computer 1 and from the PEs 3 to the peripheral devices 7 (for example, writing to a magnetic disk). Each PE 3 is an element processor that performs numerical operations and sends data to and receives data from other PEs; it also contains a storage device. The PEs are arranged L = 2^l across and M = 2^m down, N = L * M in total (^ denotes exponentiation; l and m are positive integers). Fig. 3 shows an example configuration with l = m = 2, that is, a parallel computer of 16 PEs. Each PE is identified by a two-dimensional number (i, j), where i = 1, 2, ..., L and j = 1, 2, ..., M.

The row crossbar switch 4 is a data transfer path between the PEs in one horizontal row; it has an input port and an output port for each channel and a degree of parallelism of L. That is, the L PEs belonging to the same row can each receive data at the same time.

A single PE, however, cannot receive data from two or more sources at the same time. The column crossbar switch 5 is a data transfer path between the PEs in one vertical column and has a degree of parallelism of M. That is, the M PEs belonging to the same column can each receive data at the same time. The cluster memory 6 is an external storage device shared by each column of PEs 3. The peripheral devices 7 are input/output devices, external storage devices, and the like.

The following describes how the above parallel computer executes, in parallel, a pair of summation calculations that refer to the data held in the storage device attached to each PE. The data held in the storage device attached to PE(i, j) are written x(i, j) and y(i, j). Each PE holds at most one x and one y. If a PE holds several x or y values, each PE first computes the partial sum of its own data and treats the result as its new x(i, j) or y(i, j). If PE(i, j) holds no data, it is regarded as holding data with the value 0.

The total is treated as $X_{\mathrm{sum}} = \sum_{ij} x(i,j) = \sum_j \left( \sum_i x(i,j) \right)$: the sum over i is taken first, and the sum over j next. Here $\sum_{ij}$ is the double sum over i and j. The sum over i is the sum along a horizontal row of PEs and is called the "row sum". The sum over j is called the "column sum".

To take the row sums, data are first transferred horizontally. Each PE with odd i sends its data x to the PE on its right, and each PE with even i sends its data y to the PE on its left. That is, PE(1, j) sends x(1, j) to PE(2, j), and PE(3, j) sends x(3, j) to PE(4, j); PE(2, j) sends y(2, j) to PE(1, j), and PE(4, j) sends y(4, j) to PE(3, j). Here j ranges over all of 1, 2, 3, 4.

(1) Horizontal step 1: Each PE with even i receives the data x from the PE on its left, adds it to its own x, and sends the result to the PE two places to its right. Each PE with odd i receives y from the PE on its right, adds it to its own y, and sends the result to the PE two places to its left.

That is, PE(2, j) receives x(1, j), computes x(1, j) + x(2, j), and assigns the value to x(2, j). PE(2, j) then sends the value of x(2, j) to PE(4, j). The PEs with i = 3, 4 handle x in the same way.

PE(1, j) receives y(2, j), computes y(1, j) + y(2, j), and assigns the value to y(1, j). PE(3, j) sends the value of y(3, j) to PE(1, j). The PEs with i = 3, 4 handle y in the same way. Here j ranges over all of 1, 2, 3, 4.

(2) Horizontal step 2: As in (1) above, x is sent to the PE on the right and added there, and y is sent to the PE on the left and added there. That is, PE(4, j) computes x(2, j) + x(4, j) and assigns the value to x(4, j). PE(1, j) computes y(1, j) + y(3, j) and assigns the value to y(1, j). Here j ranges over all of 1, 2, 3, 4.

In this way the row sums for x are obtained on the PEs with i = 4, and the row sums for y on the PEs with i = 1. Next, Xsum and Ysum are obtained by taking the column sums of the data at i = 4 and i = 1. To take the column sums, data are first transferred: x(4, 1) is sent to PE(4, 2), x(4, 3) to PE(4, 4), y(1, 2) to PE(1, 1), and y(1, 4) to PE(1, 3).

Taking the column sums also requires the following two steps.

(3) Vertical step 1: PE(4, 2) receives x(4, 1), computes x(4, 1) + x(4, 2), and assigns the value to x(4, 2). PE(4, 4) receives x(4, 3), computes x(4, 3) + x(4, 4), and assigns the value to x(4, 4). PE(4, 2) then sends x(4, 2) to PE(4, 4).

Similarly, PE(1, 1) and PE(1, 3) perform the accumulation for y, and PE(1, 3) sends y(1, 3) to PE(1, 1).

(4) Vertical step 2: PE(4, 4) computes x(4, 2) + x(4, 4) and assigns the value to x(4, 4). PE(1, 1) computes y(1, 1) + y(1, 3) and assigns the value to y(1, 1).

The x(4, 4) obtained in this way is Xsum, and y(1, 1) is Ysum.
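The four steps of the 16-PE example can be written down as an explicit transfer schedule and replayed in Python. This is a sketch for checking the walk-through, not the program of Fig. 2: the table `SCHEDULE`, the function `run_schedule`, and the dictionary representation of the PE data are assumptions made for illustration. Each entry names the sending PE, the receiving PE, and the variable; the receiver adds what it receives to its own copy.

```python
J = (1, 2, 3, 4)
SCHEDULE = [
    # horizontal step 1: x moves right, y moves left, within each pair
    [((1, j), (2, j), "x") for j in J] + [((3, j), (4, j), "x") for j in J]
    + [((2, j), (1, j), "y") for j in J] + [((4, j), (3, j), "y") for j in J],
    # horizontal step 2: partial sums move two places
    [((2, j), (4, j), "x") for j in J] + [((3, j), (1, j), "y") for j in J],
    # vertical step 1: x moves up column 4, y moves down column 1
    [((4, 1), (4, 2), "x"), ((4, 3), (4, 4), "x"),
     ((1, 2), (1, 1), "y"), ((1, 4), (1, 3), "y")],
    # vertical step 2
    [((4, 2), (4, 4), "x"), ((1, 3), (1, 1), "y")],
]


def run_schedule(x, y):
    """x and y map (i, j) to the value held by PE(i, j). Returns (Xsum, Ysum)."""
    data = {"x": dict(x), "y": dict(y)}
    for step in SCHEDULE:
        receivers = [dst for _, dst, _ in step]
        # a PE can receive from only one source at a time
        assert len(receivers) == len(set(receivers))
        # read every outgoing value before adding, to model simultaneous transfer
        incoming = [(dst, var, data[var][src]) for src, dst, var in step]
        for dst, var, value in incoming:
            data[var][dst] += value
    return data["x"][(4, 4)], data["y"][(1, 1)]


x = {(i, j): i + 10 * j for i in J for j in J}
y = {(i, j): i * j for i in J for j in J}
print(run_schedule(x, y))
```

The assertion inside the loop confirms that no PE is asked to receive two values in the same step, which is the condition the crossbar switches impose.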

The order of data transfers in the above calculation is shown in the data path diagram of Fig. 1. The 16 PEs marked with circles in that figure are the PEs of the parallel computer of Fig. 3, rearranged one-dimensionally in a vertical column at the left edge so that the data paths are easier to show. In the same figure, a circle denotes data, an arrow of one style denotes the path of data for x(i, j), and an arrow of a second style denotes the path of data for y(i, j).

The above is the case L = M = 4. In the general case, Xsum and Ysum can be obtained simultaneously in l + m steps as follows. To obtain the total $X_{\mathrm{sum}} = \sum_{ij} x(i,j)$, a cascade sum is first taken toward the right in every row, producing the row sum $X_j = \sum_i x(i,j)$ on the rightmost PE, where j = 1, 2, ..., M. A cascade sum is then taken upward over the data (the row sums $X_j$) on the M PEs of column L, giving the total $X_{\mathrm{sum}} = \sum_j X_j$ on the uppermost PE(L, M).

Simultaneously with this calculation, the total Ysum of the data y(i, j) is obtained. A cascade sum is first taken toward the left in every row, producing the row sum $Y_j = \sum_i y(i,j)$ on the leftmost PE, where j = 1, 2, ..., M. A cascade sum is then taken downward over the row sums on the M PEs of column 1, giving the total $Y_{\mathrm{sum}} = \sum_j Y_j$ on the lowermost PE(1, 1).
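The general L x M procedure can be sketched in Python as two passes of a directed cascade sum: along the rows first, then along one column. The helper names `cascade_toward` and `double_cascade` and the list-of-rows layout are assumptions for illustration; the simulation is sequential and only counts the steps a parallel machine would need.

```python
def cascade_toward(values, toward_high):
    """Cascade sum over a list whose length is a power of two.

    The total ends on the last element if toward_high is True, otherwise on
    the first element. Returns (total, number_of_steps).
    """
    vals = list(values)
    n = len(vals)
    assert n > 0 and (n & (n - 1)) == 0
    stride, steps = 1, 0
    while stride < n:
        for base in range(0, n, 2 * stride):
            if toward_high:
                vals[base + 2 * stride - 1] += vals[base + stride - 1]
            else:
                vals[base] += vals[base + stride]
        stride *= 2
        steps += 1
    return (vals[-1] if toward_high else vals[0]), steps


def double_cascade(x, y):
    """x[j][i] and y[j][i] hold the data of PE(i + 1, j + 1).

    Returns (Xsum, Ysum, total_steps).
    """
    row_x = [cascade_toward(row, True) for row in x]    # row sums land on column L
    row_y = [cascade_toward(row, False) for row in y]   # row sums land on column 1
    h_steps = row_x[0][1]
    xsum, v_steps = cascade_toward([t for t, _ in row_x], True)  # ends on PE(L, M)
    ysum, _ = cascade_toward([t for t, _ in row_y], False)       # ends on PE(1, 1)
    return xsum, ysum, h_steps + v_steps


x = [[i + 8 * j for i in range(8)] for j in range(4)]  # L = 8, M = 4
y = [[1] * 8 for _ in range(4)]
print(double_cascade(x, y))
```

For L = 8 and M = 4 the step count is 3 + 2 = 5, matching l + m.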

The above procedure can be expressed uniformly as a program common to all PEs. Fig. 2 is an example written in a pseudo-Fortran language for parallel computers. The subprograms and variables are explained first. In that figure, MYNUMB(I, J) in statement 0010 is a subprogram that obtains the PE's own number; I is given the horizontal number and J the vertical number. SEND(X, (I, J)), in statement 0030 and elsewhere, is a subprogram that sends the data X to PE(I, J), and RECEIV(X, (I, J)), in statement 0060 and elsewhere, is a subprogram that receives data from PE(I, J) and assigns it to X. FUNC0(I, IMAX), in statement 0040 and elsewhere, is a function subprogram that returns the number of times the remainder is 0 when I is divided by 2, 4, ..., IMAX; this value gives the number of steps up to which the PE itself performs operations when the cascade sum is taken in ascending order. FUNC1(I, IMAX), in statement 0120 and elsewhere, is a function subprogram that returns the number of times the remainder is 1 when I is divided by 2, 4, ..., IMAX; this value gives the number of steps up to which the PE itself performs operations when the cascade sum is taken in descending order. As for the variables, IMAX and JMAX are the numbers of PEs in the horizontal and vertical directions, respectively.
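The two counting functions can be sketched in Python from their verbal definitions. Fig. 2 itself is not legible in this text, so the bodies below are reconstructions under the assumptions that PE numbers start at 1 and that the divisors 2, 4, ..., IMAX are the powers of two up to IMAX; the lowercase names `func0` and `func1` are illustrative.

```python
def func0(i, imax):
    """Count the divisors d in 2, 4, ..., imax for which i % d == 0.

    For the ascending cascade sum this is the number of steps in which
    PE number i performs an addition.
    """
    count, d = 0, 2
    while d <= imax:
        if i % d == 0:
            count += 1
        d *= 2
    return count


def func1(i, imax):
    """Count the divisors d in 2, 4, ..., imax for which i % d == 1.

    For the descending cascade sum this is the number of steps in which
    PE number i performs an addition.
    """
    count, d = 0, 2
    while d <= imax:
        if i % d == 1:
            count += 1
        d *= 2
    return count


print([func0(i, 4) for i in range(1, 5)])  # [0, 1, 0, 2]
print([func1(i, 4) for i in range(1, 5)])  # [2, 0, 1, 0]
```

For a row of four PEs the results agree with the worked example: PE 2 adds once and PE 4 twice for x, while PE 3 adds once and PE 1 twice for y. No PE has a nonzero count in both lists, which is why the two cascades never compete for a PE.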

An outline of the processing follows. First, the PE obtains its own number using MYNUMB (statement 0010).

As in the embodiment based on Fig. 3, the cascade sum is performed first in the horizontal direction and then in the vertical direction. For the horizontal cascade sum, a PE whose I is even executes statements 0030 to 0090. That is, it sends the data y to the PE on its left (statement 0030), and then repeats the operation of receiving data x from a PE to its left and adding it to X, as many times as the count obtained from FUNC0 in statement 0040 (statements 0050 to 0080). After finishing its last addition, each PE sends the value to a PE on its right (statement 0090); the rightmost PE does not send. When all PEs perform this processing simultaneously, the row sum is obtained on the rightmost PE. A PE whose I is odd follows statements 0110 to 0170, performing the cascade sum for y in descending order, so that the row sum is finally obtained on the leftmost PE.

The vertical sums are handled in the same way: the PEs with I = IMAX = L perform the cascade sum of x in ascending order, and the PEs with I = 1 perform the cascade sum of y in descending order, so that Xsum is obtained as x(L, M) and Ysum as y(1, 1).

The data transfers for these two summation calculations are as follows. In steps 1 to l there are L/2, L/2^2, ..., 1 transfers toward the right and likewise L/2, L/2^2, ..., 1 transfers toward the left, with no transfers in the vertical direction. The degree of parallelism of horizontal data transfer is L, and the PEs receiving data are always distinct, so these transfers can take place simultaneously. In steps l + 1 to l + m there are no horizontal transfers; there are M/2, M/2^2, ..., 1 transfers upward and likewise M/2, M/2^2, ..., 1 transfers downward, and these can also take place simultaneously. Moreover, the PEs used for adding x and the PEs used for adding y are always different. It follows that the two summation calculations for x and y can be executed simultaneously.

[Effect of the invention]

In this way, the PEs needed for summing the x_i do not overlap the PEs needed for summing the y_i, so the calculation of Xsum and the calculation of Ysum can be executed in parallel. Compared with the conventional method of performing a single cascade sum, the method of the present invention roughly doubles the PE utilization rate. For example, with N = 2^10 = 1024 PEs, the PE utilization rate improves to 19.9%.

The essence of the present invention is to pair two summation-type calculations and to assign them so that the PEs processing each one are different. It can therefore also be applied to a combination of different kinds of calculation, such as a summation and an inner product, and it remains efficient when applied to a parallel computer whose coupling method differs from that of Fig. 3. These applications are described below.

[Modified example]

1. Not only summations but other operations can likewise be computed in pairs. For example, the inner product $\langle x, x \rangle$ and the inner product $\langle x, y \rangle$, or the maximum value Xmax and the minimum value Xmin, can be obtained simultaneously. The pair may also consist of different kinds of calculation, such as a summation and an inner product, or an inner product and a maximum (minimum) value.

2. The present invention can also be applied easily to parallel computers with other coupling methods. For example, on the lattice-coupled parallel computer illustrated in Fig. 4, paired calculations are possible with the same data paths as on the matrix-crossbar-switch-coupled parallel computer of Fig. 3. In a lattice-coupled parallel computer, however, data can be transferred directly only between PEs that are adjacent in the up, down, left, or right direction; to send data to a non-adjacent PE, the data must pass through intermediate PEs. Consequently, when a transfer from PE(i, j) to PE(i', j') is required at the same time as a transfer from PE(i', j') to PE(i, j), one of the transfers may have to wait. The PEs needed for the arithmetic never overlap, however, so the arithmetic can always be executed in parallel.

3. Calculation by the method of the present invention is also possible on a hypercube-coupled parallel computer. As an example, execution on the hypercube-coupled parallel computer of eight PEs shown in Fig. 5 is described. In that figure, the three-digit number attached to each PE is its PE number in binary. The processor numbers (i, j) that appear when executing on the matrix-crossbar-switch-coupled parallel computer of Fig. 3 are converted to a one-dimensional number by the rule n = i + L * (j - 1) and mapped to the PE numbers of the hypercube-coupled parallel computer of Fig. 5. Paired calculations are then possible in the same way as on the parallel computer of Fig. 3.

As in the previous example, consider obtaining the two totals Xsum and Ysum. The data path for one of the operations, Xsum, is as follows.

(1) Step 1: Among all PEs, data are sent from each PE whose first bit is 0 to the PE whose first bit is 1 (bit positions are counted from the right: first, second, and so on), and the latter PE computes the partial sum.

(2) Step 2: Among the PEs whose first bit is 1, data are sent from each PE whose second bit is 0 to the PE whose second bit is 1, and the latter PE computes the partial sum.

(3) Step 3: Among the PEs whose first and second bits are both 1, data are sent from the PE whose third bit is 0 to the PE whose third bit is 1, and the latter PE computes the partial sum.

In this example the number of PEs is 8 = 2^3, so at step 3 the total is obtained on the highest-numbered PE, number 111 (number 8 when the PEs are numbered 1 to 8 in decimal), and the calculation ends.

The calculation of Ysum, by contrast, proceeds by sending data at each step from the PE whose bit is 1 to the PE whose bit is 0.

(1) Step 1: Among all PEs, data are sent from each PE whose first bit is 1 to the PE whose first bit is 0, and the latter PE computes the partial sum.

(2) Step 2: Among the PEs whose first bit is 0, data are sent from each PE whose second bit is 1 to the PE whose second bit is 0, and the latter PE computes the partial sum.

(3) Step 3: Among the PEs whose first and second bits are both 0, data are sent from the PE whose third bit is 1 to the PE whose third bit is 0, and the latter PE computes the partial sum.

In this way the total is obtained on the lowest-numbered PE, number 000 (number 1 when the PEs are numbered 1 to 8 in decimal).
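The two hypercube data paths can be sketched together in Python using bit operations on 0-based binary PE numbers. The function name `hypercube_dual_cascade` is an assumption for illustration, and the simulation is sequential. At the step for bit k, only PEs whose lower bits are all 1 take part in the Xsum path (data flows from bit k = 0 to bit k = 1), and only PEs whose lower bits are all 0 take part in the Ysum path (data flows from bit k = 1 to bit k = 0).

```python
def hypercube_dual_cascade(x, y):
    """x[p], y[p] are the data of the PE whose binary number is p.

    Returns (Xsum, Ysum); Xsum ends on PE 11...1 and Ysum on PE 00...0.
    """
    n = len(x)
    assert n == len(y) and n > 0 and (n & (n - 1)) == 0  # power of two
    x, y = list(x), list(y)
    dims = n.bit_length() - 1
    for k in range(dims):
        bit = 1 << k
        low_mask = bit - 1                 # the bits below position k
        for p in range(n):
            if p & bit:
                continue                   # visit each pair once, via its bit-k = 0 member
            q = p | bit                    # partner PE across dimension k
            if (p & low_mask) == low_mask:
                x[q] += x[p]               # Xsum path: 0 -> 1
            if (p & low_mask) == 0:
                y[p] += y[q]               # Ysum path: 1 -> 0
    return x[n - 1], y[0]


print(hypercube_dual_cascade([1, 2, 3, 4, 5, 6, 7, 8], [2] * 8))
```

In the first step both conditions hold for every pair, so each link carries x one way and y the other way; from the second step on, the two paths use different pairs of PEs.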

In step 1, the two PEs of a pair, for example PE 000 and PE 001, send the data x and the data y to each other, but because there is only one communication path between them, one of the transfers has to wait. From step 2 onward the sending and receiving do not interfere, and both the arithmetic and the communication can be processed in parallel.

As described above, the present invention allows two summation-type calculations to be executed fully in parallel on the parallel computer of Fig. 3. On the parallel computers of Figs. 4 and 5, part of the communication cannot be processed in parallel, but the arithmetic can be processed fully in parallel, and a corresponding improvement in efficiency can be expected.

[Brief explanation of the drawing]

Fig. 1 is a data path diagram of a calculation using the present invention; Fig. 2 shows an example program for summation calculation on a parallel computer; Fig. 3 is a configuration diagram of a parallel computer of the matrix crossbar switch coupling type; Fig. 4 is a configuration diagram of a parallel computer of the lattice coupling type; and Fig. 5 is a configuration diagram of a parallel computer of the hypercube coupling type.

1: host computer; 2: array controller; 3: element processor; 4: row crossbar switch; 5: column crossbar switch; 6: cluster memory.

[Fig. 2: program listing (statement 0010, MYNUMB(I, J); comment headings for the horizontal cascade sum and the vertical cascade sum)]
[Fig. 3]

Claims

[Claims]

1. A double cascade parallel processing method, characterized in that, on a parallel computer that consists of multiple element processors and permits simultaneous communication between multiple pairs of non-adjacent element processors, when sequential arithmetic processing that refers to data distributed across and stored in the storage devices attached to the element processors is performed in cascade fashion, the element processors that become idle partway through the operation are used to perform another sequential arithmetic process in cascade fashion, whereby multiple sets of sequential arithmetic processing are executed simultaneously.