JPH0962656A

JPH0962656A - Parallel computers

Info

Publication number: JPH0962656A
Application number: JP7216152A
Authority: JP
Inventors: Satoshi Ito; 聡伊藤
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1995-08-24
Filing date: 1995-08-24
Publication date: 1997-03-07

Abstract

PROBLEM TO BE SOLVED: To accelerate the multiplication of large matrices by computing the elements of the matrix product while transferring column vectors or row vectors of a matrix between adjacent processor elements (PEs). SOLUTION: A controller 3 distributes the data of first and second matrices to PEs 1 to N through a communication route 21 as column-vector and row-vector data. Each PE calculates the inner product of the two vectors distributed to it and returns the result to the controller 3 through the communication route 21 as one element of the matrix product. In addition, so that inner products can be calculated for different pairs of column and row vectors, each of PEs 1 to N transfers the row-vector data stored in a storage means 1b to one of its logically adjacent PEs, to which it is connected by a communication route 22 provided separately from the communication route 21. The PEs then repeat, in sequence, the inner product operation, the transfer of the row vector, and the return of the inner product result. Through these operations, all the elements of the matrix product are gathered in the controller 3 and the matrix product is obtained.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a parallel computer that processes at high speed, among the matrix calculations that appear frequently in scientific and engineering computation, the matrix product in particular.

[0002]

[Prior Art] Most scientific and engineering computation performed today is linear computation. This includes solving large systems of simultaneous linear equations and matrix eigenvalue and eigenvector problems. These representative problems have been studied in great detail, not only on the software side in the form of numerical analysis but also on the hardware side in the form of vector computers.

[0003] The most basic operations in linear computation are matrix operations. Matrix operations mean the four arithmetic operations between matrices, but the one that is particularly important in scientific and engineering computation is the matrix product. Consider computing the matrix product A = B·C, taking as an example the N × N matrices A = (a_ij), B = (b_ij), and C = (c_ij). In an actual computation, the matrix product A = B·C is obtained from equation (1).

[0004]

[Equation 1]

$$a_{ij} = \sum_{k=1}^{N} b_{ik}\, c_{kj} \qquad (1)$$

[0005] The computation expressed by equation (1) requires, roughly, N³ multiply-accumulate operations or, more precisely, N³ multiplications and (N−1)N² additions. That is, if the total operation count of a linear computation is written O(N^m), then m = 3 for the matrix product. However, a very detailed analysis of matrix product computation has shown that schemes with m < 3 exist (see V. Strassen, Numer. Math. 13, 354-356 (1969)). This is very interesting as algorithm research, but in practice the algorithm is complicated, and even though its dependence on the matrix size N is O(N^m) with m < 3, it additionally requires relatively complicated procedures that do not depend on the matrix size, so it offers little practical benefit. It is therefore generally considered better to compute the matrix product as defined.
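The operation counts quoted above can be checked with a short sketch. The function name `matmul_counted` and the list-of-lists matrix representation are illustrative assumptions, not part of the patent; the code simply evaluates the matrix product as defined and tallies multiplications and additions separately.

```python
def matmul_counted(B, C):
    """Compute A = B . C by the definition, counting arithmetic operations.

    B and C are square matrices given as lists of rows (an assumption of
    this sketch). Returns (A, multiplications, additions).
    """
    n = len(B)
    A = [[0] * n for _ in range(n)]
    mults = 0
    adds = 0
    for i in range(n):
        for j in range(n):
            # First term of the sum: one multiplication, no addition.
            acc = B[i][0] * C[0][j]
            mults += 1
            for k in range(1, n):
                # Each further term: one multiplication and one addition.
                acc += B[i][k] * C[k][j]
                mults += 1
                adds += 1
            A[i][j] = acc
    return A, mults, adds
```

For N = 3 this performs 27 multiplications and 18 additions, matching N³ and (N−1)N².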

[0006] When a matrix product is computed on an ordinary vector computer, several points must be taken into account. First, the vector length must be made long in order to use the vector registers as effectively as possible. Second, because the arithmetic units are normally far faster than memory access, consecutive accesses to the same memory must be avoided; that is, bank conflicts must be avoided. FIG. 4 shows, written in the FORTRAN programming language, an example of an algorithm commonly adopted with these considerations in mind.

[0007] The matrix product algorithm shown in FIG. 4 is called the intermediate-product form (see J. J. Dongarra et al., SIAM Rev. 26(1), 91-112 (1984)). The intermediate-product form is known as the method with the fastest memory access. However, the operation count itself is still N³, so if the number of vector registers is M, a single arithmetic unit must perform N³/M calculations. Matrix sizes can range from tens of thousands to millions, whereas the number of vector registers is 32 or 64, or on the order of a few hundred at most, so each arithmetic unit must perform an enormous number of calculations. Moreover, depending on the matrix size, bank conflicts can occur even with the intermediate-product form, and this is an unavoidable problem.
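FIG. 4 is not reproduced in this text, so the following is only a sketch of what such a loop ordering might look like. It assumes that the intermediate-product form corresponds to the ordering in which the innermost loop updates one column of the result by a scaled column of the left factor (a vector update), which in column-major FORTRAN storage gives long, contiguous vector operations. The function name and the Python list representation are illustrative; Python lists do not reproduce the memory-layout benefit, only the loop order.

```python
def matmul_column_update(B, C):
    """Compute A = B . C with a j-k-i loop order (assumed ordering).

    The innermost loop adds c_kj times column k of B to column j of A,
    i.e. it is a vector update of length n.
    """
    n = len(B)
    A = [[0] * n for _ in range(n)]
    for j in range(n):          # column of the result being built
        for k in range(n):      # index of the summation
            c_kj = C[k][j]      # scalar held fixed during the vector update
            for i in range(n):  # runs down column j of A and column k of B
                A[i][j] += B[i][k] * c_kj
    return A
```

The result is identical to the definition in equation (1); only the order in which the N³ multiply-accumulate operations are issued differs.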

[0008] For this reason, the arithmetic units of general-purpose vector computers are engineered to process such enormous computations at very high speed, and recently a single operation has become possible in about 1 nanosecond. Achieving this, however, requires very expensive semiconductor devices and extremely advanced packaging technology.

[0009] However, although matrix calculation is the basis of linear computation and appears in a great many situations in scientific and engineering computation, the only calculation it fundamentally requires is multiply-accumulate. It can therefore be carried out with nothing more than an adder and a multiplier, and no complicated arithmetic mechanism is needed. In other words, a vector computer computes the matrix product, a calculation that appears very frequently, with equipment far more elaborate than is needed to process it at high speed. This has been an unavoidable problem for a machine designed for generality.

[0010] Meanwhile, because the matrix product is a repetition of simple calculations, it is expected that it can be processed at high speed on recent massively parallel computers. In a distributed-memory massively parallel computer, for example, each PE (processor element) is usually equipped with several megabytes of memory, and the data are distributed appropriately across the whole of this distributed memory. The PEs are connected by communication paths, and data are exchanged between PEs over these paths as needed. For the connections between PEs, various networks, such as the omega network, have been proposed in order to exchange data between PEs as fast as possible. In a massively parallel computer, network design is thus a key factor determining performance.

[0011] In addition, by its nature, the key to high-speed operation on a massively parallel computer is to increase data locality. However good the network, communication takes far longer than computation. An operation is performed inside the microprocessor and can therefore be completed in about 10 nanoseconds, whereas exchanging data between PEs in about 10 nanoseconds is difficult. How far communication can be reduced therefore becomes important for using a massively parallel computer efficiently.

[0012] For example, when solving a system of simultaneous linear equations whose coefficient matrix is sparse, the data can be assigned to the distributed memory in a way that preserves locality. In the matrix product, however, by its very definition there is no data locality in the strict sense. If one nevertheless tries to push parallelism to the limit on such a massively parallel computer, the simplest approach is for every PE to hold all the data of matrices B and C, for PE(ij), for example, to perform the calculation shown in equation (2), and for all the matrix elements to be gathered at the end. In principle, the product can then be processed in a computation time of order N.

[0013]

[Equation 2]

$$a_{ij} = \sum_{k=1}^{N} b_{ik}\, c_{kj} \qquad (2)$$

[0014] However, given that N is on the order of tens of thousands to millions, implementing N² PEs is very difficult, and in addition an electronically complex network is needed so that data can be accessed at high speed from any PE. There is also waste in that each PE uses only a part of the data it holds.

[0015] The reason a distributed-memory massively parallel computer must have a very sophisticated network is to ensure its generality as a computer. If the problem to be solved itself has a low degree of parallelism, nothing can be done to improve matters. By contrast, a sophisticated network becomes necessary in order to process at high speed a problem that has a high degree of parallelism but poor data locality. The matrix product, however, is a simple operation, and providing a general-purpose network such as an omega network for it can be said to be unnecessary.

[0016]

[Problems to Be Solved by the Invention] As described above, vector computers and massively parallel computers are candidates for processing at high speed the matrix product operations that appear frequently in scientific and engineering computation. However, as computers for performing an operation that appears as frequently as the matrix product, vector computers have the problem of requiring expensive semiconductor devices and the like, and massively parallel computers have the problem of requiring a complicated network.

[0017] The present invention has been made in view of the above circumstances, and its object is to provide a parallel computer that requires neither expensive semiconductor devices nor a complicated network, yet computes products of large matrices far faster than conventional machines.

[0018]

[Means for Solving the Problems] The present invention (claim 1) is a parallel computer comprising a plurality of processor elements, a control device that distributes data to the processor elements and collects their calculation results, a first communication path connecting the control device to each processor element, and a second communication path, provided separately from the first communication path, connecting logically adjacent processor elements. The control device comprises means for exclusively distributing to the respective processor elements, over the first communication path, a plurality of first vectors obtained by grouping the elements of a first matrix, one operand of the matrix product operation, in one of the vertical and horizontal directions, and for exclusively distributing to the respective processor elements a plurality of second vectors obtained by grouping the elements of a second matrix, the other operand of the matrix product operation, in whichever of the vertical and horizontal directions differs from that of the first vectors. Each processor element comprises: first storage means for storing the first vector; second storage means for storing the second vector; a multiplier that successively multiplies the elements of the first vector stored in the first storage means by the elements of the second vector stored in the second storage means; an adder that accumulates the multiplication results of the multiplier; and control means that stores the transferred first vector in the first storage means, stores the transferred second vector in the second storage means, transfers the accumulated result of the adder to the control device over the first communication path as one element of the matrix product of the first matrix and the second matrix, and transfers the second vector over the second communication path to one of the logically adjacent processor elements.

[0019] The present invention (claim 2) is characterized in that, in the above invention (claim 1), the second storage means is a dual-port memory having a first port and a second port; the second communication path consists of a plurality of dedicated communication paths provided so that, for each processor element, the first port of that processor element's dual-port memory is connected to the second port of the dual-port memory of one of the processor elements logically adjacent to it; and the control means reads the second vector from one of the first and second ports of the dual-port memory and transfers it over the dedicated communication path to one of the logically adjacent processor elements, while writing, through the other of the first and second ports of the dual-port memory, the second vector transferred over another of the dedicated communication paths from the other logically adjacent processor element.

[0020] The present invention (claim 3) is characterized in that, in the above invention (claim 2), each processor element further comprises: a first register that stores the accumulated result of the adder; a second register provided between one input of the multiplier and the first storage means; a third register connected to one port of the dual-port memory constituting the second storage means; a fourth register connected to the other port of that dual-port memory; and a selector that is connected to the other input of the multiplier, to the third register, and, via the second communication path, to the fourth register of the adjacent processor element, and that selectively connects either the other input of the multiplier to the third register or the third register to the fourth register of the adjacent processor element. The control means supplies the first vector data and the second vector data, stored in the first and second storage means respectively, to the multiplier via the second register and via the third register and the selector, respectively; transfers the accumulated result of the adder to the control device via the first register over the first communication path; transfers the second vector stored in the second storage means, via one of (a) the fourth register and (b) the third register and the selector, over the dedicated communication path to one of the logically adjacent processor elements; and stores in the second storage means, via the other of (a) the fourth register and (b) the third register and the selector, the second vector transferred over another dedicated communication path from the other logically adjacent processor element.

[0021] The present invention (claim 4) is characterized in that, in the above invention (claim 1), the second communication path is configured using a bus that logically connects the plurality of processor elements in a ring in at least one direction, and the control means reads the second vector from the second storage means, transfers it over the second communication path to one of the logically adjacent processor elements in accordance with a predetermined order among the plurality of processor elements, and writes to the second storage means the second vector transferred over the second communication path from the other logically adjacent processor element.

[0022] The present invention (claim 5) is characterized in that, in the above invention (claim 4), each processor element further comprises a register that stores the accumulated result of the adder, and the control means transfers the accumulated result of the adder to the control device via the register over the first communication path.

[0023] (Operation) In the present invention (claim 1), the control device first distributes the data of the first matrix and the second matrix for the matrix product calculation to the PEs (processor elements) over the first communication path as vector data in two mutually different directions, that is, as column (vertical) vectors and row (horizontal) vectors. When the first matrix is on the multiplier side and the second matrix is on the multiplicand side, the first vectors, taken from the first matrix, are row vectors and the second vectors, taken from the second matrix, are column vectors. Conversely, when the second matrix is on the multiplier side and the first matrix is on the multiplicand side, the second vectors, taken from the second matrix, are row vectors and the first vectors, taken from the first matrix, are column vectors.

[0024] Each PE calculates the inner product of the two vectors distributed to it and returns the result to the control device over the first communication path as one element of the matrix product of the first and second matrices.

[0025] In addition, so that each PE can calculate inner products for different pairs of first and second vectors (that is, to obtain the other elements of the matrix product), the second vectors are exchanged among the PEs. Specifically, each PE transfers the second vector held in its second storage means to one of its logically adjacent PEs, to which it is connected by the second communication path provided separately from the first communication path.

[0026] Each PE thus repeats in sequence the inner product calculation, the transfer of the second vector, and the return of the inner product result. Through these operations, the inner products of all pairs of first and second vectors, that is, all the elements of the matrix product to be obtained, are collected in the control device, which can then arrange the results appropriately to obtain the matrix product.
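The cycle of inner product, transfer of the second vector, and return of the result can be illustrated with a small sequential simulation. This is a sketch under stated assumptions rather than the patented hardware: the function name `ring_matmul`, the choice that PE p passes its second vector to PE (p+1) mod N, and the `col_id` bookkeeping that tells the controller which element each result belongs to are all hypothetical details introduced for illustration.

```python
def ring_matmul(X, Y):
    """Simulate N PEs computing Z = X . Y, with X of size N x M and Y of size M x N.

    Each PE keeps one row of X for the whole run and holds one column of Y
    at a time; the columns circulate around a logical ring.
    """
    n = len(X)
    m = len(X[0])
    # Distribution by the controller: PE p receives row p of X (first vector)
    # and column p of Y (second vector).
    first = [list(X[p]) for p in range(n)]
    second = [[Y[k][p] for k in range(m)] for p in range(n)]
    # Assumed bookkeeping: which column of Y each PE currently holds.
    col_id = list(range(n))
    Z = [[None] * n for _ in range(n)]
    for _step in range(n):
        # Every PE computes one inner product and returns it to the controller.
        for p in range(n):
            Z[p][col_id[p]] = sum(a * b for a, b in zip(first[p], second[p]))
        # Every PE hands its second vector to one logical neighbour
        # (assumed direction: PE p sends to PE (p + 1) mod n).
        second = [second[(p - 1) % n] for p in range(n)]
        col_id = [col_id[(p - 1) % n] for p in range(n)]
    return Z
```

After N rounds every PE has seen every column exactly once, so all N² elements of the product have been returned. In the real machine the work inside each round happens in all PEs at once; the inner `for p` loop here only models that concurrency sequentially.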

[0027] In the inner product calculation performed in each PE, all the data stored in the PEs are independent, and the time required for multiplication and addition is the same in every PE. Multiplication and addition can therefore be performed completely simultaneously in all PEs. In other words, the degree of parallelization is 100%.

[0028] In addition, since there is no longer any need to synchronize the completion of the inner product calculations, no arbitration procedure for the right to use the first communication path is required, and each PE can begin data transfer as soon as its inner product calculation finishes. That is, unlike conventional massively parallel computers, data locality is also fully preserved.

[0029] Moreover, whenever one PE is transferring data, all PEs are transferring data, so the degree of parallelization is 100% from the standpoint of data transfer as well. For these reasons, the present invention can perform matrix product calculation far faster than conventional machines.

[0030] Furthermore, each PE need only have an adder, a multiplier, and memory. As for the memory in particular, it need only be large enough for the order of the matrices involved; for example, in the product of N × N matrices, memory sufficient to hold two vectors of length N is enough. The very expensive semiconductor devices and extremely advanced packaging technology of conventional vector computers are therefore unnecessary.
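As a rough worked example of this memory requirement, the sketch below compares the per-PE storage of two length-N vectors with the storage needed when every PE holds both N × N matrices in full, as in the scheme of paragraph [0012]. The function names and the default of 8 bytes per element (a 64-bit word) are assumptions made for illustration.

```python
def pe_memory_bytes(n, bytes_per_element=8):
    """Memory one PE needs to hold two vectors of length n."""
    return 2 * n * bytes_per_element


def replicated_memory_bytes(n, bytes_per_element=8):
    """Memory one PE needs if it holds both n x n matrices in full."""
    return 2 * n * n * bytes_per_element
```

With N = 1,000,000 and 8-byte elements, `pe_memory_bytes` gives 16,000,000 bytes per PE, whereas `replicated_memory_bytes` gives 16,000,000,000,000 bytes, a factor of N larger.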

[0031] Also, the second communication path need only be able to send data to a logically adjacent PE, so there is no need for the complicated network used in conventional massively parallel computers to send data to arbitrary PEs.

[0032] Furthermore, because the sequence of operations is fixed, high-speed operation can be maintained without a cache built from high-speed memory devices. In the present invention (claims 2 and 3), the memory involved in each PE's data transfer over the second communication path is a dual-port memory. Hardware that would be needed for data transfer with a single-port memory, such as registers, can therefore be omitted.

[0033] In addition, using dual-port memory means the second communication path need not be shared by all PEs. That is, the second communication path can be configured so that logically adjacent PEs are directly connected in a chain.

[0034] As a result, in each PE's data transfer, the data in its dual-port memory can be written successively to the dual-port memory of the logically adjacent PE while, almost simultaneously, the data in the dual-port memory of the other adjacent PE are read into its own dual-port memory. In this case, however, to prevent conflicting accesses to the same memory cell, data must be written immediately after data are read.

[0035] As a result of the above, data transfer takes place simultaneously in all PEs, so processing speed can be improved. In the present invention (claim 4), the second communication path is configured using a bus that logically connects the plurality of PEs in a ring in at least one direction; even so, no arbitration procedure is required, and transfers can be carried out in a predetermined order among the PEs.

[0036] In the present invention (claims 3 and 5), the result of the vector inner product calculated in each PE is not returned directly to the control device but is first stored in a register, and the control device receives the result held in the first register. The inner product calculation and data transfer in each PE can therefore be carried out separately from, and independently of, the return of the result.

[0037] In this way, the time needed to transfer the results is hidden behind the inner product calculation and does not affect the actual processing time. Since communication generally takes far longer than computation, the present invention improves processing speed by making the time taken to return the results negligible.

[0038]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. (First embodiment) First, a first embodiment of the present invention is described.

[0039] FIG. 1 shows the system configuration of the parallel computer according to this embodiment. As shown in FIG. 1, the parallel computer comprises a control device 3, N PEs (N being an integer of 2 or more) PE1 to PEN, a first communication path 21 connecting the control device 3 to PE1 to PEN, and a second communication path 22 that each PE uses to transfer data to a logically adjacent PE.

[0040] The control device 3 has a control unit 31 that controls PE1 to PEN, and a memory 32 that stores the data of the two matrices for the matrix product calculation (the N × M matrix X on the multiplier side and the M × N matrix Y on the multiplicand side) and the inner product results returned from PE1 to PEN. Based on the data of matrices X and Y stored in the memory 32, the control unit 31 exclusively distributes vector data of M elements each (the N row vectors of matrix X and the N column vectors of matrix Y) to PE1 to PEN, and assembles the matrix product from the inner product results returned from PE1 to PEN (the elements of the N × N matrix Z = X·Y). When distributing the vector data, the control unit 31 preferably also gives each PE an explicit operation instruction.

【００４１】[0041]

PE1, one of the plurality of PEs, has a first distributed memory 1a, a second distributed memory 1b, a multiplier 1c, an adder 1d, and a register 1e. The distributed memories 1a and 1b store the vector data for the inner product calculation, and each is connected to an input of the multiplier 1c. The output of the multiplier 1c is connected to one input of the adder 1d. The other input of the adder 1d receives the adder's own output through a loop circuit 1L, so that the adder 1d accumulates the sum of the outputs of the multiplier 1c. The output of the adder 1d is connected to the register 1e. The multiplier 1c and the adder 1d are about 32 to 64 bits wide, depending on the application.

【００４２】[0042]

The other processor elements PE2 to PEN of this parallel computer have the same configuration as PE1 and include distributed memories 2a to Na and 2b to Nb, multipliers 2c to Nc, adders 2d to Nd, and registers 2e to Ne, respectively.

【００４３】[0043]

The first communication path 21 connects PE1 to PEN with the control device 3. It is used, among other things, to transfer the data for the matrix product calculation from the control device 3 to PE1 to PEN and to transfer the inner product results returned from PE1 to PEN.

【００４４】[0044]

The second communication path 22 is provided so that each PE can transfer data to a logically adjacent PE. In FIG. 1, for example, the m-th processor element PEm is logically adjacent to the (m-1)-th element PEm-1 and the (m+1)-th element PEm+1. As the exceptions, PE1 is logically adjacent to PE2 and PEN, and PEN is logically adjacent to PE1 and PEN-1.
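The ring-shaped adjacency described above can be written down in a few lines. The following is only an illustrative sketch; the function name `ring_neighbors` and the 1-based numbering convention are assumptions made for this example and do not appear in the specification.

```python
def ring_neighbors(m, n):
    """Return the (previous, next) PE numbers for PE m on a ring of n PEs.

    PEs are numbered 1..n, as in the description (hypothetical helper).
    """
    if not 1 <= m <= n:
        raise ValueError("m must be in the range 1..n")
    prev_pe = n if m == 1 else m - 1   # PE1 is adjacent to PEN
    next_pe = 1 if m == n else m + 1   # PEN is adjacent to PE1
    return prev_pe, next_pe


print(ring_neighbors(1, 4))  # (4, 2)
print(ring_neighbors(4, 4))  # (3, 1)
```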

【００４５】[0045]

The communication path 22 only needs to be able to send data from each PE to a logically adjacent PE; it does not have to be a network that can transfer data to an arbitrary PE. The communication path 22 may be able to send data from each PE to both of its logically adjacent PEs, or it may be able to send data from each PE to only one of them.

【００４６】[0046]

The operation inside each PE is controlled by an in-PE control unit (not shown) provided in that PE. The in-PE control unit starts control in response to, for example, an operation instruction from the control device 3, and ends control once it has passed the last (N-th) result to the control device 3.

【００４７】[0047]

The basic operation of PEm (m = 1 to N) is outlined as follows. First, the multiplier mc successively multiplies the corresponding elements of the two vectors that were transferred from the control device 3 and stored in the distributed memories ma and mb. The products are passed one after another to the adder md, which is provided with the loop circuit mL; there they are accumulated to give the inner product of the two vectors.
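The multiply-accumulate behavior of a single PE can be modeled as below. This is a minimal software sketch, not the hardware itself; the name `pe_inner_product` and the use of Python lists for the distributed memories are assumptions made for illustration.

```python
def pe_inner_product(mem_a, mem_b):
    """Model one PE: multiply element pairs and accumulate the products.

    mem_a, mem_b stand in for the distributed memories ma and mb
    (hypothetical representation as Python lists).
    """
    if len(mem_a) != len(mem_b):
        raise ValueError("both vectors must have the same length")
    acc = 0                      # adder md, fed back through loop circuit mL
    for x, y in zip(mem_a, mem_b):
        product = x * y          # multiplier mc
        acc = acc + product      # cumulative addition in adder md
    return acc                   # value that would be latched in register me


print(pe_inner_product([1, 2, 3], [4, 5, 6]))  # 32
```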

【００４８】[0048]

The result of the inner product calculation is stored temporarily in the register me and then returned to the control device 3. At the same time, PEm passes one of its two vectors to the adjacent PEm+1 (or PEm-1).

【００４９】[0049]

Each PE then computes the inner product of its new pair of vectors in the same way, returns the result to the control device 3, and passes one of the vectors to the adjacent PE.

【００５０】[0050]

In this way, each PE repeats N times the inner product calculation of two vectors, the transfer of the result to the control device 3, and the transfer of one vector to the adjacent PE. In the end, all elements of the matrix product are collected in the control device 3.

【００５１】[0051]

Next, the operation by which the parallel computer of this embodiment obtains a matrix product will be described with reference to the flowchart shown in FIG. 2. The following takes as an example the case of N × N square matrices A = (a_ij), B = (b_ij), and C = (c_ij), where the matrix product A = B·C is to be obtained.

【００５２】[0052]

First, the elements of the matrices B and C for the matrix product calculation are stored in the memory 32 of the control device 3 (step S1). Next, the control device 3 distributes the data stored in the memory 32 to PE1 to PEN via the communication path 21, for example as follows (step S2).

【００５３】[0053]

The first distributed memory 1a and the second distributed memory 1b of PE1 store the row vector {b_1k} of matrix B and the column vector {c_k1} of matrix C, respectively. The first distributed memory 2a and the second distributed memory 2b of PE2 store the row vector {b_2k} and the column vector {c_k2}, respectively. The first distributed memory 3a and the second distributed memory 3b of PE3 store the row vector {b_3k} and the column vector {c_k3}, respectively. The first distributed memory Na and the second distributed memory Nb of PEN store the row vector {b_Nk} and the column vector {c_kN}, respectively. In general, the first distributed memory ma and the second distributed memory mb of PEm (m = 1 to N) store the row vector {b_mk} and the column vector {c_km}, respectively, where k = 1, 2, …, N.
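The regular distribution of step S2 can be sketched as follows. The function name `distribute` and the dictionary keys `"a"` and `"b"` (standing for the first and second distributed memories) are assumptions chosen for this illustration; matrices are represented as lists of rows.

```python
def distribute(B, C):
    """Give PE m (0-based here) row m of B and column m of C.

    Returns one dict per PE; "a" models distributed memory ma and
    "b" models distributed memory mb (hypothetical representation).
    """
    n = len(B)
    pes = []
    for m in range(n):
        row_of_b = list(B[m])                     # {b_mk}, k = 1..N
        col_of_c = [C[k][m] for k in range(n)]    # {c_km}, k = 1..N
        pes.append({"a": row_of_b, "b": col_of_c})
    return pes


pes = distribute([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(pes[0])  # {'a': [1, 2], 'b': [5, 7]}
```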

【００５４】[0054]

Conversely, the column vector {c_km} of matrix C may be stored in the first distributed memory ma of PEm (m = 1 to N) and the row vector {b_mk} in the second distributed memory mb.

【００５５】[0055]

The distribution of row vectors and column vectors to the PEs need not follow a regular pattern as above, as long as each distribution is exclusive. Whichever row vector and column vector is given to whichever PE, all elements of the matrix product are eventually obtained.

【００５６】[0056]

After the data has been stored as described above, the in-PE control unit of each PE initializes a variable CNT, which records the number of calculations, to 0 (step S3). Next, the in-PE control unit of each PE compares CNT with the number of data N (step S4) and, if CNT ≥ N does not hold, executes steps S5 to S9 below. By repeating this comparison and performing the following operations as long as CNT < N, PE1 to PEN obtain the elements of the matrix product A one after another. When CNT ≥ N is satisfied in step S4, all elements of the matrix product A are present in the control device 3, and the processing ends. Using the variable CNT for the termination decision is only one example; other methods may be used.

【００５７】[0057]

If CNT ≥ N does not hold in step S4, PE1 to PEN each compute a vector inner product from the vector data {b_1k} to {b_Nk} and {c_k1} to {c_kN} distributed to them (step S5). That is, PEm (m = 1 to N) computes the inner product given by equation (3) from the two vectors {b_mk} and {c_km} stored in the distributed memories ma and mb.

【００５８】[0058]

【数３】 (Equation 3)

$$a_{mm} = \sum_{k=1}^{N} b_{mk}\, c_{km} \qquad (3)$$

【００５９】[0059]

Taking PE1 as an example, the multiplier 1c first multiplies b_11 and c_11, stored in the distributed memories 1a and 1b, and sends the product to the adder 1d. Next, the product of b_12 and c_21 is computed in the same way and sent to the adder 1d, which adds it to the previous result. Likewise, up to b_1N and c_N1, every pair of vector elements stored in the distributed memories 1a and 1b is multiplied and accumulated in the adder 1d. When all the data stored in the distributed memories 1a and 1b has been processed, the adder 1d stores the result in the register 1e. The result stored in the register 1e is the element a_11 of the matrix product A.

【００６０】[0060]

The inner product calculation is performed in every PE in the same way as in PE1. The registers 1e to Ne of PE1 to PEN then hold the elements a_11 to a_NN, which are the diagonal elements of the matrix product A.

【００６１】[0061]

Since the time required for multiplication and addition is the same in every PE, all the diagonal elements of the matrix product A are obtained at the same moment when all PEs operate in parallel. In that case, the diagonal elements of the matrix product A can be obtained in N clocks.

【００６２】[0062]

The elements a_11 to a_NN stored in the registers 1e to Ne as described above are returned to the control device 3 via the communication path 21 (step S6). The control device 3 stores the returned diagonal elements a_11 to a_NN at the appropriate addresses in the memory 32 for the elements a_11 to a_NN of the matrix product A (step S7).

【００６３】[0063]

In this embodiment, steps S6 and S7 can be executed independently of steps S4 and S5 and of steps S8 and S9 in the subsequent rounds.

【００６４】[0064]

When the inner product calculation for one combination of vectors is finished as described above, each of PE1 to PEN transfers one of the two pieces of data held in its distributed memories 1a to Na and 1b to Nb to a logically adjacent PE via the communication path 22. Here, it is assumed that, of the two, the data in the distributed memories 1b to Nb is transferred. In this case, each PE operates, for example, as follows (step S8).

【００６５】[0065]

PE1 keeps the row vector {b_1k} in the distributed memory 1a and transfers the column vector {c_k1} in the distributed memory 1b to PEN. PE2 keeps the row vector {b_2k} in the distributed memory 2a and transfers the column vector {c_k2} in the distributed memory 2b to PE1. PEm keeps the row vector {b_mk} in the distributed memory ma and transfers the column vector {c_km} in the distributed memory mb to PEm-1. PEN keeps the row vector {b_Nk} in the distributed memory Na and transfers the column vector {c_kN} in the distributed memory Nb to PEN-1.

【００６６】[0066]

In short, PEm (m = 1 to N) keeps the row vector {b_mk} in the distributed memory ma and transfers the column vector {c_km} in the distributed memory mb to PEm-1. If the communication path 22 is of a type that transfers data sequentially, the transfer can be completed in 2 clocks.
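The transfer of step S8 amounts to a cyclic shift of the column vectors around the ring. A minimal sketch follows, assuming the column vectors of PE1 to PEN are kept in a Python list in PE order; the name `shift_columns` is hypothetical.

```python
def shift_columns(cols):
    """Model step S8: PEm sends its column vector to PEm-1, PE1 sends to PEN.

    cols[i] is the column vector held by PE(i+1). After the shift,
    each PE holds the vector that its higher-numbered neighbor held.
    """
    return cols[1:] + cols[:1]


print(shift_columns(["c1", "c2", "c3"]))  # ['c2', 'c3', 'c1']
```

Applying the shift N times returns every column vector to its starting PE, which is why N rounds are enough to pair each row vector with every column vector.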

【００６７】[0067]

In the example above the data of PEm is transferred to PEm-1, but the transfer may go in the opposite direction, that is, from PEm to PEm+1. In this case, the data of PEN is transferred to PE1.

【００６８】[0068]

Next, the in-PE control unit of each PE that has transferred its data regards the inner product calculation for one combination of vectors, among the data transferred from the control device 3, as finished and increments its variable CNT by 1 (step S9).

【００６９】[0069]

As described above, each PE repeats steps S5 to S9 until CNT ≥ N is satisfied in step S4. As a result, the inner product is computed for every combination of a row vector and a column vector in the data transferred from the control device 3.

【００７０】[0070]

That is, the first round yields the elements a_mm (m = 1 to N) of the matrix product A. After the data transfer to the adjacent PEs, in the second round PEm (m = 1 to N) performs the calculation given by equation (4). Here the index m+1 is taken as (m+1) mod N, so that PEN uses column 1 (a remainder of 0 being read as N).

【００７１】[0071]

【数４】 (Equation 4)

$$a_{m\,m+1} = \sum_{k=1}^{N} b_{mk}\, c_{k\,m+1} \qquad (4)$$

【００７２】[0072]

The result of this calculation is the element a_{m,m+1} of the matrix product A (that is, a_12, a_23, …). In the same way, the subsequent rounds yield the elements a_{m,m+2}, a_{m,m+3}, …, a_{m,m+N-1} of the matrix product A.
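Which element of A each PE produces in each round can be expressed as a small index calculation. This sketch assumes rounds are counted from 0 (the first round) and uses the hypothetical name `element_index`; the wrap-around is written so that the column index always stays in the range 1 to N.

```python
def element_index(m, r, n):
    """Return the 1-based (row, column) of the element PE m computes in round r.

    r = 0 is the first round (diagonal elements); the column index
    wraps around from n back to 1.
    """
    return m, (m - 1 + r) % n + 1


print(element_index(1, 1, 3))  # (1, 2)
print(element_index(3, 1, 3))  # (3, 1)
```

Over rounds 0 to N-1 and PEs 1 to N, every (row, column) pair occurs exactly once, which matches the statement that N rounds produce all N² elements.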

【００７３】[0073]

Since the results are transferred to the control device 3 one after another, by the time CNT ≥ N is satisfied in step S4 all elements of the matrix A have been stored in the memory 32 of the control device 3.
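The whole procedure of steps S2 to S9 can be simulated sequentially to check that it yields the full matrix product. The following is a software sketch under stated assumptions, not the claimed hardware: the loop over `m` stands in for PEs that operate in parallel, the list `col_ids` (tracking which column each PE currently holds) is bookkeeping added for the simulation, and the name `ring_matmul` is hypothetical.

```python
def ring_matmul(B, C):
    """Simulate the ring scheme for A = B * C with N x N matrices (lists of rows)."""
    n = len(B)
    rows = [list(B[m]) for m in range(n)]                    # memory ma of each PE
    cols = [[C[k][m] for k in range(n)] for m in range(n)]   # memory mb of each PE
    col_ids = list(range(n))      # simulation bookkeeping: column held by each PE
    A = [[0] * n for _ in range(n)]                          # memory 32 of the controller
    cnt = 0                                                  # step S3
    while cnt < n:                                           # step S4
        for m in range(n):        # in hardware, all PEs do this at once
            inner = sum(x * y for x, y in zip(rows[m], cols[m]))  # step S5
            A[m][col_ids[m]] = inner                         # steps S6 and S7
        cols = cols[1:] + cols[:1]                           # step S8: pass to PEm-1
        col_ids = col_ids[1:] + col_ids[:1]
        cnt += 1                                             # step S9
    return A


print(ring_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```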

【００７４】[0074]

The above is one example of how the parallel computer of this embodiment obtains a matrix product. Next, consider the time this embodiment requires to compute the matrix product of N × N matrices.

【００７５】[0075]

In general, the multiply-accumulate work in the matrix product of N × N matrices amounts to N³ multiplications and (N-1)N² additions in total. In this embodiment the N PEs operate simultaneously, so that N² multiplications and (N-1)N additions are executed in parallel on each PE. In other words, a degree of parallelism of 100% is obtained. Data locality is also fully preserved, and whenever data is transferred to the nearest PE, all PEs are transferring data, so the degree of parallelism is 100% from the viewpoint of data transfer as well.

【００７６】[0076]

Multiplication and addition, the basic operations of the matrix product, always finish in a fixed number of clocks. Here each multiplication and each addition is assumed to take 2 clocks. The addition itself can be executed in 1 clock, and if the parallel computer is provided with a multiplier the multiplication itself can also be executed in 1 clock, but at least 2 clocks are needed once the data fetch is included. As described above, this embodiment performs N² multiplications and (N-1)N additions, so the number of clocks required for multiplication and addition is N² × 2 + (N-1)N × 2.

【００７７】[0077]

The time required for data transfer in each PE is assumed to be 2 clocks, because each PE performs a write and a read. Since the time each PE needs for multiplication and addition is fixed, the end of the inner product calculation need not be synchronized across the PEs. As long as the control unit 31 of the control device 3 has issued the start instruction to every PE and all PEs started the inner product operation at the same moment, each PE may begin its data transfer as soon as its own inner product calculation is finished.

【００７８】[0078]

The inner products computed by the PEs are returned to the control device 3 over the communication path 21. Since these data are all independent, transmitting them simultaneously is possible in principle. However, if each datum is 32 or 64 bits wide and the number of PEs is on the order of 10⁴, implementing simultaneous transfer is difficult in practice. This embodiment therefore adopts the same configuration as an ordinary bus. Because the processing to be done in each PE is fixed, no arbitration procedure for the right to use the communication path 21 is needed; the PEs simply transfer their data in turn, starting from a given PE, in the same manner as, for example, a DMA (Direct Memory Access) transfer. In this case, taking data from one PE into the control device 3 via the communication path 21 takes 2 clocks, for the address and the data together. Since there are N data, 2N clocks are required in all. However, this transfer is unrelated to the operation of the multiplier and adder of each PE, so by providing the registers 1e to Ne in the PEs, the data can be transferred while each PE is processing an inner product. This time is therefore hidden behind the inner product processing and does not affect the processing time.

【００７９】[0079]

Sending data to an adjacent PE, on the other hand, takes 2 clocks. Since data is transferred to the adjacent PE N times, 2N clocks are required in total. In this embodiment the communication path 22 also has the same configuration as an ordinary bus, but, as with the communication path 21, no arbitration of the right to use it is needed; the PEs simply transfer data in turn, starting from a given PE.

【００８０】[0080]

Consequently, the number of clocks required in this embodiment for the matrix product of N × N matrices is (multiplication and addition) + (data transfer) = N² × 2 + (N-1)N × 2 + 2 × N = 4N² clocks.
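The clock count can be checked numerically. This sketch simply adds up the three terms under the specification's stated assumptions (2 clocks per multiplication, 2 per addition, 2 per neighbor transfer, with the return of results hidden behind computation); the function name `total_clocks` is hypothetical.

```python
def total_clocks(n):
    """Clock count for an N x N matrix product under the stated assumptions."""
    multiply = 2 * n * n          # N^2 multiplications, 2 clocks each
    add = 2 * (n - 1) * n         # (N-1)N additions, 2 clocks each
    transfer = 2 * n              # N neighbor transfers, 2 clocks each
    return multiply + add + transfer


for n in (2, 10, 1000):
    assert total_clocks(n) == 4 * n * n   # the three terms sum to 4N^2
print(total_clocks(10))  # 400
```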

【００８１】[0081]

Many recent standard microprocessors have a clock frequency of about 100 MHz. Assuming that the clock frequency in this embodiment is 100 MHz, that is, a clock period of 10 nanoseconds, the matrix product for N = 10⁵ can be computed in 400 seconds. Converted into floating-point operations per unit time, this is 5 TFLOPS (5 × 10¹² FLOPS). In other words, whereas the fastest existing supercomputers deliver about 10 GFLOPS of execution performance, this embodiment can achieve a processing speed at least 100 times higher.
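The 400-second and 5 TFLOPS figures follow from the 4N² clock count and the operation counts given earlier (N³ multiplications and (N-1)N² additions). The sketch below reproduces the arithmetic; the function name `estimate_performance` is an assumption made for this example.

```python
def estimate_performance(n, clock_hz):
    """Return (seconds, floating-point operations per second) for an N x N product."""
    clocks = 4 * n * n                      # total clocks from the analysis above
    seconds = clocks / clock_hz
    flop = n**3 + (n - 1) * n**2            # multiplications plus additions
    return seconds, flop / seconds


seconds, flops = estimate_performance(10**5, 100 * 10**6)
print(seconds)        # 400.0
print(flops / 1e12)   # close to 5, i.e. about 5 TFLOPS
```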

【００８２】[0082]

Further, according to this embodiment, each PE needs only basic elements such as storage elements, a multiplier, and an adder, so the very expensive semiconductor devices and highly advanced packaging technology required by conventional vector computers are unnecessary.

【００８３】[0083]

The distributed memory that stores the vector data only has to hold a vector whose length matches the size of the matrices, for example a vector of length N for the matrix product of N × N matrices. It therefore need not have a large capacity as in ordinary massively parallel computers.

【００８４】[0084]

The communication path for transferring the vector data only needs to be able to send data to a logically adjacent PE, so there is no need for the complex networks used in conventional massively parallel computers to send data to arbitrary PEs.

【００８５】[0085]

Since the operation sequence is fixed, high-speed operation can be maintained without a cache built from high-speed storage elements.

(Second Embodiment)

Next, a second embodiment of the present invention will be described with reference to FIG. 3. In this embodiment, parts corresponding to those in FIG. 1 are given the same reference numerals, and the description focuses on the differences from the first embodiment.

【００８６】[0086]

FIG. 3 shows the system configuration of the parallel computer according to this embodiment. As shown in FIG. 3, in this parallel computer one of the two distributed memories of the first embodiment (the memories 1b to Nb used for data transfer to adjacent PEs) is implemented as dual-port memory (1b2 to Nb2 in FIG. 3), and adjacent PEs are directly connected.

【００８７】[0087]

In this embodiment, in each of PE1 to PEN, second registers 1f to Nf are connected between the first distributed memories 1a to Na and the multipliers 1c to Nc, and third registers 1g to Ng and selectors 1h to Nh are connected between one port of the second distributed memories 1b2 to Nb2 and the multipliers 1c to Nc. In addition, fourth registers 1i to Ni are connected to the other port of the second distributed memories 1b2 to Nb2.

【００８８】[0088]

During the inner product operation, the second registers 1f to Nf and the third registers 1g to Ng hold data from the first distributed memories 1a to Na and the second distributed memories 1b2 to Nb2, respectively, and the data is passed to the multipliers 1c to Nc through them. During data transfer between adjacent PEs, the third registers 1g to Ng and the fourth registers 1i to Ni temporarily hold the data read from the second distributed memories 1b2 to Nb2 and the data to be written to them.

【００８９】[0089]

The selectors 1h to Nh are switched by the in-PE control unit (not shown) to control the flow of data within each PE. Specifically, the selector mh of each PEm connects the multiplier mc to the register mg during the inner product calculation of step S5 in FIG. 2, and connects the register mg to the register (m-1)i of the logically adjacent PEm-1 during the data transfer to the adjacent PE of step S8 in FIG. 2.

【００９０】[0090]

As an example of data transfer to an adjacent PE, consider PE1. First, data is read from the second distributed memory 1b2 and stored temporarily in the transfer register 1i (the fourth register). It is then transferred to PE2 via the communication path 22. Meanwhile, the data held in the transfer register Ni of PEN arrives at PE1 via the communication path 22. This data is sent through the selector 1h to the third register 1g and then written into the second distributed memory 1b2. The transfer of data to PE2 and the transfer of data from PEN take place simultaneously. The same data transfer is carried out simultaneously in all the other PEs.
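The register-based transfer of the second embodiment can be modeled as a two-phase shift in which every PE first latches its outgoing data and only then receives from its neighbor, so that no vector is overwritten before it has been sent. This is a simplified sketch: whole vectors are moved at once rather than element by element, and the names `simultaneous_shift`, `reg_i`, and `reg_g` are chosen here to mirror the registers mi and mg.

```python
def simultaneous_shift(mem_b):
    """Model the transfer in which PEm receives from PEm-1 (PE1 from PEN).

    mem_b[i] is the vector in the second distributed memory of PE(i+1).
    """
    n = len(mem_b)
    # Phase 1: every PE reads its memory into its transfer register mi.
    reg_i = [list(vec) for vec in mem_b]
    # Phase 2: every PE latches the neighbor's register into its register mg.
    reg_g = [reg_i[(m - 1) % n] for m in range(n)]
    # Phase 3: every PE writes register mg back into its own memory.
    return [list(vec) for vec in reg_g]


print(simultaneous_shift([[1], [2], [3]]))  # [[3], [1], [2]]
```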

【００９１】[0091]

The data may also be transferred in the opposite direction, in which case the roles of the register mg and the register mi in the transfer to the adjacent PE are reversed. During the inner product calculation of the vector data, on the other hand, data is sent to the multipliers 1c to Nc from the first distributed memories 1a to Na via the second registers 1f to Nf, and from the second distributed memories 1b2 to Nb2 via the third registers 1g to Ng and the selectors 1h to Nh.

【００９２】[0092]

The parallel computer of the second embodiment described above performs the same processing as in FIG. 2, and the result of the matrix product is stored in the memory 32 of the control device 3. With this configuration, the hardware (for example, registers) needed for swapping data when single-port memory is used for each PE's data transfer, as in the first embodiment, can be omitted.

【００９３】[0093]

Also, using dual-port memory for the second distributed memories 1b2 to Nb2 makes it possible to connect logically adjacent PEs directly to one another in a chain. As a result, the sequence of reading the data of the second distributed memory, transferring it to one logically adjacent PE, and writing the data received from the other adjacent PE into the PE's own second distributed memory can be carried out simultaneously in all PEs.

【００９４】[0094]

Furthermore, the data of the second distributed memories 1b2 to Nb2 is stored in the registers 1i to Ni in advance; the data in the registers 1i to Ni is then transferred to the logically adjacent PE and held temporarily in the registers 1g to Ng. The data read and the data write of each PE's second distributed memory can therefore be performed almost simultaneously. Needless to say, to avoid simultaneous access to the same memory cell of the second distributed memory, the write is performed immediately after the data has been read.

[0095] As described above, according to this embodiment, data transfer between adjacent PEs can be performed at higher speed, and the processing speed of the matrix product operation can be further improved. The embodiments above describe the case of obtaining the product of two N × N square matrices. The present invention can of course also be applied to the product of non-square matrices, such as the product of an N × M matrix and an M × N matrix, provided that the number of row vectors of the multiplier matrix equals the number of column vectors of the multiplicand matrix and the number of column vectors of the multiplier matrix equals the number of row vectors of the multiplicand matrix.
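The non-square case can be checked with a small software model. The following Python sketch simulates N processor elements computing the product of an N × M matrix and an M × N matrix by keeping one vector fixed in each PE and passing the other around the ring. It is a minimal sketch under stated assumptions, not the disclosed circuit: the name `ring_matmul`, the choice to keep the rows of the first matrix fixed and circulate the columns of the second, and the transfer direction are illustrative choices.

```python
def ring_matmul(a, b):
    """Multiply an N x M matrix a by an M x N matrix b on N simulated PEs."""
    n, m = len(a), len(a[0])
    if any(len(row) != m for row in a) or len(b) != m or any(len(row) != n for row in b):
        raise ValueError("expected an N x M matrix and an M x N matrix")
    # Assumption: PE i keeps row i of a in its first memory for the whole run.
    first = [list(a[i]) for i in range(n)]
    # Assumption: PE i starts with column i of b in its second memory.
    second = [[b[k][i] for k in range(m)] for i in range(n)]
    held = list(range(n))  # held[i]: index of the column of b now in PE i
    c = [[0] * n for _ in range(n)]
    for _ in range(n):
        for i in range(n):
            # Multiplier and adder: accumulate one inner product of length M.
            acc = 0
            for x, y in zip(first[i], second[i]):
                acc += x * y
            # The result is returned to the controller as one product element.
            c[i][held[i]] = acc
        # Every PE passes its second vector to the next PE in the ring.
        second = [second[(i - 1) % n] for i in range(n)]
        held = [held[(i - 1) % n] for i in range(n)]
    return c
```

With `a = [[1, 2, 3], [4, 5, 6]]` and `b = [[7, 8], [9, 10], [11, 12]]`, two simulated PEs fill the 2 × 2 product in two steps. The vector length M affects only the length of each inner product, while the number of PEs and the number of ring steps are both N, which is why the condition in the text constrains the vector counts and not M.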

[0096] In each of the embodiments described above, data transfer in each PE is performed over the communication path 22 without passing through the PE's arithmetic unit, which includes the multiplier and the adder. It is also possible to omit the communication path 22 and transfer the data through the arithmetic unit of each PE over the ordinary bus 21. In that case, however, two clocks are required for each data exchange, and because the exchanged data is the source data for the inner product, this time cannot be hidden behind the computation. The total computation time for the matrix product therefore becomes 6N² − 2N clocks. In other words, communication accounts for slightly more than 1/3 of the total computation time.
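The "slightly more than 1/3" figure can be reproduced with a short calculation. The Python sketch below assumes that the communication cost in the bus-only variant is 2N² clocks, that is, N exchanges of N elements at 2 clocks per element. This breakdown is an assumption made here because it is consistent with the stated total and the stated share; the function names are illustrative.

```python
from fractions import Fraction


def bus_only_clocks(n):
    """Total clocks stated in the text for the bus-only variant: 6N^2 - 2N."""
    return 6 * n * n - 2 * n


def communication_share(n, clocks_per_element=2):
    """Share of the total time spent on communication.

    Assumption (not stated explicitly in the text): N exchanges of N
    elements each, at clocks_per_element clocks per element.
    """
    communication = clocks_per_element * n * n
    return Fraction(communication, bus_only_clocks(n))
```

Under this assumption the share is 2N² / (6N² − 2N) = 1 / (3 − 1/N), which exceeds 1/3 for every N and approaches 1/3 as N grows. For N = 100 the function returns 100/299.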

[0097] In each of the embodiments described above, the PEs are built as dedicated hardware. However, a PE only needs to be able to perform multiplication and addition and to have logically independent memories for holding two vectors. It is therefore also possible, for example, to treat commercially available personal computers or EWSs (engineering workstations) as the individual PEs and configure hardware dedicated to matrix products. The present invention is not limited to the embodiments described above and can be implemented with various modifications within its technical scope.

【００９８】[0098]

[Effect of the Invention] According to the present invention, to perform the matrix product calculations that appear frequently in scientific and engineering computation, each processor element is provided with an adder and a multiplier of relatively simple construction, and the processor elements calculate the elements of the matrix product in a distributed and sequential manner while passing either the column vectors or the row vectors of the source matrices around by transfer between adjacent processor elements. Matrix products can therefore be calculated far faster than before, without the expensive semiconductor devices used in vector computers and without the networks used in massively parallel computers, which require complex and advanced packaging technology.

[Brief description of drawings]

FIG. 1 is a diagram showing the system configuration of a parallel computer according to a first embodiment of the present invention.

FIG. 2 is a diagram showing an operation flowchart of the parallel computer according to the first embodiment of the present invention.

FIG. 3 is a diagram showing the system configuration of a parallel computer according to a second embodiment of the present invention.

FIG. 4 is a diagram showing an example of a matrix product calculation algorithm in a conventional vector computer.

[Explanation of symbols]

1 to N ... PE (processor element)
1a to Na, 1b to Nb ... distributed memory
1b2 to Nb2 ... dual-port memory
1c to Nc ... multiplier
1d to Nd ... adder
1L to NL ... loop circuit
1e to Ne ... register
1f to Nf, 1g to Ng, 1i to Ni ... register
1h to Nh ... selector
21, 22, 23_1 to 23_N ... communication path (bus)
3 ... control device
31 ... control unit
32 ... memory

Claims

[Claims]

1. A parallel computer comprising a plurality of processor elements, a control device that distributes data to each processor element and collects operation results, a first communication path connecting the control device and each processor element, and a second communication path, provided separately from the first communication path, connecting logically adjacent processor elements, wherein the control device comprises means for, when a matrix product calculation is performed, using the first communication path to transfer to the processor elements, exclusively of one another, a plurality of first vectors formed by grouping the elements of a first matrix along one of the vertical direction and the horizontal direction, and a plurality of second vectors formed by grouping the elements of a second matrix along whichever of the vertical direction and the horizontal direction differs from that of the first vectors; and each of the processor elements comprises first storage means for storing the first vector, second storage means for storing the second vector, a multiplier for sequentially multiplying each element of the first vector stored in the first storage means by each element of the second vector stored in the second storage means, an adder for cumulatively adding the multiplication results of the multiplier, and control means for performing control to store the transferred first vector in the first storage means, store the transferred second vector in the second storage means, transfer the cumulative addition result of the adder to the control device through the first communication path as one element of the matrix product of the first matrix and the second matrix, and transfer the second vector through the second communication path to the processor element that is logically adjacent on one side.

2. The parallel computer according to claim 1, wherein the second storage means is configured using a dual-port memory having a first port and a second port; the second communication path comprises a plurality of dedicated communication paths, provided one for each processor element, each connecting the first port of the dual-port memory of one processor element to the second port of the dual-port memory of one of the processor elements logically adjacent to that processor element; and the control means performs control to read the second vector from one of the first port and the second port of the dual-port memory, transfer the second vector through the dedicated communication path to the one logically adjacent processor element, and write the second vector transferred through another dedicated communication path from the other logically adjacent processor element through the other of the first port and the second port of the dual-port memory.

3. The parallel computer according to claim 2, wherein each of the processor elements comprises a first register for storing the cumulative addition result of the adder; a second register provided between one input terminal of the multiplier and the first storage means; a third register connected to one port of the dual-port memory constituting the second storage means; a fourth register connected to the other port of the dual-port memory constituting the second storage means; and a selector that is connected to the other input terminal of the multiplier, to the third register, and, via the second communication path, to the fourth register of the adjacent processor element, and that selectively connects the other input terminal of the multiplier to the third register, or the third register to the fourth register of the adjacent processor element; and the control means performs control to supply the first vector data and the second vector data stored in the first storage means and the second storage means to the multiplier via the second register and via the third register and the selector, respectively; to transfer the cumulative addition result of the adder to the control device through the first communication path via the first register; to transfer the second vector stored in the second storage means, via one of the fourth register and the combination of the third register and the selector, through the dedicated communication path to the one logically adjacent processor element; and to store the second vector transferred through another dedicated communication path from the other logically adjacent processor element in the second storage means via the other of the fourth register and the combination of the third register and the selector.

4. The parallel computer according to claim 1, wherein the second communication path is configured using a bus that logically connects the plurality of processor elements in a ring in at least one direction, and the control means performs control to read the second vector from the second storage means, transfer the second vector through the second communication path to the one logically adjacent processor element in accordance with a predetermined order among the plurality of processor elements, and write the second vector transferred from the other logically adjacent processor element into the second storage means.

5. The parallel computer according to claim 4, wherein each of the processor elements further comprises a register for storing the cumulative addition result of the adder, and the control means performs control to transfer the cumulative addition result of the adder to the control device through the first communication path via the register.