JP2610500B2

JP2610500B2 - Parallel vector calculator

Info

Publication number: JP2610500B2
Application number: JP27876588A
Authority: JP
Inventors: 達夫野木
Original assignee: 達夫野木
Priority date: 1988-11-02
Filing date: 1988-11-02
Publication date: 1997-05-14
Anticipated expiration: 2012-05-14
Also published as: JPH02123481A

Description

Detailed Description of the Invention

Field of Industrial Application
The present invention relates to a SIMD parallel computer for ultra-high-speed scientific and engineering computation.

Prior Art
Scientific and engineering computation, and in particular simulation that uses a system of partial differential equations as its mathematical model, has long demanded large-capacity, high-speed computing. To meet this demand, array-type multiprocessor parallel computers have been proposed and developed. Representative examples are ILLIAC-IV, DAP, and PAX. These machines consist of a two-dimensional processor array in which, basically, only adjacent processors are connected by buses. When distant processors need to exchange data, the data must be handed from one processor to the next in succession, so control of the transfer is complicated and the transfer is slow, which is a serious drawback. For this reason, various data path networks have been proposed in recent years to allow more varied data transfers. The inventor has also previously proposed, in Patent Application No. 92927 of Showa 54 (1979), "Parallel Processing Computer" (Patent No. 1386622), and Patent Application No. 68044 of Showa 55 (1980), "Parallel Processing Computer" (Patent No. 1404753), a unique MIMD parallel computer with a buffer memory array, designed to process efficiently the universal algorithms used to solve multidimensional simulations. Even in these schemes, however, as the number of processors grows, the number of network bus lines becomes enormous and the lines naturally become longer, which lowers the transfer speed.

Meanwhile, from the standpoint of general-purpose use, the vector pipeline approach, which is an extension of the conventional computer, has developed greatly and is now the mainstream of supercomputers. That approach, however, is nearing the physical limits of the hardware, so large speed gains are hard to obtain, and in many simulation computations the effective speed is not especially high. The demand for large-capacity, high-speed computation is therefore, in the end, driving a trend toward parallelization.

Problem to Be Solved by the Invention
The present invention therefore combines the idea of the parallel computer invented earlier with the vector pipeline approach, which pursues extreme speed in a single unit. The combination mitigates the drawbacks of the earlier invention and aims to provide the most natural parallelization scheme on top of current supercomputer technology. This scheme is expected to be regarded as the leading candidate architecture for next-generation supercomputers.

Means for Solving the Problem (Architecture)
The parallel vector computer of the present invention is used as an attachment to a single general-purpose computer that serves as its front-end processor. Input/output devices, disk storage, a console, and the like are normally connected to the front-end processor. The front-end processor handles data input/output and compilation, passes object code to the computer of the invention, and exchanges control messages with it.

The parallel vector computer of the present invention comes in a single-block form (claim 1) and a composite-block form (claim 2). The former has the following basic components: a control section CB consisting of one control processor CU and a storage device CM for the CU; a processor section PB consisting of N (a fixed integer) vector processors for parallel computation PU[α], α = 1, 2, …, N, together with a scalar processor and a scalar data memory PM[α] associated with each PU; and a main-memory vector bank section MB consisting of a cubic array {MU[i,j,k], i,j,k = 1, 2, …, M} of M³ memory units MU (M = pN, where p is also a natural number), each with a fixed storage capacity. This form is called the (1², 1³)-parallel vector computer. The latter takes the processor section above as one unit and the main memory bank section above as another, and combines an array of L² processor sections {PB[J,K], J,K = 1, 2, …, L} with a cubic array of L³ bank sections {MB[I,J,K], I,J,K = 1, 2, …, L}. This form is called the (L², L³)-parallel vector computer.

The configuration of the (1², 1³)-parallel vector computer is described first. The control processor CU decodes the object code held in CM one instruction at a time and issues the same control signal to all of the vector processors. All vector processors PU[α], α = 1, 2, …, N, run in complete synchronization. Each vector processor has several vector registers and a vector pipeline arithmetic unit in addition to scalar registers and a scalar arithmetic unit. For scalar data, each PU[α] has its own private memory PM[α]. For vector data, the main memory bank section MB is provided. MB consists basically of a three-dimensional array {MU[i,j,k], i,j,k = 1, 2, …, M} of memory units MU of fixed capacity.

In addition, for each number α (α = 1, 2, …, N), PU[α] is connected through one set of high-speed "row" buses (the i direction is called the "row") to the collection, over q = 1, 2, …, p, of the two-dimensional subarrays {MU[i,j,k_q], i,j = 1, 2, …, M} of MUs in layer k_q (= p(α−1) + q). Vector data can be accessed by interleaving in each of the column vector banks of memory units {MU[i,1,k_1], MU[i,2,k_1], …, MU[i,M,k_1], MU[i,1,k_2], MU[i,2,k_2], …, MU[i,M,k_2], …, MU[i,1,k_p], MU[i,2,k_p], …, MU[i,M,k_p]}, i = 1, 2, …, M. That is, for a given value of i, a vector of length pN whose components are the data at the same specified address in the memory units of that column vector bank is accessed at high speed.

Furthermore, the same set of vector processors can also access the same main memory bank section MB through a separate N sets of high-speed "column" buses (the j direction is called the "column"). For each number α (α = 1, 2, …, N), PU[α] is connected through one set of column buses to the collection, over q = 1, 2, …, p, of the two-dimensional subarrays {MU[i_q,j,k], j,k = 1, 2, …, M} in column i_q (= p(α−1) + q), and can make interleaved vector accesses to each of the vertical vector banks of memory units {MU[i_1,j,1], MU[i_1,j,2], …, MU[i_1,j,M], MU[i_2,j,1], MU[i_2,j,2], …, MU[i_2,j,M], …, MU[i_p,j,1], MU[i_p,j,2], …, MU[i_p,j,M]}, j = 1, 2, …, M. That is, for a given value of j, a vector of length pN whose components are the data at the same specified address in the memory units of that row vector bank is accessed at high speed.
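
The two bus systems described above can be illustrated with a short Python sketch. It only enumerates which memory units MU[i,j,k] fall in the bank that PU[α] reaches for a given i (row bus) or a given j (column bus), using the layer and column formula p(α−1) + q from the text. The function names `row_bank` and `column_bank` are hypothetical and chosen for illustration; the sketch says nothing about bus timing or interleaving hardware.

```python
def row_bank(alpha, i, N, p):
    """Memory units reached by PU[alpha] over its row buses for a given i.

    Assumption: indices are 1-based, as in the text, and M = p * N.
    """
    M = p * N
    layers = [p * (alpha - 1) + q for q in range(1, p + 1)]  # k_q for q = 1..p
    return [(i, j, k) for k in layers for j in range(1, M + 1)]


def column_bank(alpha, j, N, p):
    """Memory units reached by PU[alpha] over its column buses for a given j."""
    M = p * N
    columns = [p * (alpha - 1) + q for q in range(1, p + 1)]  # i_q for q = 1..p
    return [(i, j, k) for i in columns for k in range(1, M + 1)]


if __name__ == "__main__":
    # With N = 4 and p = 1, PU[2] serves layer k = 2 (rows) and column i = 2.
    print(row_bank(alpha=2, i=3, N=4, p=1))
    print(column_bank(alpha=2, j=3, N=4, p=1))
```

With p = 1, each PU[α] owns exactly one layer for row access and one column for column access, which is the case used in the later examples.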

Next, the configuration of the (L², L³)-parallel vector computer is described. It is an extension of the (1², 1³)-parallel vector computer and provides a path to still larger scale. Taking the processor section PB (corresponding to 1²) as a unit, L² units are arranged in a square array {PB[J,K], J,K = 1, 2, …, L}; likewise, taking the main-memory vector bank section MB (corresponding to 1³) as a unit, L³ units are arranged in a cubic array {MB[I,J,K], I,J,K = 1, 2, …, L}. The control processor CU on each PB decodes, one instruction at a time, exactly the same object code held in its own CM and drives the PUs it is responsible for, so that the system as a whole also operates in complete synchronization.

Each PB[J,K] is connected through a common set of row buses to the row of main-memory vector banks (MB[1,J,K], MB[2,J,K], …, MB[L,J,K]) corresponding to its number pair [J,K]. PB[J,K] is connected to each MB[I,J,K] in the same way as in the row bus system of the (1², 1³) system; the important point is that the set of row buses is shared. The PBs as a whole can access sets of long vectors in parallel (in the row direction). Likewise, each PB[K,I] is connected through a common set of column buses to the column of main-memory vector banks (MB[I,1,K], MB[I,2,K], …, MB[I,L,K]) corresponding to its number pair [K,I]. Here too, PB[K,I] is connected to each MB[I,J,K] in the same way as in the column bus system of the (1², 1³) system, and the important point is that the set of column buses is shared. The PBs as a whole can access sets of long vectors in parallel (in the column direction).

In short, in both the (1², 1³) system and the (L², L³) system, the basic mode of use is to carry out a computation by alternating, in time-division fashion, between two kinds of parallel operation: row operations, in which one or more PBs compute in parallel while simultaneously accessing the main-memory vector bank (or bank row) through their sets of row buses, and column operations, in which they compute in parallel while simultaneously accessing the main-memory vector bank (or bank column) through their sets of column buses.
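
The connection rule for the composite system can be written out as a small Python sketch. It builds the two wiring tables stated in the text: PB[J,K] shares row buses with MB[I,J,K] for all I, and PB[K,I] shares column buses with MB[I,J,K] for all J. The function names are hypothetical, and the sketch models only which blocks are connected, not the buses themselves.

```python
def row_connections(L):
    """Map each PB[J,K] to the bank row MB[1,J,K] .. MB[L,J,K]."""
    return {
        (J, K): [(I, J, K) for I in range(1, L + 1)]
        for J in range(1, L + 1)
        for K in range(1, L + 1)
    }


def column_connections(L):
    """Map each PB[K,I] to the bank column MB[I,1,K] .. MB[I,L,K]."""
    return {
        (K, I): [(I, J, K) for J in range(1, L + 1)]
        for K in range(1, L + 1)
        for I in range(1, L + 1)
    }


if __name__ == "__main__":
    # Print the wiring for L = 2 (4 processor blocks, 8 memory blocks).
    for pb, banks in sorted(row_connections(2).items()):
        print("row bus    PB", pb, "->", banks)
    for pb, banks in sorted(column_connections(2).items()):
        print("column bus PB", pb, "->", banks)
```

For L = 2 the tables can be compared by hand with the connections listed in the description of FIG. 5.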

Operation (Centering on Data Structures and Access Expressions)
To express this kind of computation, the data structures of the computer are introduced.

A distinctive feature of this computer is the way array data is stored in the main-memory vector bank, and this is explained first by example. Consider the (1², 1³) system with p = 1. N is a fixed number, and the number of PUs equals N. The main-memory vector bank is then an N × N × N array of memory units MU. Consider a typical three-dimensional data array U that would ordinarily be declared as

REAL U(N,N,N)

When the data array {U(I,J,K), I,J,K = 1, 2, …, N} is stored in the main-memory vector bank, each element is placed at the same address in its corresponding memory unit MU[i,j,k]. Three layouts are possible, depending on which of the physical-space coordinate directions K, J, or I is made to correspond to the 'k-axis direction' that is fixed in the main memory hardware. For convenience they are called arrangement a, arrangement b, and arrangement c, and each is written differently:

a (k axis matched to the K direction): U(I,J,/K/)
b (k axis matched to the J direction): U(I,/J/,K)
c (k axis matched to the I direction): U(/I/,J,K)

These are expressions for the stored data array and are called stock expressions. If more than one of these layouts is needed, every kind that is needed must be declared, and each declaration reserves memory in a different address region of the memory units. For example, to use the first two, declare

REAL U(N,N,/N/), U(N,/N/,N)
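
As a minimal sketch of what the three stock expressions imply, the Python fragment below returns, for an element U(I,J,K), the subscript that lies along the hardware k axis under each arrangement. With p = 1 that value is also the number α of the PU whose row buses reach the element, since PU[α] serves layer k = α. The function names are hypothetical, and the sketch deliberately does not say how the other two subscripts map to i and j, because the text does not specify that.

```python
def k_axis_index(arrangement, I, J, K):
    """Return the array subscript placed along the hardware k axis.

    'a' -> K  (stock expression U(I,J,/K/))
    'b' -> J  (stock expression U(I,/J/,K))
    'c' -> I  (stock expression U(/I/,J,K))
    """
    on_k_axis = {"a": K, "b": J, "c": I}
    return on_k_axis[arrangement]


def row_access_pu(arrangement, I, J, K):
    """PU number that reaches U(I,J,K) by row access, assuming p = 1."""
    return k_axis_index(arrangement, I, J, K)


if __name__ == "__main__":
    for arrangement in "abc":
        print(arrangement, row_access_pu(arrangement, I=1, J=2, K=3))
```

The point of the sketch is that the slashes in a stock expression mark the subscript that selects the memory layer, and therefore the processor, under row access.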

Furthermore, as described in the configuration above, this computer can use either the set of row buses or the set of column buses when it processes a data array in parallel. A data expression that distinguishes between these processing modes is provided and is called an access expression. Two access expressions correspond to each of arrangements a, b, and c, so six access expressions can exist in all. For the data array U they are shown in the following table [table not reproduced].

With this computer, data can easily be moved between different stock expressions, which is important in using the computer. For example, {U(I,/J/,K)} can be taken as the values of {U(I,J,/K/)}, {U(/I/,J,K)} as the values of {U(I,/J/,K)}, or {U(I,J,/K/)} as the values of {U(/I/,J,K)}, each by the corresponding editing operation [listings not reproduced]. The reverse of each of these paths is also possible.

FIGS. 1(a), 1(b), and 1(c) show models of the parallel vector computer in arrangements a, b, and c. In these figures, the cube in the center is the main-memory vector bank MB. The three axes fixed to the cubic array of memory units are the i, j, and k axes. For clarity, one face perpendicular to the k axis is taken as the reference face and is hatched. The quadrilateral ABCD represents the row of vector processors PU, and each PU is shown as a line segment parallel to ABCD. Of the two quadrilaterals ABCD, the one in which each slice (PU) lies parallel to the (i,j) plane of the main memory bank represents the row access state, and the one in which each slice lies parallel to the (j,k) plane represents the column access state. The corresponding access expression for the array U is shown beside each figure.

An example of a parallel operation instruction is given as program (I), or in abbreviated form as program (I)′ [listings not reproduced]. It specifies a parallel operation based on row access in arrangement a. The vector processors corresponding to the respective values of K process, in parallel, the operations inside the DO statement on vectors whose components follow the variation of J (the subarrays corresponding to each K and I). Naturally, the data access expressions in the formulas that appear in this PDO paragraph must all be of the same kind, and the subscripts in the first pair of parentheses of every array must all be exactly the same variable names (expressions are not allowed). Because of this condition, an abbreviated form such as (I)′ is possible. Expressions are allowed in the subscripts that appear in the second pair of parentheses.

The example above allowed expressions in the first subscript. To write formulas in which expressions are allowed in the second subscript, processing is done in arrangement b.

Note that program (II) can be written immediately after program (I)′, using the {V(I,J,/K/)} obtained there, because all that is required is a switch from the row access state to the column access state. However, to continue from the {V(I,J,/K/)} obtained in (I)′ with an assignment that allows an expression in the subscript marked with // (here the third one), the data array must be edited beforehand. In this case, one either performs an edit that yields the arrangement-b data {V(I,/J/,K)} and continues from there, or performs an edit that yields the arrangement-c data {V(/I/,J,K)} and continues from there [listings not reproduced].

In the examples so far, the size of the data array (N × N × N) matched the size of the memory unit array of the main memory bank, and the (1², 1³) system with p = 1 was sufficient. In general things are not so convenient, and the size of some dimension is smaller or larger than N. When it is smaller than N, mask operations or shorter vector operations can be applied to that dimension. When it is larger than N, the main memory is used in multiple passes (elements whose subscripts agree modulo N are placed at different addresses of the same unit), and the set of vector processors is likewise used in multiple passes (the passes are processed serially, one at a time). If a large main memory is available, p > 1 is chosen to reduce the memory multiplicity and to process long vectors (of length pN). To raise the degree of parallelism further and reduce the multiplicity of vector processing, the extended (L², L³) system can be used.
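
The modulo-N placement mentioned above can be sketched in Python as follows. The function maps a 1-based subscript to a memory unit number and a "pass" number, so that subscripts that agree modulo N land in the same unit at different addresses. The name `place`, the 1-based convention, and the use of the quotient as the pass number are assumptions made for illustration; the text states only that such elements share a unit and differ in address.

```python
def place(index, N):
    """Map a 1-based subscript to (unit, pass_number).

    unit        : 1..N, the memory unit along this axis
    pass_number : 0, 1, 2, ... selecting a distinct address region in that unit
    """
    unit = (index - 1) % N + 1
    pass_number = (index - 1) // N
    return unit, pass_number


if __name__ == "__main__":
    N = 4
    for index in range(1, 10):
        print(index, place(index, N))
```

Under this assumption a dimension of size 9 on a machine with N = 4 needs three passes, which is the serial multiplicity the text refers to.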

This computer can also handle two-dimensional array data, by mapping it onto four-dimensional array data. This too is explained by example. For simplicity, take the (1², 1³) system with p = 1 and N PUs. Consider a typical two-dimensional array that would ordinarily be declared as

REAL U(N**2,N**2)

In this case there are two kinds of arrangement, a′ and b′, with one access expression for each.

To allocate this two-dimensional array to the three-dimensional array of main memory units, a fixed pair of arrangements is chosen (arrangements a and c, for example) and the mapping is done as follows. The subscripts I and J are written as

I = P*N + I′, J = Q*N + J′

so that I and J are represented by the pairs (P, I′) and (Q, J′). The access expressions U(,/J/)(I) and U(/I/,)(J) are then made to correspond to U(,J′,/Q/)(I) and U(/P/,,I′)(J), respectively. Data editing is expressed as follows. When {U(,J′,/Q/)(I)} is written as the four-dimensional row access array {U(,J′,/Q/)(I′)(P)} (P being the multiplicity subscript), it can be reread as the column access array {U(I′,,/Q/)(J′)(P)}, and the editing operation

{U(I′,,/Q/)(J′)(P)} → {U(/P/,,I′)(J′)(Q)}

is then applied to it. This includes the task of editing vector components whose addresses are N apart, as P varies, into components whose addresses are one apart (each vector can be accessed by interleaving).
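
The index decomposition I = P*N + I′ can be checked with a short Python sketch. The range convention is an assumption: the text gives only the formula, so the sketch takes I in 1..N², I′ in 1..N, and P in 0..N−1, which makes the decomposition unique. The function names `split_index` and `join_index` are hypothetical.

```python
def split_index(I, N):
    """Split a 1-based subscript I (1..N*N) into (P, I_prime).

    Assumed ranges: P in 0..N-1 and I_prime in 1..N, so that I = P*N + I_prime.
    """
    P, remainder = divmod(I - 1, N)
    return P, remainder + 1


def join_index(P, I_prime, N):
    """Inverse of split_index: rebuild I from the pair (P, I_prime)."""
    return P * N + I_prime


if __name__ == "__main__":
    N = 4
    for I in (1, 4, 5, 16):
        P, I_prime = split_index(I, N)
        print(I, "->", (P, I_prime), "->", join_index(P, I_prime, N))
```

The same split applied to J gives the pair (Q, J′), so each element of the N² × N² array is addressed by four subscripts, each of which fits the N-wide hardware.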

Embodiments
A concrete architecture of the parallel vector computer according to the present invention is described with reference to the drawings. In FIG. 2, the basic configuration of the parallel vector computer, shown inside the broken-line frame, consists of a control section CB, a processing section PB, and a main memory section MB. The control section CB is connected by a bus and communication line 1 to a general-purpose front-end computer 2, which therefore handles data input/output, program compilation, and the like. The control section consists of a single control processor CU 3 and a program memory CM 4. Program code sent from the front-end computer is placed in the program memory. CU 3 decodes the code in CM 4 sequentially, performs control operations itself, and sends control signals to the processing section PB over signal line 5. CU 3 thus acts as the gateway for data input/output between the front-end computer and the parallel vector processing unit. The processing section PB contains a scalar arithmetic unit 6 and a vector arithmetic block 7. The scalar arithmetic unit 6 performs scalar operations and scalar data input/output. The vector arithmetic block 7 performs vector pipeline operations and vector data input/output. A scalar memory 8 is connected to the scalar arithmetic unit 6, and a cubic array of vector memory units MU 9 is connected to the vector arithmetic block 7; this array of vector memory units MU constitutes the main memory section MB.

As for arithmetic processing, the machine is of the SIMD type: the scalar arithmetic unit 6 and the whole of the vector arithmetic block 7 operate in complete synchronization under the control signals (instructions) issued by CU 3, and each unit responds only to the instructions it is itself meant to process. The core of the apparatus of the invention is the vector system; the scalar system is no different from that of a conventional computer, so only the vector system is described in detail below. The double lines 10 to 14 downstream of the control section CB are data buses and communication lines.

FIG. 3 shows an outline of an individual vector unit PU (the relevant part of the vector arithmetic block 7). The vector unit PU receives instructions from the control processor CU into register 16 over bus 15, which corresponds to signal line 5 in FIG. 2, and operates the vector processing section 17 and the DMA controller 18 by internal control signals. As outward-facing lines, the PU is connected to the CU by data bus 11, instruction bus 15, and unit selection line 19; as inward-facing lines, it is connected to the vector memory block MB by data buses 13 and 14 and address bus 20.

FIG. 4 is a schematic diagram of the cubic array 21 of memory units MU and its data bus connections to one PU unit 22. The MUs are assumed to be located at the lattice points in the figure. The horizontal rectangular region enclosed by the thick solid frame is taken to be the k-th real board 23, which carries the partial memory array MU(·,·,α) of layer α (the array of black circles ● 24). Similarly, the vertical rectangular region enclosed by the thick broken frame is regarded as the α-th virtual board 25, which carries the partial memory array MU(α,·,·) of column α (the array of white circles ○ 26). Each black circle and each white circle stands for a single memory unit; in the figure, only the intersection of the two boards is enclosed in cubic boxes 27, to show specifically that an overlapping black circle 24 and white circle 26 are one and the same memory unit.

On the real board 23, the row bus 28, drawn as a double solid line, is formed, and it connects each memory unit on the board to PU(α) 22. According to the i-selection address from PU(α), each column vector MU(i,·,α) is accessed by interleaving. In FIG. 4, the units arranged along each branch 29 of the common row bus 28 hold the components of a vector. Which unit in the memory unit array each component occupies is specified by the point address from PU(α) 22.

On the virtual board 25, the column bus, drawn as the double broken line 30, is formed, and it connects each memory unit 26 on that board to PU(α) 22. According to the j-selection address from PU(α), each row vector MU(α,j,·) is accessed by interleaving. In FIG. 4, the units arranged along each branch 31 of the common column bus 30 hold the corresponding vector components at a common point address.

Next, a parallel vector processing apparatus based on the extended system, which may be called the multiple-array form of the parallel vector computer described above, is explained with reference to the drawings.

第５図は、演算装置として、演算ブロック（スカラ演
算ユニット＋ベクトル演算ブロック）PBを２×２＝４ブ
ロック、すなわちPB（J,K）,J,K＝1,2と主記憶装置を２
×２×２＝８ブロック分、すなわちMB（I,J,K）,I,J,K
＝1,2からなる場合の概念的構成を示している。以下で
は、ベクトル系統に着目し、演算ユニットをベクトルプ
ロセッサと呼ぶものとする。図面の左側の（j,k）平面3
2の上に、４つのプロセッサブロック33が描かれてい
る。プロセッサブロック内には棒状で示すＮ個のベクト
ルプロセッサユニット22が配列されている。また、（i,
j,k）空間内の中空に浮かぶ形で描かれている、何枚か
のメモリボード34の集まり35が主記憶装置である。図で
は、複雑さを避けるために、ブロック単位で、最も大き
な番号“α”をもつベクトルプロセッサと、主記憶メモ
リボードをつなぐ行バスだけを実線36で示している。メ
モリボード34上で、櫛の歯状に描かれたものの一本一本
の歯がアクセスされるベクトルの単位を示している。ま
た、同じく最も大きな番号“α”をもつベクトルプロセ
ッサと、主記憶の垂直（仮想的）メモリボードで、ｉの
最も大きな番号のものにつなぐ列バスだけを破線37で示
している。この場合もメモリボード群35内で櫛の歯状に
描かれたものの一本一本の歯がアクセスされるベクトル
の単位である。仮想的というのは、実メモリボードを串
ざす形において、垂直面内で成り立つメモリユニット列
の組を呼ぶものだからである。行バス（実線）について
は、PB（1,1）とMB（1,1,1）,MB（2,1,1）,PB（2,1）と
MB（1,2,1）,MB（2,2,1）,PB（1,2）とMB（1,1,2）,MB
（2,1,2），そしてPB（2,2）とMB（1,2,2）,MB（2,2,
2）がつながる。列バス（破線）については、PB（1,1）
とMB（1,1,1）,MB（1,2,1）,PB（2,1）とMB（1,1,2）,M
B（1,2,2）,PB（1,2）とMB（2,1,1）,MB（2,2,1），そ
してPB（2,2）とMB（2,1,2）,MB（2,2,2）がつながる。
このようにすることで、２倍のベクトル長をもつベクト
ルプロセッサが２倍数組み合わされたような演算装置に
おいて、第２〜４図に示した基本構成の演算装置と機能
的に変わらないで４倍の高速性能をもつ拡張型演算装置
が出来あがる。FIG. 5 shows the conceptual configuration for the case in which the arithmetic unit consists of 2 × 2 = 4 arithmetic blocks (scalar arithmetic unit + vector arithmetic block) PB(J,K), J,K = 1,2, and the main memory consists of 2 × 2 × 2 = 8 blocks MB(I,J,K), I,J,K = 1,2. In the following, the discussion focuses on the vector system, and the arithmetic unit is referred to as a vector processor. On the (j,k) plane 32 at the left of the drawing, four processor blocks 33 are depicted. Within each processor block, N vector processor units 22 are arranged, shown as bars. In the (i,j,k) space, the main storage is drawn as groups 35 of several memory boards 34, shown in outline. To keep the figure simple, only the row bus that connects the vector processor with the largest number α to the main memory boards is drawn, block by block, as solid lines 36. Each tooth of the comb shape drawn on a memory board 34 represents one unit of vector that is accessed. Similarly, broken lines 37 show only the column bus that connects the vector processor with the largest number α to the vertical (virtual) memory board with the largest number i. Here too, each tooth of the comb shape drawn across the memory board group 35 represents one unit of vector that is accessed. The board is called "virtual" because the term refers to a set of memory unit columns lying in a vertical plane that skewers the real memory boards. The row buses (solid lines) connect PB(1,1) with MB(1,1,1) and MB(2,1,1); PB(2,1) with MB(1,2,1) and MB(2,2,1); PB(1,2) with MB(1,1,2) and MB(2,1,2); and PB(2,2) with MB(1,2,2) and MB(2,2,2). The column buses (broken lines) connect PB(1,1) with MB(1,1,1) and MB(1,2,1); PB(2,1) with MB(1,1,2) and MB(1,2,2); PB(1,2) with MB(2,1,1) and MB(2,2,1); and PB(2,2) with MB(2,1,2) and MB(2,2,2). The result is an extended arithmetic unit that behaves like a combination of twice as many vector processors, each with twice the vector length, remains functionally the same as the arithmetic unit of the basic configuration shown in FIGS. 2 to 4, and delivers four times the speed.
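
The eight row-bus and eight column-bus connections listed for FIG. 5 follow a regular index pattern. The Python sketch below is an illustrative reading of those lists, not text from the patent: the function names `row_bus` and `column_bus` are hypothetical, and the rule that PB(J,K) reaches MB(I,J,K) over its row bus and MB(K,I,J) over its column bus (I = 1,2) is inferred from the listed connections.

```python
from itertools import product

L = 2  # blocks per axis in the FIG. 5 example


def row_bus(J, K):
    """Memory blocks MB(I, J, K) reached by PB(J, K) over its row bus."""
    return [(I, J, K) for I in range(1, L + 1)]


def column_bus(J, K):
    """Memory blocks MB(K, I, J) reached by PB(J, K) over its column bus.

    The index swap is an inference from the connection list, not a quoted rule.
    """
    return [(K, I, J) for I in range(1, L + 1)]


for J, K in product(range(1, L + 1), repeat=2):
    print(f"PB({J},{K})  row bus: {row_bus(J, K)}  column bus: {column_bus(J, K)}")
```

Under this reading, every memory block sits on exactly one row bus and exactly one column bus, so the four processor blocks never compete for the same block within either access mode.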

実施例１これは、メモリユニット線M³を規定するＭ＝pNのｐが
１の場合の一例である。第６図に示す通り、この実装例
ではベクトル長16ワード（１ワード64ビット）のベクト
ルパイプライン処理を行うプロセッサPU101と16×16＝2
56ヶのメモリユニット配列（１メモリユニット 64kワ
ード）を１ボード上に乗せ、行バス102をすべてそのボ
ード上に配したものである。そのプロセッサは、同一ボ
ード上の16ヶの16メモリユニット長列103に対してベク
トルアクセスが可能である。また、そのユニットの列の
各々は、それぞれに共通の列バス104につながり、16ヶ
の列バスがボード外につながるエッジ端子をもつ。これ
ら、１ボードあたり16ヶの端子が16ボード分で計256ヶ
と、ベクトルプロセッサの列バス端子16ヶが１枚のマザ
ーボード105に差し込まれる。マザーボード上では列バ
ス配線が行われる。この場合、行バスに比べて列バス配
線長が大きくなるので、アクセスタイムの余裕に差を設
ける必要がある。主記憶容量は全体で256Mワードであ
る。Implementation Example 1: This is an example of the case p = 1 in M = pN, which defines the number of memory units, M³. As shown in FIG. 6, this implementation mounts on one board a processor PU 101, which performs vector pipeline processing with a vector length of 16 words (1 word = 64 bits), and an array of 16 × 16 = 256 memory units (64k words per memory unit), with all row buses 102 laid out on that board. The processor can perform vector access to the 16 columns 103 on the same board, each 16 memory units long. Each of these unit columns is connected to its own common column bus 104, and the 16 column buses have edge terminals that lead off the board. These terminals, 16 per board for 16 boards (256 in total), together with the 16 column bus terminals of the vector processors, are plugged into a single motherboard 105, on which the column bus wiring is made. Because the column bus wiring is longer than the row bus wiring, the access-time margins for the two must be set differently. The total main storage capacity is 256M words.
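
The capacity and terminal counts quoted for this board layout can be checked with a few lines of arithmetic. The sketch below is only a check of the stated figures; the variable names are hypothetical, and it assumes that "64k" means 64 × 1024 words and "M" means 2^20 words.

```python
UNITS_PER_BOARD = 16 * 16       # memory units on one board
BOARDS = 16                     # boards plugged into the motherboard
WORDS_PER_UNIT = 64 * 1024      # assumption: "64k words" = 64 * 1024
COLUMN_BUSES_PER_BOARD = 16     # one edge terminal per column bus

total_units = UNITS_PER_BOARD * BOARDS
total_words = total_units * WORDS_PER_UNIT
memory_terminals = COLUMN_BUSES_PER_BOARD * BOARDS

print("memory units:", total_units)
print("capacity (M words):", total_words // 2**20)
print("memory-board terminals on the motherboard:", memory_terminals)
```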

実装例２この実装例は、同じくｐ＝１の場合の第２の例であ
り、第７図に示す通り、４台のベクトル演算ユニツト20
1とそれぞれの長さ４単位のベクトルメモリユニット202
の16個の配列を同一ボード上にのせたものである。行バ
ス203も列バス204も同一ボード内で配線可となり、主記
憶アクセスに関する問題は軽減される。Implementation Example 2: This is a second example of the case p = 1. As shown in FIG. 7, four vector operation units 201 and an array of 16 vector memory units 202, each 4 units long, are mounted on the same board. Both the row buses 203 and the column buses 204 can be wired within the board, which eases the problems of main memory access.

実装例３この例もまた本発明の基本形における並列ベクトル計
算機であるが、ｐ＝２の場合の一例である。これは、第
８図に示す通り、実装例２と同じ主記憶容量分をもちな
がら、２台のベクトル演算ユニット301でもって、長さ
８単位のベクトル計算を基本とするシステムを示してい
る。すなわち、長さ４単位のベクトルメモリユニット30
2を２つずつ組合せて、長さ８のベクトルを蓄えるよう
にしてある。そのために行バス303と列バス304の接続に
工夫がしてある。Implementation Example 3: This is also a parallel vector computer in the basic form of the present invention, but for the case p = 2. As shown in FIG. 8, it has the same main storage capacity as Implementation Example 2 but uses two vector operation units 301 and is based on vector calculation of length 8 units. That is, vector memory units 302 of length 4 units are combined in pairs to store vectors of length 8. The connections of the row buses 303 and the column buses 304 are specially arranged for this purpose.
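
The relation M = pN ties the processor count to the memory geometry, and it explains why Implementation Examples 2 and 3 have the same storage. The sketch below is a hedged illustration: the helper `geometry` is hypothetical, and taking the vector length of one bank access as p × M is this sketch's reading of the bank sequence in claim 1(d), not a figure stated in the examples themselves.

```python
def geometry(N, p):
    """Derive memory geometry from N processors and the multiplier p."""
    M = p * N                      # memory units per edge of the cubic array
    return {
        "M": M,
        "memory_units": M ** 3,    # size of the cubic array
        "bank_length": p * M,      # assumed vector length of one bank access
    }


examples = {
    "Implementation Example 1": (16, 1),
    "Implementation Example 2": (4, 1),
    "Implementation Example 3": (2, 2),
}
for name, (N, p) in examples.items():
    print(name, geometry(N, p))
```

With these inputs, Examples 2 and 3 both come out at 64 memory units, while the bank length doubles from 4 to 8 when p goes from 1 to 2, in line with the descriptions of FIGS. 7 and 8.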

実装例４これは、いわゆる基本形の拡張形態である（L²,L³）
方式の例であり、第９図に示す通り、単位プロセッサ部
に、単一ベクトルプロセッサをもつｐ＝４相当のものを
用いた（2²,2³）システムである。単一ベクトルプロセ
ッサがベクトルメモリユニット列（ベクトル長、４）の
ものを１ブロックあたり４列分（ベクトル長、16）アク
セスできる。プロセッサ401と402がそれぞれ行バス403
と404を用いて並列的にアクセスする典型的なベクトル
メモリ列（ベクトル長、32）が図でｒと記したものであ
る。同じくプロセッサ401と402がそれぞれ列バス405と4
06を用いて並列的にアクセスする典型的なベクトルメモ
リ列（ベクトル長、32）が図でｃと記したものである。Implementation Example 4: This is an example of the (L², L³) scheme, an extension of the basic form. As shown in FIG. 9, it is a (2², 2³) system whose unit processor section contains a single vector processor and corresponds to p = 4. A single vector processor can access four vector memory unit columns (vector length 4 each) per block, for a vector length of 16. The typical vector memory row (vector length 32) that processors 401 and 402 access in parallel using row buses 403 and 404, respectively, is marked r in the figure. Likewise, the typical vector memory column (vector length 32) that processors 401 and 402 access in parallel using column buses 405 and 406, respectively, is marked c in the figure.

発明の効果（性能評価）現在の技術レベルで十分可能でかつ次世代スーパーコ
ンピュータとして実用化が確実と思われるシステムを提
示し、その性能評価を与える。Effects of the Invention (Performance Evaluation): We present a system that is fully feasible at the current level of technology and appears certain to be put to practical use as a next-generation supercomputer, and we give its performance evaluation.

（１）基本形（1²,1³）方式の並列ベクトル計算機に
おいてＮ＝16,p＝４の場合、単体のベクトル計算機は64
ワード（１ワードは64ビット）長のベクトルレジスタを
複数個保持し、マシンクロックは3.125nsとする。これ
は無限長のベクトルに対して単純な演算を繰り返すとき
に１秒間に320百万回の浮動小数点演算（320MFLOPS）を
こなす能力をもつことを意味する。一方、主記憶ベクト
ルバンク部には長さ64単位の列ベクトルバンク（或は行
ベクトルバンク）を配置し、各要素メモリユニットのア
クセスタイムは200nsとする。（この値は通常レベルの
ものである。CMOS技術などによって電力消費を極力少な
くすることを狙っている。）ｐ＝４ということで、４つ
の列ベクトルバンク（或は行ベクトルバンク）に対して
４つの道（各64ビット幅）を提供する。このとき、主記
憶部ベクトルバンク部の最大バンド幅は1.28Gワードと
なる。(1) In a parallel vector computer of the basic (1², 1³) scheme with N = 16 and p = 4, each vector computer holds several vector registers 64 words long (1 word = 64 bits), and the machine clock is 3.125 ns. This means that it can perform 320 million floating-point operations per second (320 MFLOPS) when repeating a simple operation on vectors of unlimited length. The main storage vector bank section has column vector banks (or row vector banks) 64 units long, and the access time of each element memory unit is 200 ns. (This is an ordinary figure; the aim is to keep power consumption as low as possible by using CMOS technology or the like.) Since p = 4, four paths (each 64 bits wide) are provided to four column vector banks (or row vector banks). The maximum bandwidth of the main storage vector bank section is then 1.28 Gwords per second.
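
The peak rate and memory bandwidth quoted here follow directly from the clock, the bank length, and the access time. The sketch below reproduces that arithmetic with integers; the constant names are hypothetical, and it assumes the pipeline delivers one floating-point result per clock, which is what the 320 MFLOPS figure implies.

```python
CLOCK_PS = 3125       # machine clock of 3.125 ns, expressed in picoseconds
BANK_LENGTH = 64      # words delivered by one vector bank access
ACCESS_NS = 200       # access time of each element memory unit
PATHS = 4             # p = 4 paths, each one 64-bit word wide

# Assumption: one floating-point result per machine clock.
peak_flops = 10**12 // CLOCK_PS

# Each path completes one 64-word bank access every 200 ns.
words_per_second = PATHS * BANK_LENGTH * 10**9 // ACCESS_NS

print(peak_flops // 10**6, "MFLOPS peak per vector computer")
print(words_per_second / 10**9, "Gwords/s maximum bank bandwidth")
```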

単体のベクトル計算機の実効性能をみるためにベクト
ルの成分同志の単純な四則演算、たとえばａ） Ai^(p)＋Bi^(p)→Ci^(p)（ｉ＝1,2,3,…,64）或はｂ） A^(p)＋Bi^(p)→Ci^(p) の形のものをｐ＝1,2,3,4,1,2,3,4,…と繰り返すものと
しよう。このとき64ワード分のメモリアクセスや演算を
一つので示したとき、計算の流れは次のように表わされる。To see the effective performance of a single vector computer, consider simple arithmetic operations between vector components, for example a) A_i^(p) + B_i^(p) → C_i^(p) (i = 1, 2, 3, ..., 64) or b) A^(p) + B_i^(p) → C_i^(p), repeated for p = 1, 2, 3, 4, 1, 2, 3, 4, .... When the memory access or operation for 64 words is represented by a single symbol, the flow of the calculation is expressed as follows.

ここで、Ｌは主記憶バンクからベクトルレジスタへの
ロードを意味し、Ｒは計算結果（Result）を導出する段
階、Ｓはベクトルレジスタからバンクへのストアを意味
する。定常状態で繰り返しが続くところでの実効スピー
ドをみると、ａ）では400nsに64回の浮動演算を行うの
で160MFLOPS、ｂ）では200nsに64回の浮動演算を行うの
で320MFLOPSということになる。 Here, L denotes a load from a main storage bank into a vector register, R denotes the stage that derives the calculation result (Result), and S denotes a store from a vector register to a bank. Looking at the effective speed in the steady state where the repetition continues, a) performs 64 floating-point operations every 400 ns, giving 160 MFLOPS, and b) performs 64 floating-point operations every 200 ns, giving 320 MFLOPS.
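
The two effective speeds are simply 64 operations divided by the steady-state interval per vector. The sketch below checks that conversion; `effective_mflops` is a hypothetical helper, and the 400 ns and 200 ns intervals are taken from the text as given rather than derived from the load/store schedule.

```python
OPS_PER_VECTOR = 64  # one floating-point operation per vector component


def effective_mflops(interval_ns):
    """MFLOPS when OPS_PER_VECTOR operations complete every interval_ns."""
    return OPS_PER_VECTOR * 1000 / interval_ns


case_a = effective_mflops(400)  # a) two vector operands, one vector result
case_b = effective_mflops(200)  # b) one scalar operand, one vector operand

print("a)", case_a, "MFLOPS")
print("b)", case_b, "MFLOPS")
```

Case b) matches the 320 MFLOPS peak of the 3.125 ns clock, so in that pattern the memory side keeps the pipeline fully fed.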

Ｎ＝16台のベクトル計算機からなる並列システム全体
ではａ）2.56GFLOPS ｂ）5.12GFLOPSという実効性能を
うる。For the whole parallel system of N = 16 vector computers, the effective performance is a) 2.56 GFLOPS and b) 5.12 GFLOPS.

（２）拡張形（L²,L³）方式の並列ベクトル計算機
で、上記（１）を単位システムとして、Ｌ＝２の場合の
ものを構成すれば上記ａ）やｂ）の演算について実効ス
ピードは（１）の場合の４倍の性能が得られ、ａ）10.2
4GFLOPS ｂ）20.48GFLOPSとなる。(2) If an extended (L², L³) parallel vector computer with L = 2 is built using the system of (1) above as its unit system, the effective speed for operations a) and b) is four times that of (1): a) 10.24 GFLOPS and b) 20.48 GFLOPS.
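
The system-level figures scale linearly: by N = 16 for the basic form, and by a further L² = 4 processor blocks for the extended form. The sketch below checks both steps in integer MFLOPS; the names are hypothetical, and linear scaling with no synchronization overhead is assumed, as the quoted figures imply.

```python
UNIT_MFLOPS = {"a": 160, "b": 320}  # effective speed of one vector computer
N = 16                              # vector computers in the basic system
L = 2                               # extension factor; L**2 processor blocks

basic = {case: rate * N for case, rate in UNIT_MFLOPS.items()}
extended = {case: rate * L**2 for case, rate in basic.items()}

for case in UNIT_MFLOPS:
    print(f"{case}) basic: {basic[case] / 1000} GFLOPS, "
          f"extended: {extended[case] / 1000} GFLOPS")
```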

これらの性能は、本発明並列ベクトル計算機が次世代
スーパーコンピュータを約束するものであることを示し
ている。These performance figures show that the parallel vector computer of the present invention holds the promise of a next-generation supercomputer.

[Brief description of the drawings]

第１図（ａ）、（ｂ）及び（ｃ）は本発明の並列計算機
における基本構成の３つの論理的配置状態をそれぞれ示
す模式図、第２図はその基本構成のシステム全体を示すブロック
図、第３図は１つのベクトルユニットPUの機能構成を示す線
図、第４図はメモリユニットの立方配列と１つのベクトルユ
ニットとのバス接続の模様を示した模式図、第５図は多数の基本構成ブロックを複合したシステム例
におけるバス接続の模式図、第６図および第７図は基本構成の並列ベクトル計算機
（ｐ＝１の場合）の２つの実装例を示す模式図、第８図は同じくｐ＝２の場合の実装例を示す模式図、第９図は拡張形態における並列ベクトル計算機の実装例
を示す模式図である。１……バス及び通信線２……フロントエンド計算機３……制御用プロセッサ（CU）４……プログラムメモリ（CM）５……信号線６……スカラ演算ユニット７……ベクトル演算ブロック８……スカラメモリ（PM）９……ベクトルメモリユニット（MU） 10〜14……データバス及び通信線 FIGS. 1(a), 1(b) and 1(c) are schematic diagrams showing three logical arrangement states of the basic configuration of the parallel computer of the present invention. FIG. 2 is a block diagram showing the entire system of the basic configuration. FIG. 3 is a diagram showing the functional configuration of one vector unit PU. FIG. 4 is a schematic diagram showing the cubic array of memory units and its bus connections to one vector unit. FIG. 5 is a schematic diagram of the bus connections in an example system that combines many basic configuration blocks. FIGS. 6 and 7 are schematic diagrams showing two implementation examples of the parallel vector computer of the basic configuration (for p = 1). FIG. 8 is a schematic diagram showing an implementation example for p = 2. FIG. 9 is a schematic diagram showing an implementation example of the parallel vector computer in the extended form. DESCRIPTION OF SYMBOLS: 1 ... bus and communication line; 2 ... front-end computer; 3 ... control processor (CU); 4 ... program memory (CM); 5 ... signal line; 6 ... scalar operation unit; 7 ... vector operation block; 8 ... scalar memory (PM); 9 ... vector memory unit (MU); 10 to 14 ... data buses and communication lines.

Claims

(57) [Claims]

1. A high-speed computer used attached to a front-end processor consisting of a general-purpose computer, wherein:
(a) the computer comprises a control unit CB consisting of a control processor CU and a storage device CM for the CU; a processor unit PB consisting of N (a certain natural number) parallel-processing vector processors PU[α] (α = 1, 2, …, N) and scalar memories PM[α] (α = 1, 2, …, N) that hold scalar data for the respective PUs; and a main memory vector bank MB consisting of a cubic array {MU[i, j, k], i, j, k = 1, 2, …, M} of M³ memory units MU, each with a fixed storage capacity (M = pN, where p is also a natural number);
(b) the control processor CU is directly connected to the front-end processor to mediate input and output of data and the like, has an instruction decoding device for decoding a program on the CU storage device CM, and transmits the same control command sequence to all the vector processors PU to direct synchronous parallel operation;
(c) each vector processor PU has a scalar operation device and a fixed-length vector pipeline operation device;
(d) each PU[α] is connected, via one set of high-speed "row" buses (the i direction being the "row"), to the two-dimensional subarrays {MU[i, j, α_q], i, j = 1, 2, …, M} in the i-j plane of the α_q-th (α_q = p(α−1) + q) layer of the MB corresponding to its number, for q = 1, 2, …, p, so that for each of the column vector banks {MU[i, 1, α_1], MU[i, 2, α_1], …, MU[i, M, α_1], MU[i, 1, α_2], MU[i, 2, α_2], …, MU[i, M, α_2], …, MU[i, 1, α_p], MU[i, 2, α_p], …, MU[i, M, α_p]}, i = 1, 2, …, M, each being a sequence of column vectors of memory units (the j direction being the "column"), vector data access by the interleave method, PU[α] ⇔ MU[i, j, α_q], is possible;
(e) in addition, each PU[α] is connected, via another set of high-speed "column" buses (here too, the j direction being the "column"), to the two-dimensional subarrays {MU[α_q, j, k], j, k = 1, 2, …, M} in the i-k plane of the α_q-th (α_q = p(α−1) + q) column, for q = 1, 2, …, p, so that for each of the vertical vector banks {MU[α_1, j, 1], MU[α_1, j, 2], …, MU[α_1, j, M], MU[α_2, j, 1], MU[α_2, j, 2], …, MU[α_2, j, M], …, MU[α_p, j, 1], MU[α_p, j, 2], …, MU[α_p, j, M]}, j = 1, 2, …, M, each being a sequence of vertical vectors of memory units, vector data access by the interleave method, PU[α] ⇔ MU[α_q, j, k], is possible; and
(f) a row operation, in which all the PUs perform parallel operations while simultaneously accessing the MB in parallel through their respective sets of row buses of item (d), and a column operation, in which all the PUs perform parallel operations while simultaneously accessing the MB in parallel through their respective sets of column buses of item (e), are realized in a time-division manner.
A SIMD parallel vector computer characterized by the above.

2. (a) A control unit CB consisting of a control processor CU and a storage device CM for the CU, and a processor unit PB consisting of N (a predetermined natural number) vector processors PU[α] (α = 1, 2, …, N) and scalar memories PM[α] (α = 1, 2, …, N) that hold scalar data for the respective PUs, form a pair, and L² such pairs are arranged as {(CB[J, K], PB[J, K]), J, K = 1, 2, …, L}; and a main memory vector bank MB consisting of a cubic array {MU[i, j, k], i, j, k = 1, 2, …, M} of M³ memory units MU, each with a predetermined storage capacity (M = pN, where p is also a natural number), is taken as a unit, and L³ such units are arranged cubically as {MB[I, J, K], I, J, K = 1, …, L};
(b) the CU in every CB is directly connected to the front-end processor to mediate input and output of data and the like, and has an instruction decoding device that decodes the same program on the CU storage device CM, and all the {CB, PB} pairs perform parallel synchronous operation according to the clock signal of the front-end processor;
(c) each vector processor PU has a scalar operation device and a fixed-length vector pipeline operation device;
(d) the PU[α] belonging to each PB[J, K] can access, via a common set of high-speed "row" buses, the two-dimensional subarrays {MU[i, j, α_q], i, j = 1, 2, …, M} in the i-j plane of the α_q-th (α_q = p(α−1) + q) layer, for q = 1, 2, …, p, of the MB in each of the row {MB[1, J, K], MB[2, J, K], …, MB[L, J, K]} of main storage vector banks corresponding to the number pair [J, K]; for a selected MB, for each of the column vector banks {MU[i, 1, α_1], MU[i, 2, α_1], …, MU[i, M, α_1], MU[i, 1, α_2], MU[i, 2, α_2], …, MU[i, M, α_2], …, MU[i, 1, α_p], MU[i, 2, α_p], …, MU[i, M, α_p]}, i = 1, 2, …, M, each being a sequence of column vectors of the memory units in that MB (the j direction being the "column"), vector data access by the interleave method, PU[α] ⇔ MU[i, j, α_q], is possible; and these vector accesses are performed in parallel by all the PUs across all the PBs;
(e) the PU[α] belonging to each PB[K, I] can access, via a common set of high-speed "column" buses, the two-dimensional subarrays {MU[α_q, j, k], j, k = 1, 2, …, M} in the i-k plane of the α_q-th (α_q = p(α−1) + q) column, for q = 1, 2, …, p, of the MB in each of the column {MB[I, 1, K], MB[I, 2, K], …, MB[I, L, K]} of main storage vector banks corresponding to the number pair [K, I]; for a selected MB, for each of the vertical vector banks {MU[α_1, j, 1], MU[α_1, j, 2], …, MU[α_1, j, M], MU[α_2, j, 1], MU[α_2, j, 2], …, MU[α_2, j, M], …, MU[α_p, j, 1], MU[α_p, j, 2], …, MU[α_p, j, M]}, j = 1, 2, …, M, each being a sequence of vertical vectors of the memory units in that MB (the k direction being "vertical"), vector data access by the interleave method, PU[α] ⇔ MU[α_q, j, k], is possible; and these vector accesses are performed in parallel by all the PUs across all the PBs; and
(f) a row operation, in which all the PBs perform parallel operations while simultaneously accessing the main storage column vector banks through the row buses, and a column operation, in which all the PBs perform parallel operations while simultaneously accessing the main storage vertical vector banks through the column buses, are realized in a time-division manner.
A SIMD parallel vector computer characterized by the above.